scalability - Develop a distributed full-text search index (AKA inverted index)
I know how to develop a simple inverted index on a single machine. In short, it is a standard hash table kept in memory where the key is a word and the value is a list of word locations. For an example, see the code here: http://rosettacode.org/wiki/inverted_index#java
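To make the single-machine starting point concrete, here is a minimal sketch in Python (not the Rosetta Code version). The function name `build_index`, the sample documents, and the choice to store locations as `(doc_id, position)` pairs are assumptions made for illustration only.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to a list of (doc_id, position) locations."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index


# Hypothetical sample documents.
docs = {
    "d1": "it is what it is",
    "d2": "what is it",
}
index = build_index(docs)
```

Looking up `index["what"]` then returns every location of that word across both documents. Everything lives in one process, which is exactly the limitation the question is about.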
Question:
Now I'm trying to make it distributed among n nodes and, in turn:
- make the index horizontally scalable
- apply automatic sharding to the index.
I'm especially interested in automatic sharding. Any ideas or links are welcome!
Thanks.
Sharding is by itself quite a complex task that is not completely solved in modern DBs. Typical problems in distributed DBs are the CAP theorem and some other low-level and quite challenging tasks, such as rebalancing the cluster's data after adding a new blank node or after a naturally occurring imbalance in the data.
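One common way to limit how much data has to move when a blank node joins is consistent hashing. The sketch below is only an illustration of that idea for assigning index terms to nodes; the class `HashRing`, the node names, the use of MD5, and the 64 virtual nodes per machine are all assumptions, not the implementation of any particular database.

```python
import bisect
import hashlib


def _hash(key):
    # MD5 is used only to spread keys evenly, not for security.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Assigns terms to nodes by position on a hash ring."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node owns several points so that load is spread more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, term):
        # The first point clockwise from the term's hash owns the term.
        idx = bisect.bisect_right(self._ring, (_hash(term), ""))
        return self._ring[idx % len(self._ring)][1]


terms = [f"term{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {t: ring.node_for(t) for t in terms}

ring.add_node("node-d")  # a new blank node joins the cluster
after = {t: ring.node_for(t) for t in terms}
moved = [t for t in terms if before[t] != after[t]]
```

Every term in `moved` ends up on `node-d`; terms that stay put are untouched, so only the new node's share of the posting lists has to be copied. A real system still has to handle replication, node failure, and hot terms, which this sketch ignores.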
The best data distribution implemented in a DB that I've seen is in Cassandra. However, full-text search is not yet implemented in Cassandra, so you might consider building your distributed index on top of it.
Some other already implemented options are Elasticsearch and SolrCloud. In the example given, one important detail is missing: word stemming. With word stemming you basically search for any form of a word, such as "sing", "sings", "singer". Lucene and the two previous solutions have it implemented for the majority of languages.
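To show what stemming changes in practice, here is a deliberately naive suffix-stripping sketch. It is not the algorithm Lucene uses; the suffix list, the minimum stem length of 3, and the name `naive_stem` are assumptions chosen only so that the forms of "sing" collapse to one index key.

```python
# Longer suffixes are listed first so "ers" is tried before "er" and "s".
SUFFIXES = ("ers", "er", "ing", "s")


def naive_stem(word):
    """Strip one common English suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {w: naive_stem(w) for w in ["sing", "sings", "singer", "Singing"]}
```

Applying the same function to both indexed words and query words means a search for "sings" also finds documents containing "singer". A production stemmer handles far more cases and languages, so for real use it is safer to rely on an existing analyzer.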