Solr has a number of Autocomplete implementations that are great for most purposes. However, a client of mine recently had some fairly specific requirements for Autocomplete:
1. Phrase-based substring matching
2. Out-of-order matches ('foo bar' should match 'the bar is foo')
3. Fallback matching to a secondary field when substring matching on the primary field fails, e.g., 'windstopper jac' doesn't match anything on the 'title' field, but matches on the 'category' field
The most direct way to model this would probably have been to create a separate Solr core and use n-gram plus shingles indexing, along with Solr queries, to obtain results. However, because the index was fairly small, I decided to go with an in-memory approach.
The general strategy was:
1. For each entry in the primary field, create n-gram tokens, adding entries to a Guava Table where the row key is the n-gram, the column key is the original string, and the value is a distance score.
2. For each entry in the secondary field, create n-gram tokens and add entries to a Guava Multimap where key is n-gram and value is term.
3. When an Autocomplete query is received, split it by space, then do look-ups against the primary table.
4. If no matches are found, look up the tokens in the secondary Multimap.
5. Return results.
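The steps above can be sketched roughly as follows. This is a minimal, illustrative version, not the client implementation: the class and method names are mine, "n-grams" here are all substrings of each whitespace token, and the score is a placeholder (1 over 1 plus the gram's offset in the string).

```java
import com.google.common.collect.HashBasedTable;
import com.google.common.collect.HashMultimap;
import com.google.common.collect.SetMultimap;
import com.google.common.collect.Table;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch of the in-memory autocomplete index described above.
public class AutocompleteIndex {

    // row key = n-gram, column key = original string, value = distance score
    private final Table<String, String, Double> primary = HashBasedTable.create();
    // key = n-gram, value = original term from the secondary field
    private final SetMultimap<String, String> secondary = HashMultimap.create();

    // All substrings of each whitespace-separated token.
    private static Set<String> ngrams(String text) {
        Set<String> grams = new HashSet<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            for (int i = 0; i < token.length(); i++) {
                for (int j = i + 1; j <= token.length(); j++) {
                    grams.add(token.substring(i, j));
                }
            }
        }
        return grams;
    }

    public void addPrimary(String entry) {
        String lower = entry.toLowerCase();
        for (String gram : ngrams(entry)) {
            // Placeholder score: grams nearer the start of the string score higher.
            double score = 1.0 / (1 + lower.indexOf(gram));
            primary.put(gram, entry, score);
        }
    }

    public void addSecondary(String entry) {
        for (String gram : ngrams(entry)) {
            secondary.put(gram, entry);
        }
    }

    // Split the query on spaces; every token must match, in any order.
    public Set<String> query(String q) {
        String[] tokens = q.toLowerCase().trim().split("\\s+");
        Set<String> results = matchAll(tokens, true);
        if (results.isEmpty()) {
            results = matchAll(tokens, false); // fall back to the secondary field
        }
        return results;
    }

    private Set<String> matchAll(String[] tokens, boolean usePrimary) {
        Set<String> results = null;
        for (String token : tokens) {
            Set<String> hits = usePrimary
                    ? primary.row(token).keySet()
                    : secondary.get(token);
            if (results == null) {
                results = new HashSet<>(hits);
            } else {
                results.retainAll(hits); // intersect: out-of-order AND semantics
            }
        }
        return results == null ? Collections.<String>emptySet() : results;
    }
}
```

Intersecting the per-token hit sets is what gives the out-of-order behavior: 'foo bar' matches 'the bar is foo' because each query token independently hits a gram of the indexed string.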
The scoring for the primary table was simple, based on the length of the word and the distance of the token from the start of the string.
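One way to sketch such a score (the exact weights here are my assumption, not the original formula): reward longer matched tokens and decay with distance from the start of the string.

```java
// Illustrative scoring sketch, not the original formula:
// longer tokens score higher; tokens nearer the start score higher.
public final class Scoring {
    private Scoring() {}

    public static double score(String token, int positionInString) {
        // token.length() rewards more specific matches;
        // 1/(1+position) decays with offset from the start.
        return token.length() * (1.0 / (1 + positionInString));
    }
}
```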