
The New Spell Checker in Solr 4.0


One of the new features that will be introduced in Solr 4.0 is a new SpellChecker implementation that doesn’t require its own index. I decided to take a quick look at it and share my thoughts.

What We Have Today

As of today (Solr 3.6), we can use the following SpellChecker implementations:

  • org.apache.solr.spelling.IndexBasedSpellChecker
  • org.apache.solr.spelling.FileBasedSpellChecker

With the upcoming Solr 4.0, we will get a new implementation:

  • org.apache.solr.spelling.DirectSolrSpellChecker


Current Problems

In most of the cases I worked with, the main problem with IndexBasedSpellChecker was the need to rebuild its index. In some cases the rebuild took a long time, so it wasn’t possible to rebuild the index after every commit, which for some users was a big issue. FileBasedSpellChecker didn’t have this problem, of course, but in my case it was used only as a supporting mechanism for IndexBasedSpellChecker.
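
For comparison, a Solr 3.x IndexBasedSpellChecker definition might look roughly like the sketch below. The dictionary name, the field, and the directory path are assumptions chosen for illustration; the point is the separate spellcheck index and the rebuild it needs.

```xml
<!-- Sketch only: the name, field and directory values are illustrative assumptions -->
<lst name="spellchecker">
  <str name="name">default</str>
  <str name="classname">solr.IndexBasedSpellChecker</str>
  <str name="field">title</str>
  <!-- Separate on-disk index that holds the spellcheck dictionary -->
  <str name="spellcheckIndexDir">./spellchecker</str>
  <!-- Rebuild the dictionary after every commit; this is the costly step described above -->
  <str name="buildOnCommit">true</str>
</lst>
```

With buildOnCommit turned off, the dictionary has to be rebuilt explicitly (for example with the spellcheck.build=true request parameter), and until then the suggestions lag behind the main index.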

Configuration

DirectSolrSpellChecker configuration is similar to what you are used to in Solr 3.x, with some additional parameters. Here is a sample configuration:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textTitle</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">title</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.7</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
    <float name="thresholdTokenFrequency">.01</float>
  </lst>
</searchComponent>

And the meaning of each parameter:

  • queryAnalyzerFieldType – name of the field type whose analyzer will be used to analyze the SpellChecker query.
  • field – field whose contents will be used to build SpellChecker results.
  • classname – SpellChecker implementation class.
  • distanceMeasure – algorithm used to calculate the distance between terms; in our case we use the default, internal one (Levenshtein).
  • accuracy – minimum accuracy a suggestion must reach to be considered valid.
  • maxEdits – maximum number of edits allowed during term enumeration. This property can be set to 1 or 2.
  • minPrefix – minimum length of the prefix a candidate term must share with the query term during term enumeration.
  • maxInspections – maximum number of candidate terms inspected for each suggestion.
  • minQueryLength – minimum length of a query term for it to be considered for correction.
  • maxQueryFrequency – maximum fraction of documents in which a query term can appear and still be considered for correction (0.01 means 1%).
  • thresholdTokenFrequency – minimum fraction of documents in which a suggestion has to appear in order to be considered valid (.01 means 1%).


The attributes above show that DirectSolrSpellChecker gives us a high degree of control over its behavior.

Usage

Using DirectSolrSpellChecker is no different from using the other SpellChecker implementations. As before, you can configure Solr to add SpellChecker results to every query response, or configure a separate handler and decide when to query it. We wrote about how to use SpellChecker in the past, in the “Car sale application” example.
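
As a rough sketch of the second option, a dedicated handler could be defined along these lines. The handler name /spell and the default values are assumptions chosen for illustration; the component name spellcheck and the dictionary name default match the sample configuration above.

```xml
<!-- Sketch only: the handler name and the default values are illustrative assumptions -->
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <!-- Turn spellchecking on for requests sent to this handler -->
    <str name="spellcheck">true</str>
    <!-- Must match the "name" of the spellchecker defined in the search component -->
    <str name="spellcheck.dictionary">default</str>
    <!-- Return at most five suggestions per misspelled term -->
    <str name="spellcheck.count">5</str>
  </lst>
  <!-- Run the spellcheck component after the standard search components -->
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

Keeping spellchecking on its own handler means regular queries don’t pay for it, and the application can call /spell only when it actually needs suggestions.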

What Can We Expect?

According to the JIRA issue LUCENE-2507, DirectSolrSpellChecker will not only remove the need for a separate index, but will also improve the quality of suggestions. From what you can see in that issue, DirectSolrSpellChecker gives better results than the previous implementations, although it is slightly slower. I don’t think that will be an issue as long as you don’t use SpellChecker with every query.




Published at DZone with permission of Rafał Kuć, DZone MVB. See the original article here.

