
Solution for multi-term synonyms in Lucene/Solr using the Auto Phrasing TokenFilter

Originally written by Ted Sullivan.

In a previous blog post, I introduced the AutoPhrasingTokenFilter. This filter is designed to recognize noun-phrases that represent a single entity or ‘thing’.  In this post, I show how the use of this filter combined with a Synonym Filter configured to take advantage of auto phrasing, can help to solve an ongoing problem in Lucene/Solr – how to deal with multi-term synonyms.

The problem with multi-term synonyms in Lucene/Solr is well documented (see Jack Krupansky’s proposal, John Berryman’s excellent summary and Nolan Lawson’s query parser solution). Basically, what it boils down to is a problem with parallel term positions in the synonym-expanded token list, which stems from the way that the Lucene indexer ingests the analyzed token stream. The indexer records each token’s start position (via its position increment) but ignores its position length. This causes multi-term tokens to overlap subsequent terms in the token stream rather than maintaining a strictly parallel relation (in terms of both start and end positions) with their synonymous terms. Therefore, rather than getting a clean ‘state-graph’, we get a pattern called “sausagination” that does not accurately reflect the 1-1 mapping of terms to synonymous terms within the flow of the text (see blog post by Mike McCandless on this issue). This problem disappears if all of the synonym pairs are single tokens.

The multi-term synonym problem was described in a Lucene JIRA ticket (LUCENE-1622) which is still marked as “Unresolved”:

  • if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., “big” in the synonym “big apple” for “new york city”) causes the document to match;
  • there are problems with highlighting the original document when synonym is matched (see unit tests for an example),
  • if the synonym is of different length than the original sequence of tokens to be matched, then phrase queries spanning the synonym and the original sequence boundary won’t be found. Example “big apple” synonym for “new york city”. A phrase query “big apple restaurants” won’t match “new york city restaurants”

For some reason, the issue is marked as “Minor”!
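
The first and third bullets can be reproduced with a toy positional index. The Python sketch below is not Lucene code: `build_index` and `phrase_matches` are hypothetical helpers, and the token positions are an assumption about how “big apple” is stacked on top of “new york city” when position length is not recorded.

```python
from collections import defaultdict

def build_index(tokens_with_positions):
    """Map each term to the set of positions at which it was indexed."""
    index = defaultdict(set)
    for term, position in tokens_with_positions:
        index[term].add(position)
    return index

def phrase_matches(index, phrase):
    """True if the terms of the phrase occur at consecutive positions."""
    terms = phrase.split()
    return any(
        all(start + offset in index.get(term, ()) for offset, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )

# "new york city restaurants" with the synonym "big apple" stacked on the
# first two positions (assumed layout; no position length is stored).
stream = [
    ("new", 0), ("big", 0),
    ("york", 1), ("apple", 1),
    ("city", 2),
    ("restaurants", 3),
]
index = build_index(stream)

partial_hit = phrase_matches(index, "big")                          # True: partial synonym matches
original_hit = phrase_matches(index, "new york city restaurants")   # True
synonym_hit = phrase_matches(index, "big apple restaurants")        # False: "restaurants" is at 3, not 2
mixed_hit = phrase_matches(index, "big apple city")                 # True: a phrase nobody wrote
```

Because “apple” ends at position 1 while “city” still occupies position 2, the phrase that should match does not, and a mixed phrase that never appeared in the text does.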

The solution would seem to be to only use synonym expansion at query time, but this also has problems with phrase queries, incorrect boosting of rare synonyms due to IDF, and problems with matching multi-term synonyms – which tend to match more than they should (see above cited references). As we search wonks are wont to say, the Lucene/Solr synonym solution has problems with both precision and recall.

One solution to this problem is to avoid it altogether by making sure that the synonym list only contains single tokens. One suggested way to do this is to use one-way expansions such as big apple,new york city => nyc at both index and query time.  However, this doesn’t work since the query parser can’t ‘see’ beyond whitespace (LUCENE-2605), so a search for text:big apple gets converted to text:big text:apple and the expected synonym expansion doesn’t happen. It works if you search for text:"big apple", but having to quote phrases to get their synonyms to work defeats the purpose of having synonyms for phrases in the first place. They should “just work” whenever a user enters the phrase in a query string.

From LUCENE-2605 (also currently Unresolved):

  • The queryparser parses input on whitespace, and sends each whitespace separated term to its own independent token stream.
  • This breaks the following at query-time, because they can’t see across whitespace boundaries:
    • n-gram analysis
    • shingles
    • synonyms (especially multi-word for whitespace-separated languages)
    • languages where a ‘word’ can contain whitespace (e.g. vietnamese)
  • Its also rather unexpected, as users think their charfilters / tokenizers / tokenfilters will do the same thing at index and querytime, but in many cases they can’t. Instead, preferably the queryparser would parse around only real ‘operators’. (Italics and boldface added)

This one at least is marked ‘Major’ but at the time of this writing is still unresolved (it was opened in 2010).
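
The whitespace issue can be illustrated with a small Python model. This is a sketch under stated assumptions, not the Lucene query parser: `apply_synonyms`, `analyze_whole` and `analyze_per_term` are hypothetical names, and the synonym table simply mirrors the one-way expansion mentioned above.

```python
# One-way expansions, as in: big apple,new york city => nyc
SYNONYMS = {("big", "apple"): "nyc", ("new", "york", "city"): "nyc"}

def apply_synonyms(tokens):
    """Replace known multi-term phrases within ONE token stream (longest match first)."""
    out, i = [], 0
    max_len = max(len(key) for key in SYNONYMS)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in SYNONYMS:
                out.append(SYNONYMS[key])
                i += n
                break
        else:  # no phrase starts here, keep the token as-is
            out.append(tokens[i])
            i += 1
    return out

def analyze_whole(text):
    """Index-time style: the analyzer sees the entire text as one stream."""
    return apply_synonyms(text.lower().split())

def analyze_per_term(query):
    """Query-parser style: each whitespace-separated term is analyzed on its own."""
    out = []
    for term in query.lower().split():
        out.extend(apply_synonyms([term]))
    return out

whole = analyze_whole("big apple restaurants")        # ['nyc', 'restaurants']
per_term = analyze_per_term("big apple restaurants")  # ['big', 'apple', 'restaurants']
```

The same filter logic produces different tokens depending only on whether it sees the terms together or one at a time, which is the mismatch the ticket describes.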

From this it would seem that the solution to the problem is to avoid multi-term synonyms altogether (if possible) as the underlying problem(s) seem to be intractable – or at least elusive.  When this happens in the software world and a bug fix does not appear to be imminent, we look instead for a … workaround! 8-) This is where the AutoPhrasingTokenFilter comes in – by providing a way to convert multi-term phrases into single tokens, it can be used as a precursor to synonym mapping.  The solution has a number of side benefits – it preserves phrase searching and cross-phrase searches like ‘big apple restaurants’.  It preserves highlighting and it works at either index or query time (if you are worried about the IDF issue). Why? Because rather than trying to solve the root problem, it simply avoids it!  In other words, “If you can’t beat ‘em, join ‘em”.

Fixing the LUCENE-1622 problem with the Auto Phrasing TokenFilter

The exact use case described in LUCENE-1622 can be “fixed” by noticing that the phrases “Big Apple” and “New York City” are meant to represent a single entity – the great City of New York (another possible synonymous phrase). As described in the previous post, the AutoPhrasingTokenFilter can be used to detect these phrases in a token stream and convert them to single tokens. To preserve character position, a new attribute, replaceWhitespaceWith, was added so that the length of the autophrased token equals the original phrase length, but the token will not be split by the query parser because it no longer contains whitespace characters.  Replacing white space with another character in the indexed data also helps with highlighting – which depends on character positions. The source code for this filter is available on GitHub.

Suppose we have an autophrases.txt file consisting of:

big apple
new york city
city of new york
new york new york
new york ny
ny city
ny ny
new york

Once we configure the AutoPhrasingTokenFilter to replace whitespace characters with an underscore character (see configuration below), we can create a synonyms.txt entry like this:

big_apple,new_york_city,city_of_new_york,new_york_new_york,new_york_ny,ny_city,ny_ny,nyc

(Note that the use of the ‘_’ character will break stemming filters, so you should probably use a letter such as ‘x’; the underscore is used here for the sake of clarity.)

Note that the ‘of’ in the phrase ‘City of New York’ is normally considered to be a stopword. However, if we put the AutoPhrasing Filter before the StopFilter, it will ‘hide’ the stopword so that it can be used in the phrase. This is useful for cases where we have stop words that are contained in phrases but otherwise should be treated as noise words.
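
To make the ordering point concrete, here is a simplified Python model of auto phrasing followed by stopword removal. It is only a sketch: the greedy longest-match strategy, the `auto_phrase` and `remove_stopwords` helpers and the small stopword set are assumptions for illustration, and the sketch ignores the includeTokens option of the real filter.

```python
PHRASES = [
    "big apple", "new york city", "city of new york", "new york new york",
    "new york ny", "ny city", "ny ny", "new york",
]
STOPWORDS = {"of", "the", "in", "a"}  # illustrative subset, not Solr's stopwords.txt

def auto_phrase(tokens, phrases, replace_whitespace_with="_"):
    """Greedily join the longest known phrase starting at each position into one token."""
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max(len(p) for p in phrase_set)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrase_set:
                out.append(replace_whitespace_with.join(tokens[i:i + n]))
                i += n
                break
        else:  # no phrase starts here
            out.append(tokens[i])
            i += 1
    return out

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

tokens = "the great city of new york".split()

# Auto phrasing first: 'of' is hidden inside the phrase token and survives.
phrased_first = remove_stopwords(auto_phrase(tokens, PHRASES))
# -> ['great', 'city_of_new_york']

# Stop filter first: 'of' is gone, so only the shorter phrase can be found.
stopped_first = auto_phrase(remove_stopwords(tokens), PHRASES)
# -> ['great', 'city', 'new_york']
```

Note that `city_of_new_york` has the same character length as `city of new york`, which is why a one-character replacement keeps highlighting offsets intact.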

The configuration of the text analyzer looks like this. Note that I put the AutoPhrasingTokenFilter in the index analyzer only (with includeTokens=true so that single term queries and sub phrases will continue to hit). Putting auto phrasing in the query analyzer has no effect because of LUCENE-2605.  The SynonymFilter is also in the index analyzer only. It can also go in the query analyzer if you want – this is better if your synonyms list changes often but it does incur the IDF problem:

<fieldType name="text_autophrase" class="solr.TextField" 
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" 
            phrases="autophrases.txt" includeTokens="true"
            replaceWhitespaceWith="_" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"    
            ignoreCase="true" expand="true" />
    <filter class="solr.KStemFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.KStemFilterFactory" />
  </analyzer>
</fieldType>

Fixing the LUCENE-2605 problem:

Fixing the problem identified in LUCENE-2605 requires a little more work. Because the query parser only sends tokens to the query analyzer one at a time, there is no way to glue them together in the Analyzer’s token filter chain (even though the Solr Analysis console suggests that you can!). The solution is to do auto phrasing at query time before sending the query to the query parser. A QParserPlugin wrapper that preserves query syntax while auto phrasing the query ‘in place’ before passing it off to a ‘real’ query parser implementation does the trick. In other words, it does something similar to what was proposed in LUCENE-2605 by filtering “around” the query operators. The AutoPhrasingQParserPlugin uses the AutoPhrasingTokenFilter internally. Since this is a query parser, it requires a separate configuration in solrconfig.xml:

<requestHandler name="/autophrase" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
    <str name="defType">autophrasingParser</str>
  </lst>
</requestHandler>

<queryParser name="autophrasingParser"
             class="com.lucidworks.analysis.AutoPhrasingQParserPlugin" >
  <str name="phrases">autophrases.txt</str>
  <str name="replaceWhitespaceWith">_</str>
</queryParser>
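
The idea of auto phrasing a query ‘in place’ around its operators can be sketched in a few lines of Python. This is not the AutoPhrasingQParserPlugin implementation: `auto_phrase_query` is a hypothetical function, the phrase set is a small subset of autophrases.txt, and quoted phrases, field prefixes and parentheses are deliberately left out.

```python
PHRASES = {("big", "apple"), ("new", "york", "city"), ("new", "york")}
OPERATORS = {"AND", "OR", "NOT"}

def auto_phrase_query(query, phrases=PHRASES, replace_whitespace_with="_"):
    """Join known phrases into single tokens without touching query operators."""
    max_len = max(len(p) for p in phrases)
    tokens = query.split()
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in OPERATORS:          # operators pass through unchanged
            out.append(tokens[i])
            i += 1
            continue
        prefix, first = "", tokens[i]
        if first[0] in "+-" and len(first) > 1:   # keep a leading +/- on the phrase
            prefix, first = first[0], first[1:]
        matched = False
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = [first] + tokens[i + 1:i + n]
            if any(t in OPERATORS for t in window):
                continue                    # never phrase across an operator
            key = tuple(t.lower() for t in window)
            if key in phrases:
                out.append(prefix + replace_whitespace_with.join(key))
                i += n
                matched = True
                break
        if not matched:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(auto_phrase_query("big apple AND restaurants"))  # big_apple AND restaurants
print(auto_phrase_query("+big apple +restaurants"))    # +big_apple +restaurants
```

The rewritten string can then be handed to an ordinary query parser, which now sees `big_apple` as a single whitespace-free term.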

To test the use case identified in LUCENE-1622, several test documents were created and indexed into a Solr collection (Can you spot the theme here? New Yorkers like myself are chauvinists :-) )

<add>
  <doc>
    <field name="id">1001</field>
    <field name="name">Doc 1</field>
    <field name="text">Example from LUCENE-1622 search for New York City restaurants</field>
  </doc>
  <doc>
    <field name="id">1002</field>
    <field name="name">Doc 2</field>
    <field name="text">There are many fine restaurants in the great City of New York.</field>
  </doc>
  <doc>
    <field name="id">1003</field>
    <field name="name">Doc 3</field>
    <field name="text">Multi-term synonyms in Solr is a big problem, and its not a new one.</field>
  </doc>
  <doc>
    <field name="id">1004</field>
    <field name="name">Doc 4</field>
    <field name="text">The empire state, New York State is a big state. There are many things to do in the State of New York.</field>
  </doc>
  <doc>
    <field name="id">1005</field>
    <field name="name">Doc 5</field>
    <field name="text">Many people like to visit the Big Apple, but they wouldn't want to live there.</field>
  </doc>
  <doc>
    <field name="id">1006</field>
    <field name="name">Doc 6</field>
    <field name="text">I like New York, New York its a hell of a town - the West Side's up and the Battery's down!</field>
  </doc>
  <doc>
    <field name="id">1007</field>
    <field name="name">Doc 7</field>
    <field name="text">I have a nice house near New Paltz. New Paltz has some nice restaurants and apple orchards too.</field>
  </doc>
  <doc>
    <field name="id">1008</field>
    <field name="name">Doc 8</field>
    <field name="text">As a New York baseball fan, you can root for the Yankees or you can root for the Mets. You can't root for both.</field>
  </doc>
  <doc>
    <field name="id">1009</field>
    <field name="name">Doc 9</field>
    <field name="text">The capital of New York is Albany.</field>
  </doc>
  <doc>
    <field name="id">1010</field>
    <field name="name">Doc 10</field>
    <field name="text">The Grand Old Duke of York, he had ten thousand men. He marched them up to the top of the hill and he marched them down again.</field>
  </doc>
  <doc>
    <field name="id">1011</field>
    <field name="name">Doc 11</field>
    <field name="text">There are some great parks in NYC, including Central Park and Riverside Park.</field>
  </doc>
  <doc>
    <field name="id">1012</field>
    <field name="name">Doc 12</field>
    <field name="text">It would be nice to live at 123 Broadway, NY, NY 10013.</field>
  </doc>
</add>

Query Tests: Comparing out-of-the-box behavior with auto phrasing:

Since the city of New York is in a state of the same name, queries for ‘New York’ are ambiguous and should return documents about both. Out of the box (‘/select?q=New+York’), Solr also returns documents that contain only the single terms ‘new’ or ‘york’. Consider the documents about the ‘Grand Old Duke of York’ and ‘Multi-term synonyms in Solr’ in the result set below: they hit because they contain the terms ‘new’ and/or ‘york’ but are not really relevant to the probable intent of the query. Furthermore, documents about New York are missing because they use synonyms for New York City. So in this case, the out-of-the-box SearchHandler suffers from both precision and recall errors.

"response": {
  "numFound": 9,
  "start": 0,
  "docs": [
    {
      "id": "1009",
      "text": "The capital of New York is Albany."
    },
    {
      "id": "1006",
      "text": "I like New York, New York its a hell of a town - the West Side's up and the Battery's down!"
    },
    {
      "id": "1002",
      "text": "There are many fine restaurants in the great City of New York."
    },
    {
      "id": "1004",
      "text": "The empire state, New York State is a big state. There are many things to do in the State of New York."
    },
    {
      "id": "1001",
      "text": "Example from LUCENE-1622 search for New York City restaurants"
    },
    {
      "id": "1008",
      "text": "As a New York baseball fan, you can root for the Yankees or you can root for the Mets. You can't root for both."
    },
    {
      "id": "1010",
      "text": "The Grand Old Duke of York, he had ten thousand men."
    },
    {
      "id": "1007",
      "text": "I have a nice house near New Paltz. New Paltz has some nice restaurants and apple orchards too."
    },
    {
      "id": "1003",
      "text": "Multi-term synonyms in Solr is a big problem, and its not a new one."
    }
  ]
}

With the auto phrasing filter in place, searching for New York (/autophrase?q=New+York) returns only documents containing that phrase (including where it occurs within New York City or New York State) and excludes records that mention only synonyms such as NYC or Big Apple:

"response": {
    "numFound": 6,
    "start": 0,
    "docs": [
      {
        "id": "1009",
        "name": "Doc 9",
        "text": "The capital of New York is Albany.",
        "_version_": 1473362972290056200
      },
      {
        "id": "1002",
        "name": "Doc 2",
        "text": "There are many fine restaurants in the great City of New York.",
        "_version_": 1473362972282716200
      },
      {
        "id": "1004",
        "name": "Doc 4",
        "text": "The empire state, New York State is a big state. There are many things to do in the State of New York.",
        "_version_": 1473362972284813300
      },
      {
        "id": "1001",
        "name": "Doc 1",
        "text": "Example from LUCENE-1622 search for New York City restaurants",
        "_version_": 1473362972255453200
      },
      {
        "id": "1006",
        "name": "Doc 6",
        "text": "I like New York, New York its a hell of a town - the West Side's up and the Battery's down!",
        "_version_": 1473362972285862000
      },
      {
        "id": "1008",
        "name": "Doc 8",
        "text": "As a New York baseball fan, you can root for the Yankees or you can root for the Mets. You can't root for both.",
        "_version_": 1473362972289007600
      }
    ]
  }

And searching for New York City (/autophrase?q=new+york+city) or any of its synonyms (big apple, city of new york, nyc, etc.) returns only records about New York City. Note that records about New York State or the baseball teams are correctly excluded:

"response": {
  "numFound": 6,
  "start": 0,
  "docs": [
    {
      "id": "1002",
      "text": "There are many fine restaurants in the great City of New York."
    },
    {
      "id": "1001",
      "text": "Example from LUCENE-1622 search for New York City restaurants"
    },
    {
      "id": "1005",
      "text": "Many people like to visit the Big Apple, but they wouldn't want to live there."
    },
    {
      "id": "1006",
      "text": "I like New York, New York its a hell of a town - the West Side's up and the Battery's down!"
    },
    {
      "id": "1011",
      "text": "There are some great parks in NYC, including Central Park and Riverside Park."
    },
    {
      "id": "1012",
      "text": "It would be nice to live at 123 W Broadway, NY, NY 10013. "
    }
  ]
}

Finally, getting back to the original use case reported in LUCENE-1622: a boolean search for restaurants combined with any synonym of NYC, such as big apple AND restaurants (or +big apple +restaurants), returns only records about the New York City restaurant scene:

"response": {
  "numFound": 2,
  "start": 0,
  "docs": [
    {
      "id": "1002",
      "text": "There are many fine restaurants in the great City of New York."
    },
    {
      "id": "1001",
      "text": "Example from LUCENE-1622 search for New York City restaurants"
    }
  ]
}

Conclusion

The AutoPhrasingTokenFilter can be an important tool in solving one of the more difficult problems with Lucene/Solr search – how to deal with multi-term synonyms. At the same time, we can mitigate another serious problem that all search engines have – their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”), the search engine is better able to return results based on ‘what’ the user is looking for rather than documents containing words that match the query. We are moving from searching with a “bag of words” to searching a “bag of things”.

Published at DZone with permission of Yonik Seeley. See the original article here.
