“Car sale application” – solr.ReversedWildcardFilter – let’s optimize wildcard queries (part 8)

“Car sale application” users have started to use wildcard queries more and more often, which forced us to think about optimizing them. solr.ReversedWildcardFilter comes to the rescue.

solr.ReversedWildcardFilter

The solr.ReversedWildcardFilter filter emits additional tokens, which are reversed copies of the original tokens, and indexes them so that leading wildcard queries run faster. The filter supports the following init arguments:

  • withOriginal – if true, then produce both original and reversed tokens at the same positions. If false, then produce only reversed tokens.
  • maxPosAsterisk – maximum position (1-based) of the asterisk wildcard (‘*’) that triggers the reversal of the query term. An asterisk at a position higher than this value will not cause the reversal.
  • maxPosQuestion – maximum position (1-based) of the question mark wildcard (‘?’) that triggers the reversal of the query term.
  • maxFractionAsterisk – an additional condition that triggers the reversal if the asterisk (‘*’) position is less than this fraction of the query token length.
  • minTrailing – minimum number of trailing characters in the query token after the last wildcard character. For good performance this should be set to a value larger than 1.
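The way these arguments interact can be sketched in a few lines of Python. This is only an illustration of the rules as described in the list above, not Solr's actual implementation: the function name `should_reverse` is hypothetical, and the way the fraction check counts the asterisk position is an assumption. The keyword defaults mirror the values the filter uses when no arguments are given.

```python
def should_reverse(term, max_pos_asterisk=2, max_pos_question=1,
                   max_fraction_asterisk=0.0, min_trailing=2):
    """Decide whether a wildcard query term should be reversed.

    Illustrative sketch of the rules described above; edge cases in
    Solr itself may differ.
    """
    pos_q = term.find("?")  # 0-based index of the first '?', -1 if absent
    pos_a = term.find("*")  # 0-based index of the first '*', -1 if absent
    if pos_q == -1 and pos_a == -1:
        return False  # not a wildcard term at all

    # Number of characters that follow the last wildcard in the term.
    last_wildcard = max(term.rfind("?"), term.rfind("*"))
    trailing = len(term) - last_wildcard - 1
    if trailing < min_trailing:
        return False

    # Positions in the argument descriptions are 1-based, hence the "+ 1".
    if pos_q != -1 and pos_q + 1 <= max_pos_question:
        return True
    if pos_a != -1 and pos_a + 1 <= max_pos_asterisk:
        return True

    # Optional fraction rule (disabled when the fraction is 0.0).
    # Assumption: the 1-based asterisk position is compared to the fraction.
    if (max_fraction_asterisk > 0.0 and pos_a != -1
            and pos_a + 1 < len(term) * max_fraction_asterisk):
        return True
    return False
```

With the default values, `should_reverse("*dx")` returns `True`, while `should_reverse("lan*")` and `should_reverse("r?x")` return `False`, because neither leaves two characters after the last wildcard.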


schema.xml changes

The new filter is added to the “text” field type:

<fieldType name="text" class="solr.TextField"
	positionIncrementGap="100">
	<analyzer type="index">
		<tokenizer class="solr.WhitespaceTokenizerFactory" />
		<filter class="solr.PatternReplaceFilterFactory" pattern="'"
			replacement="" replace="all" />
		<filter class="solr.WordDelimiterFilterFactory"
			generateWordParts="1" generateNumberParts="1" catenateWords="1"
			stemEnglishPossessive="0" />
		<filter class="solr.LowerCaseFilterFactory" />
		<filter class="solr.ReversedWildcardFilterFactory" />
	</analyzer>
	<analyzer type="query">
		<tokenizer class="solr.WhitespaceTokenizerFactory" />
		<filter class="solr.PatternReplaceFilterFactory" pattern="'"
			replacement="" replace="all" />
		<filter class="solr.WordDelimiterFilterFactory"
			generateWordParts="1" generateNumberParts="1" catenateWords="1"
			stemEnglishPossessive="0" />
		<filter class="solr.LowerCaseFilterFactory" />
	</analyzer>
</fieldType>

The solr.ReversedWildcardFilterFactory filter is added only to the index analyzer. We do not define any arguments in the filter definition because we want to use the default configuration, which is:

  • withOriginal – “true”, we would like to produce original tokens
  • maxPosAsterisk – 2
  • maxPosQuestion – 1
  • maxFractionAsterisk – 0.0f (disabled)
  • minTrailing – 2


Sample data

Let’s index some sample data:

<add>
  <doc>
    <field name="id">1</field>
    <field name="make">Lancia</field>
    <field name="model">Delta</field>
    ...
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="make">Land Rover</field>
    <field name="model">Defender</field>
    ...
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="make">Acura</field>
    <field name="model">MDX</field>
    ...
  </doc>
  <doc>
    <field name="id">4</field>
    <field name="make">Acura</field>
    <field name="model">RDX</field>
    ...
  </doc>
  <doc>
    <field name="id">5</field>
    <field name="make">Acura</field>
    <field name="model">RSX</field>
    ...
  </doc>
</add>

Let’s create queries

Let me remind you that the default search field is the “content” field, which contains, among others, the “make” and “model” fields. To analyse the query results and the behaviour of the solr.ReversedWildcardFilter filter, we will set the “stored” attribute of the “content” field to “true”. We will also add the debugQuery parameter, which lets us find out which tokens (original or reversed) are used in query processing.
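When such requests are sent from a script rather than typed into a browser, the wildcard characters need to be URL-encoded, because a raw ‘?’ inside the term would otherwise look like the start of a query string. The Python sketch below only builds the query string. The helper name `build_debug_query` and the base URL are hypothetical and not taken from the application.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust to the actual host and core.
SELECT_URL = "http://localhost:8983/solr/select"


def build_debug_query(term, fields=("id", "content")):
    """Return an encoded query string that asks Solr for debug output."""
    params = {
        "q": term,                # the wildcard term, e.g. "*dx" or "r?x"
        "fl": ",".join(fields),   # fields to return
        "debugQuery": "on",       # include the parsed query in the response
    }
    return urlencode(params)


# Example: the full URL for the leading-wildcard query.
print(SELECT_URL + "?" + build_debug_query("*dx"))
```

Decoding the generated string gives back the same three parameters used in the examples that follow.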

1.   ?q=lan*&fl=id,content&debugQuery=on

<result name="response" numFound="2" start="0">
  <doc>
    <arr name="content">
      <str>Lancia</str>
      <str>Delta</str>
      <str>2002</str>
    </arr>
    <str name="id">1</str>
  </doc>
  <doc>
    <arr name="content">
      <str>Land Rover</str>
      <str>Defender</str>
      <str>2002</str>
    </arr>
    <str name="id">2</str>
  </doc>
</result>
<lst name="debug">
  <str name="rawquerystring">lan*</str>
  <str name="querystring">lan*</str>
  <str name="parsedquery">content:lan*</str>
  <str name="parsedquery_toString">content:lan*</str>
  ...
</lst>

We used the asterisk wildcard (‘*’) at the end of the query (position = 4), which is beyond maxPosAsterisk and leaves no trailing characters, so the original tokens were used:

<str name="parsedquery">content:lan*</str>

2.   ?q=*dx&fl=id,content&debugQuery=on

<result name="response" numFound="2" start="0">
  <doc>
    <arr name="content">
      <str>Acura</str>
      <str>MDX</str>
      <str>2002</str>
    </arr>
    <str name="id">3</str>
  </doc>
  <doc>
    <arr name="content">
      <str>Acura</str>
      <str>RDX</str>
      <str>2003</str>
    </arr>
    <str name="id">4</str>
  </doc>
</result>
<lst name="debug">
  <str name="rawquerystring">*dx</str>
  <str name="querystring">*dx</str>
  <str name="parsedquery">content:&#1;xd*</str>
  <str name="parsedquery_toString">content:&#1;xd*</str>
  ...
</lst>

We used the asterisk wildcard (‘*’) at the beginning of the query (position = 1), and there are two trailing characters after the last wildcard. That’s why the reversed tokens were used:

<str name="parsedquery">content:&#1;xd*</str>

As we can see, reversed tokens carry a special prefix so that they cannot collide with, or falsely match, the original tokens.
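A small Python sketch can make the role of the prefix concrete. It assumes the marker is the control character U+0001, which is what the character reference in the debug output suggests. The helper names are hypothetical, and the matching is reduced to a plain prefix test instead of a real index lookup.

```python
MARKER = "\u0001"  # assumed marker character, inferred from the debug output


def index_tokens(token, with_original=True):
    """Tokens emitted at index time for one (already lowercased) token."""
    reversed_token = MARKER + token[::-1]
    return [token, reversed_token] if with_original else [reversed_token]


def reverse_query_term(term):
    """Turn a leading-wildcard term such as '*dx' into a marked prefix term."""
    return MARKER + term[::-1]


def prefix_matches(query_term, token):
    """Evaluate a term that ends in a single trailing '*' against one token."""
    assert query_term.endswith("*")
    return token.startswith(query_term[:-1])


query = reverse_query_term("*dx")  # marker + "xd*"

# "mdx" matches through its reversed, marked token.
print(any(prefix_matches(query, t) for t in index_tokens("mdx")))

# An unrelated original token "xda" does not match, because it lacks the
# marker; without the marker the prefix "xd" would match it by accident.
print(any(prefix_matches(query, t) for t in index_tokens("xda")))
```

The leading wildcard becomes a trailing one, so the lookup turns into an ordinary prefix search over the reversed tokens, which is much cheaper than scanning every term.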

3.  ?q=r?x&fl=id,content&debugQuery=on
<result name="response" numFound="2" start="0">
  <doc>
    <arr name="content">
      <str>Acura</str>
      <str>RDX</str>
      <str>2003</str>
    </arr>
    <str name="id">4</str>
  </doc>
  <doc>
    <arr name="content">
      <str>Acura</str>
      <str>RSX</str>
      <str>2006</str>
    </arr>
    <str name="id">5</str>
  </doc>
</result>
<lst name="debug">
  <str name="rawquerystring">r?x</str>
  <str name="querystring">r?x</str>
  <str name="parsedquery">content:r?x</str>
  <str name="parsedquery_toString">content:r?x</str>
  ...
</lst>

We used the question mark wildcard (‘?’) at position 2, and there is only one trailing character after the wildcard, so the original tokens were used:

<str name="parsedquery">content:r?x</str>

The end

Thanks to the solr.ReversedWildcardFilter filter, we have successfully optimized wildcard queries. “Car sale application” users can now effectively use them :)

Published at DZone with permission of Rafał Andrzejewski, DZone MVB. See the original article here.