“Car sale application” – Unicode Collation, sorting text in a language-sensitive way (part 4)

In the third part of our “Car sale” application series we added location data, including the city associated with each car. Shortly afterwards we made it possible to sort by city by simply modifying the schema:

<field name="city_sort" type="lowercase" indexed="true" stored="false" />
...
<copyField source="city" dest="city_sort"/>

It turned out that sorting on the city_sort field did not work as we expected, all because of the Polish diacritics in the city names. What can we do about it?

Requirements specification

Let’s check whether sorting on the “city_sort” field really misbehaves when the values contain Polish characters. When we send the query:

q=*:*&fl=city&sort=city_sort+asc

we get the result:

<result name="response" numFound="6" start="0">
   <doc>
      <str name="city">Białystok</str>
   </doc>
   <doc>
      <str name="city">Koszalin</str>
   </doc>
   <doc>
      <str name="city">Szczecin</str>
   </doc>
   <doc>
      <str name="city">Warszawa</str>
   </doc>
   <doc>
      <str name="city">Świdnik</str>
   </doc>
   <doc>
      <str name="city">Łowicz</str>
   </doc>
</result>

That’s not what we expect. We would like to get:

<result name="response" numFound="6" start="0">
   <doc>
      <str name="city">Białystok</str>
   </doc>
   <doc>
      <str name="city">Koszalin</str>
   </doc>
   <doc>
      <str name="city">Łowicz</str>
   </doc>
   <doc>
      <str name="city">Szczecin</str>
   </doc>
   <doc>
      <str name="city">Świdnik</str>
   </doc>
   <doc>
      <str name="city">Warszawa</str>
   </doc>
</result>
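
The misplaced names in the first listing come from comparing raw character codes: Ł (U+0141) and Ś (U+015A) have higher code points than any unaccented Latin letter, so they end up after “W”. A minimal Java sketch of that effect follows; the class and method names are hypothetical, and it only imitates a lowercased, code-based sort rather than Solr’s actual implementation. The relative order of the two accented names may differ from the listing above; the point is that both land after Warszawa.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class CodePointSortDemo {

    // Sorts by the lowercased value using String.compareTo,
    // which compares UTF-16 code unit values and knows nothing about Polish.
    static List<String> sortByCharCodes(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparing((String s) -> s.toLowerCase(Locale.ROOT)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> cities = Arrays.asList(
                "Warszawa", "Łowicz", "Świdnik", "Szczecin", "Koszalin", "Białystok");
        // prints [Białystok, Koszalin, Szczecin, Warszawa, Łowicz, Świdnik]
        System.out.println(sortByCharCodes(cities));
    }
}
```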

To make sorting work correctly, we will use the “solr.CollationKeyFilter” filter.

solr.CollationKeyFilter

The solr.CollationKeyFilter filter is used at index time: it indexes special “sort keys” into the sort field. It lets us choose the collator for the desired language and country. We can also choose the collation strength, which determines the minimum level of difference considered significant during comparison. For example:

<filter class="solr.CollationKeyFilterFactory" language="es" country="ES" strength="primary" />

This example configures solr.CollationKeyFilterFactory to handle the Spanish language with primary strength.
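
To get a feel for what the strength setting means, here is a small sketch using the JDK’s java.text.Collator, which offers the same language, country and strength knobs. The class name and the helper method are hypothetical, and this is only an illustration of the concept, not the filter’s code. At primary strength accent and case differences are ignored, at secondary strength accents start to matter, and at tertiary strength case matters as well.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationStrengthDemo {

    // Builds a collator from the same three settings the filter takes:
    // language, country and strength.
    static Collator collator(String language, String country, int strength) {
        Locale locale = new Locale.Builder().setLanguage(language).setRegion(country).build();
        Collator c = Collator.getInstance(locale);
        c.setStrength(strength);
        return c;
    }

    public static void main(String[] args) {
        Collator primary = collator("es", "ES", Collator.PRIMARY);
        Collator secondary = collator("es", "ES", Collator.SECONDARY);
        Collator tertiary = collator("es", "ES", Collator.TERTIARY);

        // Accent difference: ignored at primary, significant from secondary up.
        System.out.println(primary.compare("a", "á") == 0);    // true
        System.out.println(secondary.compare("a", "á") == 0);  // false

        // Case difference: ignored at secondary, significant at tertiary.
        System.out.println(secondary.compare("a", "A") == 0);  // true
        System.out.println(tertiary.compare("a", "A") == 0);   // false
    }
}
```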

Schema.xml changes

1. New field type definition:
<fieldType name="polishLowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.CollationKeyFilterFactory" language="pl" country="PL" strength="primary" />
  </analyzer>
</fieldType>

As you can see, this is the definition of the existing “lowercase” type with solr.CollationKeyFilter added to handle the Polish language. The type will be used for fields whose data contains Polish characters.

2. New “city_sort” field definition:
  • let’s change the type of the “city_sort” field to “polishLowercase”:
    <field name="city_sort" type="polishLowercase" indexed="true" stored="false" />
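
The effect of the “polishLowercase” chain can be approximated in plain Java: treat the whole value as one token, lowercase it, trim it, and turn it into a collation key for the Polish locale at primary strength. The sketch below does this with java.text.Collator; the class and method names are hypothetical, the result depends on the Polish collation rules shipped with the JDK, and the keys Solr actually stores are encoded differently.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class PolishSortKeyDemo {

    private static final Collator POLISH = createCollator();

    // Polish collator at primary strength, mirroring language="pl" country="PL" strength="primary".
    private static Collator createCollator() {
        Locale polish = new Locale.Builder().setLanguage("pl").setRegion("PL").build();
        Collator c = Collator.getInstance(polish);
        c.setStrength(Collator.PRIMARY);
        return c;
    }

    // Whole value as a single token, lowercased, trimmed, then converted to a sort key.
    static CollationKey sortKey(String city) {
        String normalized = city.toLowerCase(Locale.ROOT).trim();
        return POLISH.getCollationKey(normalized);
    }

    // Orders the names by comparing their precomputed collation keys.
    static List<String> sortCities(List<String> cities) {
        List<String> sorted = new ArrayList<>(cities);
        sorted.sort(Comparator.comparing(PolishSortKeyDemo::sortKey));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> cities = Arrays.asList(
                "Warszawa", "Łowicz", "Świdnik", "Szczecin", "Koszalin", "Białystok");
        System.out.println(sortCities(cities));
    }
}
```

On a JDK that includes the Polish collation rules, this should place Łowicz between Koszalin and Szczecin, and Świdnik between Szczecin and Warszawa.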

Functional tests

Before we check whether the field type change does what we need, remember that solr.CollationKeyFilter works at index time, so we need to re-index all of the data.

Now let’s check our test query result:

q=*:*&fl=city&sort=city_sort+asc

It appears that the result is correct:

<result name="response" numFound="6" start="0">
   <doc>
      <str name="city">Białystok</str>
   </doc>
   <doc>
      <str name="city">Koszalin</str>
   </doc>
   <doc>
      <str name="city">Łowicz</str>
   </doc>
   <doc>
      <str name="city">Szczecin</str>
   </doc>
   <doc>
      <str name="city">Świdnik</str>
   </doc>
   <doc>
      <str name="city">Warszawa</str>
   </doc>
</result>

The end

Yet another reported problem has been solved. We improved the sorting mechanism so that it handles Polish characters by adding solr.CollationKeyFilter, which fully met our needs. Now we can only wait for further reports and improvements :)

Published at DZone with permission of {{ articles[0].authors[0].realName }}, DZone MVB. (source)

Opinions expressed by DZone contributors are their own.