Solr filters: KeepWordFilter

This time I decided to look at one of the less common filters available in the standard distribution of Solr. The first one I picked up is a filter called KeepWordFilter.

Let’s start

First, a few words about what this filter does. As the name indicates, the main purpose of this filter is to keep words: it passes through only the tokens found on a given list and discards all the others. In other words, it does the opposite of the filter called StopFilter. So how does this filter work? I’ll talk about this in a moment. Let’s start with the definition of the field type in the schema.xml file:

<fieldtype name="keepwords" class="solr.TextField">
   <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.KeepWordFilterFactory" words="words.txt" ignoreCase="true"/>
   </analyzer>
</fieldtype>

As shown in the above definition, in addition to the standard class attribute, the filter takes two more attributes:

  • words – the name of the file containing the list of words to keep
  • ignoreCase – true | false value indicating whether case is ignored when comparing tokens with the list.

File contents

Let’s assume that the words.txt file contains the following words:

ala
ma
kota

If you index the phrase “Ala ma kota, a kot ma Alę”, only the tokens matching “ala”, “ma”, “kota” and “ma” will be written into the index, because only those terms are defined in the words.txt file. You can check this on the analysis page of the Solr administration panel.
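The filtering step itself can be sketched in a few lines of Python. This is only a simplified model of the behaviour described above, not Solr code: the function name keep_words and its parameters are made up for illustration, and the token list is assumed to arrive already split and with punctuation removed (a plain whitespace split would leave the comma attached to “kota,” and that token would then not match).

```python
def keep_words(tokens, words, ignore_case=True):
    """Return only the tokens that appear on the keep list (simplified model)."""
    if ignore_case:
        # Compare in lower case, but return each token unchanged.
        allowed = {w.lower() for w in words}
        return [t for t in tokens if t.lower() in allowed]
    allowed = set(words)
    return [t for t in tokens if t in allowed]


# Contents of words.txt from the example, one word per line.
words = ["ala", "ma", "kota"]

# Assumed tokenizer output for the example phrase, punctuation already removed.
tokens = ["Ala", "ma", "kota", "a", "kot", "ma", "Alę"]

print(keep_words(tokens, words))                     # ['Ala', 'ma', 'kota', 'ma']
print(keep_words(tokens, words, ignore_case=False))  # ['ma', 'kota', 'ma']
```

Note that this sketch only drops tokens and never rewrites them, so “Ala” keeps its capital letter; if lower-cased terms are wanted, that would be the job of a separate lower-casing step.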

A few words at the end

Although I have never used this filter, it seems to me a good choice when you need to store the values of enumerated types, or when you are interested in a finite (or, even better, small and known in advance) list of values, such as categories, and filtering the information at the application level is impossible or very difficult.



Published at DZone with permission of the author, DZone MVB.

Opinions expressed by DZone contributors are their own.
