
Developing Your Own Solr Filter - Part 2

By Rafał Kuć · Feb. 06, 13


In the previous entry, “Developing Your Own Solr Filter,” we showed how to implement a simple filter and how to use it in Apache Solr. Recently, one of our readers asked if we could extend the topic and show how to write more than a single token into the token stream. We decided to go for it and extend the previous blog entry about filter implementation.

Assumptions

Let’s assume that we want not only to return the reversed word, but also to keep the original one. So if we passed the word fine to analysis, we would like to get both fine and enif returned. In order to achieve that, we will modify the filter created in the previous entry.

We will omit all the details of configuration and installation, so if you would like to read about them, please refer to the previous post.

Solr Version

We would also like to note that in this post we’ve decided to go with the newest available Solr version, 4.1. In order to have the following filter working on 3.6, just use the filter factory shown in the previous entry.

Filter Factory

The only difference when it comes to the filter factory is the class we are extending. Because we are using Solr 4.1, we extend TokenFilterFactory from the org.apache.lucene.analysis.util package:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class ReverseFilterFactory extends TokenFilterFactory {
 @Override
 public TokenStream create(TokenStream ts) {
  return new ReverseFilter(ts);
 }
}

Filter

The filter was modified a bit more and looks like this:

import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class ReverseFilter extends TokenFilter {
 private CharTermAttribute charTermAttr;
 private PositionIncrementAttribute posIncAttr;
 private Queue<char[]> terms;

 protected ReverseFilter(TokenStream ts) {
  super(ts);
  this.charTermAttr = addAttribute(CharTermAttribute.class);
  this.posIncAttr = addAttribute(PositionIncrementAttribute.class);
  this.terms = new LinkedList<char[]>();
 }

 @Override
 public boolean incrementToken() throws IOException {
  if (!terms.isEmpty()) {
   // emit a pending reversed token at the next position
   char[] buffer = terms.poll();
   charTermAttr.setEmpty();
   charTermAttr.copyBuffer(buffer, 0, buffer.length);
   posIncAttr.setPositionIncrement(1);
   return true;
  }

  if (!input.incrementToken()) {
   return false;
  } else {
   // we reverse the token
   int length = charTermAttr.length();
   char[] buffer = charTermAttr.buffer();
   char[] newBuffer = new char[length];
   for (int i = 0; i < length; i++) {
    newBuffer[i] = buffer[length - 1 - i];
   }
   // we place the new buffer in the terms list
   terms.add(newBuffer);
   // we return true and leave the original token unchanged
   return true;
  }
 }
}

Implementation Description

Let’s talk about the differences between the above filter and the version shown in the previous blog post:

  • The terms queue field – a list that will be used to hold term buffers that still need to be written to the token stream.
  • The PositionIncrementAttribute – the attribute responsible for setting a token’s position in the token stream; we register it in the constructor, where we also initialize the terms queue.
  • The first condition in incrementToken() – checks whether there are tokens in the list waiting to be emitted. If there are, we take the first token from the list (removing it), set the term buffer, set its position in the token stream, and return true to signal that processing should continue. It is worth noting that we didn’t fetch a new token from the token stream, because we didn’t call input.incrementToken().
  • The call to input.incrementToken() – checks whether there are tokens left for processing. If there are not, we just return false and end processing.
  • The else branch – we reverse the token and add it to the terms list. We don’t modify the token currently present in the stream, and we return true. By doing this we inform Solr that we want to continue processing, so Solr will call our filter’s incrementToken() method again. On that next call we will execute the first branch, because our list will contain a new token.
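To make the emission order concrete, here is a hypothetical standalone sketch (no Lucene on the classpath; class and method names are our own invention, not part of the original filter) that mimics the queue-driven behavior described above: each call first drains pending reversed terms, otherwise it pulls and reverses the next original term.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class ReverseEmissionSketch {
    private final Queue<char[]> terms = new ArrayDeque<>();
    private final List<String> input;
    private int pos = 0;

    public ReverseEmissionSketch(List<String> input) {
        this.input = input;
    }

    // Returns the next emitted term, or null when the stream is exhausted,
    // mirroring incrementToken() returning false.
    public String next() {
        if (!terms.isEmpty()) {
            // a reversed term is pending: emit it before pulling new input
            return new String(terms.poll());
        }
        if (pos >= input.size()) {
            return null;
        }
        String original = input.get(pos++);
        // reverse the term and queue it for the following call
        char[] reversed = new char[original.length()];
        for (int i = 0; i < original.length(); i++) {
            reversed[i] = original.charAt(original.length() - 1 - i);
        }
        terms.add(reversed);
        // emit the original term unchanged
        return original;
    }

    public static void main(String[] args) {
        ReverseEmissionSketch sketch =
            new ReverseEmissionSketch(List.of("ala", "ma", "kota"));
        List<String> out = new ArrayList<>();
        for (String t = sketch.next(); t != null; t = sketch.next()) {
            out.add(t);
        }
        System.out.println(out); // [ala, ala, ma, am, kota, atok]
    }
}
```

Each original token is immediately followed by its reversed form, which is exactly the interleaving the real filter produces in the token stream.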

There is one more thing worth noting in our opinion: the position increment attribute. In the above implementation, each reversed token is written at the next position in the token stream compared to the original token. To put it simply, those tokens will be treated as separate words. In a moment we will check what happens when the reversed tokens are put at the same positions as the original ones.
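The effect of the increment is easy to see once you remember that position increments are cumulative: each token's absolute position is the previous position plus its increment. A minimal sketch (the class and helper are hypothetical, for illustration only) of how the two settings assign positions:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PositionIncrementDemo {
    // Accumulates per-token increments into absolute stream positions.
    // The running position starts at 0 and each increment is added to it.
    static Map<String, Integer> positions(List<String> tokens,
                                          List<Integer> increments) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int pos = 0;
        for (int i = 0; i < tokens.size(); i++) {
            pos += increments.get(i);
            result.put(tokens.get(i), pos);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("ma", "am", "kota", "atok");
        // increment 1 for reversed tokens: each lands on a new position
        System.out.println(positions(tokens, List.of(1, 1, 1, 1)));
        // increment 0 for reversed tokens: each shares its original's position
        System.out.println(positions(tokens, List.of(1, 0, 1, 0)));
    }
}
```

With increment 1 the reversed tokens get positions of their own; with increment 0 each reversed token sits on the same position as the original it came from.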

Does It Work?

The work of the above filter can be illustrated using the Solr administration panel:

[Screenshot: analysis results with position increment set to 1]

As you can see, both the original values ala, ma, and kota and the reversed ones ala, am, and atok were put at new positions (the position attribute). So it works as intended.

Let’s Change the Position Increment

So let’s check what happens when we change the following line:

posIncAttr.setPositionIncrement(1);

to the following one:

posIncAttr.setPositionIncrement(0);

How It Works

Again, let’s illustrate how it works by looking at the Solr administration panel:

[Screenshot: analysis results with position increment set to 0]

As you can see, after the change both the original token and its reversed version were put at the same position, which is what we wanted to achieve. Because of that, we can now run queries like q="ala am kota" (note the am word). What we gain is the ability to use original or reversed tokens in phrase queries.

To Sum Up

As you can see, creating your own filters is not rocket science, at least when it comes to the Lucene and Solr part. What we get from Lucene and Solr is a nice set of features we can use to control what finally ends up in the token stream, for example thanks to token stream attributes. Of course, the complexity of the code will depend on your business logic, but that is far beyond the scope of this post :)


Published at DZone with permission of Rafał Kuć, DZone MVB. See the original article here.
