
Solr vs. Elasticsearch for Controlling Matching

Solr vs. Elasticsearch! There's been a TON of posts. But has anyone ever compared them when it comes to traditional, you know, controlling regular ol' search results!?



This article summarizes parts of Relevant Search. Use discount code turnbullmu to get 38% off!


In my work, I spend a lot of time helping people improve the relevance of their Solr/Elasticsearch search results (I'm writing a book on the subject!). In this first in a series of articles, I want to help you see the forest for the trees when it comes to deciding between these two technologies for traditional search problems: problems where you're likely to need to tune the relevance of the results for your needs. Frankly, you probably need to do this, even if you don't think you do. Everyone's headed for that search bar. Is your search going to "get" them or just return a random set of 10 results?

Anyway, evaluating Solr vs. Elasticsearch with regard to relevance comes down to comparing them across three criteria:

  1. Ability to control matching — what results can be said to match/not match the search?
  2. Ability to control ranking — within the set of matched results, what results are most relevant?
  3. Ability to create plugins — how deeply can you manipulate the matching/ranking of results beyond APIs?

In this blog article, I'm going to begin with item 1: comparing how these search engines let you manipulate matching.

More Alike Than Different

Before we dive in, it's important to note that Solr and Elasticsearch are in many ways more alike than different. Both search engines give you a tremendous amount of ability to manipulate search relevance at a very fundamental level. Being open source search engines based on Lucene, they sit apart from black-box commercial offerings that limit how much relevance tuning you can do. They both support an extremely wide array of use cases, from traditional document-based search to geo-aware and graph search to, well, you name it. Both provide rich query DSLs and text analysis DSLs that let you tightly control the matching and ranking process. This is why you can apply almost all of our book Relevant Search to either search engine just fine!

That being said, while the core features are the same, the ergonomics and the APIs can make your search solution easy or hard! Let's dive into matching to see how one search engine makes controlling matching rather straightforward, while the other requires deeper thinking and engineering work.

Solr vs. Elasticsearch Controlling Matching

Controlling matching comes down to fine-tuning how text transforms into individual tokens. The search engine treats tokens as the fundamental, atomic entity that must compare exactly for a "match" to be declared. You need a normalization process to get tokens from a search to match tokens from a document. For example, a search for all uppercase "CAT" will not match an all lowercase "cat" without the proper text manipulation steps to lowercase both sides. A less silly example is deciding whether you should treat "kitty" as a synonym for "cat." Should a "kitty" search term be normalized to "cat"? When you think of all the many forms of English words (run, running, ran?) you can see how important this normalization process is for deciding exactly when to match and when not to match.
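To make the "CAT" vs. "cat" and "kitty" vs. "cat" examples concrete, here is a minimal Python sketch of the idea. It is a toy, not how either search engine is implemented: `SYNONYMS`, `normalize`, and `matches` are hypothetical names invented for illustration, and the whitespace split stands in for a real tokenizer.

```python
# Toy illustration only: these helpers are hypothetical, not a Solr or
# Elasticsearch API.
SYNONYMS = {"kitty": "cat"}  # assumed rule: normalize "kitty" to "cat"


def normalize(text):
    """Lowercase, split on whitespace, and map synonyms to one canonical token."""
    tokens = text.lower().split()
    return [SYNONYMS.get(token, token) for token in tokens]


def matches(query, document):
    """A document matches when every normalized query token appears in it exactly."""
    doc_tokens = set(normalize(document))
    return all(token in doc_tokens for token in normalize(query))


# Without normalization, the raw token "CAT" is simply not equal to "cat".
print("CAT" in "my cat sleeps".split())

# With the same normalization applied to both sides, the tokens line up.
print(matches("CAT", "my cat sleeps"))
print(matches("kitty", "my cat sleeps"))
print(matches("dog", "my cat sleeps"))
```

The point is that both sides pass through the same normalization before the exact comparison happens.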

This normalization task is controlled by you via a feature known as analysis (as we've discussed). Analysis controls how text from documents and searches is transformed into tokens. The art of search relevance often boils down to controlling this process to carefully discriminate between matches and non-matches.

Both Solr and Elasticsearch support the underlying library of Lucene analyzers. The ergonomics of creating an analyzer differ only superficially. Solr has (until recently) encouraged you to work in an XML configuration file (schema.xml) to control this process. Elasticsearch, on the other hand, gives you a RESTful JSON API for doing the same work.

Both search engines give you the same ingredients to work with to build analyzers. You manipulate the original character stream through several character filters. You break up the character stream into tokens using a tokenizer. Finally you apply a configurable sequence of token filters to trim, splice, delete, generate, and otherwise manipulate the tokens.
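The three stages can be sketched in plain Python. This is a rough model of the pipeline's shape, under the assumption that each stage is just a function; the function names and the tiny stopword list are made up for illustration and are not Lucene classes.

```python
import re


def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer: break the character stream into tokens on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stopwords=frozenset({"the", "a", "an"})):
    """Token filter: delete tokens found in a (hypothetical) stopword list."""
    return [token for token in tokens if token not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then the tokenizer, then token filters in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>The</b> Cat sat",
    char_filters=[strip_html],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter, stop_filter],
)
print(tokens)  # ['cat', 'sat']
```

Order matters here: the stop filter only catches "The" because the lowercase filter ran first.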

To tightly control this process, in addition to analyzing the content placed in the search engine, both Solr and Elasticsearch let you specify a separate query analyzer. Recall that the query analyzer transforms the search string into tokens to control how they'll match against tokens generated from text placed in the index. This is great, for example, if you want to expand your search to include a list of extra synonyms in addition to the original search query.
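On the Elasticsearch side, a separate query analyzer might look roughly like the index settings below, built here as a Python dict so it can be serialized to JSON and sent when creating an index. The field name `title`, the filter name `my_synonyms`, and the analyzer names are assumptions made up for this sketch. The mapping layout also assumes a recent, typeless Elasticsearch version with `text` fields; older versions nest the properties under a type name.

```python
import json

# Sketch of index settings: one analyzer for indexing, another (with synonyms)
# applied only to query text. All names below are hypothetical.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms": ["kitty, cat"],  # treat the two terms as equivalent
                }
            },
            "analyzer": {
                "index_plain": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                },
                "query_with_synonyms": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_synonyms"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "index_plain",                 # used at index time
                "search_analyzer": "query_with_synonyms",  # used at query time
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

In Solr, the comparable split lives in the field type definition, with separate index and query analyzer sections.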

Solr, however, comes with two serious deficiencies with query analysis. The biggest source of heartache is the so-called sea biscuit problem. To make a long story short, Solr's default query parser breaks up text by whitespace before passing it to a query time analyzer. So if you have a synonym rule that maps the phrase "sea biscuit" to a single word seabiscuit, it doesn't work!

The reason is that each whitespace-delimited query term is analyzed entirely separately, oblivious of the remaining text in the query string. The query-time analyzer takes as input the single word [sea]. The synonym filter does not see the token [biscuit] following. The problem isn't limited to synonyms; it impacts any process that needs to manipulate more than one token at a time. This can be rather limiting and surprising when trying to control the matching process. Cue sad trombone.
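Here is a small Python simulation of that behavior. It is not Solr's parser code, just a model of the two orderings: analyzing the whole query string at once versus splitting on whitespace first and analyzing each chunk alone. The rule table and function names are hypothetical.

```python
# Assumed synonym rule: the two-token phrase "sea biscuit" becomes "seabiscuit".
SYNONYM_RULES = {("sea", "biscuit"): ["seabiscuit"]}


def synonym_filter(tokens):
    """Replace any run of tokens matching a rule; it must see neighboring tokens."""
    out, i = [], 0
    while i < len(tokens):
        for phrase, replacement in SYNONYM_RULES.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.extend(replacement)
                i += len(phrase)
                break
        else:
            # No rule started at this position, so keep the token as-is.
            out.append(tokens[i])
            i += 1
    return out


def analyze(text):
    """Query-time analyzer: lowercase, tokenize on whitespace, apply synonyms."""
    return synonym_filter(text.lower().split())


def whole_string_query(query):
    """The analyzer sees the full query, so multi-token rules can fire."""
    return analyze(query)


def whitespace_split_query(query):
    """The parser splits first, so the analyzer only ever sees one word at a time."""
    tokens = []
    for chunk in query.split():
        tokens.extend(analyze(chunk))
    return tokens


print(whole_string_query("sea biscuit"))      # ['seabiscuit']
print(whitespace_split_query("sea biscuit"))  # ['sea', 'biscuit']
```

In the second case the synonym filter is called twice, once with [sea] and once with [biscuit], and neither call ever contains the full phrase.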

This problem is a constant source of surprise, relevance bugs, and user group discussion. Even moderately skilled Solr users are surprised by this behavior. The situation is not entirely hopeless. There exist numerous Solr plugins for this issue. Solr also gives you a lot of power to write just about any plugin to deeply control this yourself. Not everyone, though, wants to figure out someone else's plugin or deal with writing Java.

Solr's other problem is that it only lets you control the query-side analyzer in a field's configuration. Elasticsearch lets you control which analyzer to use at just about any level, including passing the analyzer when running the query itself! Elasticsearch's options are quite broad.

At search time, Elasticsearch looks for an analyzer in this order:
* The analyzer defined in the query itself, else
* The analyzer defined in the field mapping, else
* The default analyzer for the type, which defaults to
  * The analyzer named default in the index settings, which defaults to
  * The analyzer named default at node level, which defaults to
  * The standard analyzer
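The fallback chain in the list above amounts to "take the first level that names an analyzer." A minimal Python sketch of that logic follows; the function and its parameter names are hypothetical and only model the order described, not Elasticsearch's internals.

```python
def resolve_search_analyzer(
    query_analyzer=None,
    field_analyzer=None,
    type_default=None,
    index_default=None,
    node_default=None,
):
    """Return the first analyzer name that is set, walking the levels in order."""
    candidates = (
        query_analyzer,   # passed with the query itself
        field_analyzer,   # defined in the field mapping
        type_default,     # default analyzer for the type
        index_default,    # analyzer named "default" in the index settings
        node_default,     # analyzer named "default" at node level
    )
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return "standard"  # final fallback: the standard analyzer


print(resolve_search_analyzer())
print(resolve_search_analyzer(field_analyzer="english"))
print(resolve_search_analyzer(query_analyzer="synonyms", field_analyzer="english"))
```

An analyzer passed with the query wins over everything else, which is what makes the per-query override possible.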

This comes in very handy. Say, for example, you'd like to query one field sometimes with synonyms and other times without. Elasticsearch lets you query the same field in different ways simply by passing an analyzer argument with the query. Sometimes you'd pass in the synonym analyzer; other times you might simply use the default. Solr, on the other hand, would have you duplicate content to another field to pick up a different query analyzer.
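As a rough sketch, the two variants of the query could be built like this in Python. The `match` query's `analyzer` option is the piece doing the work; the helper `match_query`, the field name `title`, and the analyzer name `query_with_synonyms` are assumptions for illustration (the analyzer would need to be defined in your index settings).

```python
import json


def match_query(field, text, analyzer=None):
    """Build a match query body, optionally overriding the query-time analyzer."""
    clause = {"query": text}
    if analyzer is not None:
        clause["analyzer"] = analyzer  # per-query analyzer override
    return {"query": {"match": {field: clause}}}


# Same field, two behaviors: one search with synonyms, one with the defaults.
with_synonyms = match_query("title", "kitty", analyzer="query_with_synonyms")
without_synonyms = match_query("title", "kitty")

print(json.dumps(with_synonyms, indent=2))
print(json.dumps(without_synonyms, indent=2))
```

Either body can be sent to the search endpoint of the same index, with no second copy of the field required.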

In conclusion, Elasticsearch is the clear winner when it comes to matching. Elasticsearch gives you the least amount of surprise and most depth of configurability when it comes to manipulating matching. You can do quite a lot without resorting to writing plugins.



Next Time! Ranking!

Next time, we'll discuss how Solr and Elasticsearch's query APIs match up. Will Elasticsearch be dominant here as well? To complete the trilogy we'll also compare how pluggable each search engine is. Which search engine will win? :)

And of course if you need help with a tough search problem, don't hesitate to contact us!



Published at DZone with permission of Doug Turnbull, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
