Solr vs. Elasticsearch
This list will give you an idea of what to expect from Solr, what to expect from Elasticsearch, and how these expectations differ.
Solr vs. Elasticsearch. Elasticsearch vs. Solr. Which one is better? How are they different? Which one should you use?
Before we start, check out two useful Cheat Sheets to guide you through both Solr and Elasticsearch and help boost your productivity and save time when you’re working with either of these two open-source search engines.
These two are the leading, competing open-source search engines, known to anyone who has ever looked into (open-source) search. Both are built around the same underlying search library, Lucene, but they are different. Each has its own strengths and weaknesses, and each may be a better or worse fit depending on your needs and expectations. In the past, we’ve covered Solr and Elasticsearch differences in Solr Elasticsearch Comparison and in various conference talks, such as Side by Side with Elasticsearch and Solr: Performance and Scalability, given at Berlin Buzzwords. Both Solr and Elasticsearch are evolving rapidly, so, without further ado, here is up-to-date information about their top differences.
| Feature | Solr | Elasticsearch |
|---|---|---|
| Community and Developers | Apache Software Foundation and community support | Single commercial entity and its employees |
| Node Discovery | Apache Zookeeper, mature and battle tested in a large number of projects | Zen, built into Elasticsearch itself, requires dedicated master nodes to be split brain proof |
| Shard Placement | Static in nature, requires manual work to migrate shards | Dynamic, shards can be moved on demand depending on the cluster state |
| Caches | Global, invalidated with each segment change | Per segment, better for dynamically changing data |
| Analytics Engine | Facets and powerful streaming aggregations | Sophisticated and highly flexible aggregations |
| Optimized Query Execution | Currently none | Faster range queries depending on the context |
| Search Speed | Best for static data, because of caches and uninverted reader | Very good for rapidly changing data, because of per segment caches |
| Analysis Engine Performance | Great for static data with exact calculations | Exactness of the results depends on data placement |
| Full-Text Search Features | Language analysis based on Lucene, multiple suggesters, spell checkers, rich highlighting support | Language analysis based on Lucene, single suggest API implementation, highlighting rescoring |
| DevOps Friendliness | Not fully there yet, but coming | Very good APIs |
| Non-Flat Data Handling | Nested documents and parent child support | Natural support with nested and object types allowing for virtually endless nesting and parent-child support |
| Query DSL | JSON (limited), XML (limited) or URL parameters | JSON |
| Index/Collection Leader Control | Leader placement control and leader rebalancing possibility to even the load on the nodes | Not possible |
| Machine Learning | Built-in; on top of streaming aggregations focused on logistic regression and learning to rank contrib module | Commercial feature, focused on anomalies and outliers and time-series data |
| Ecosystem | Modest; Banana, Zeppelin with community support | Rich; Kibana, Grafana, with large entities support and big user base |
Now that we know what the top 15 differences are, let’s discuss each of the mentioned differences in greater detail.
Community and Developers
The first major difference between Solr and Elasticsearch is how they are developed, maintained, and supported. Solr, being a project of the Apache Software Foundation, is developed with the ASF philosophy in mind: community over code. Solr code is not always beautiful, but once a feature is there it usually stays and is not removed from the code base. Also, the committers come from different companies, and no single company controls the code base. You can become a committer if you show your interest and continued support for the project. Elasticsearch, on the other hand, is backed by a single entity: the Elastic company. The code is available under the Apache 2.0 software license and is open on GitHub, so you can take part in the development by submitting pull requests, but the community does not decide what gets into the code base and what does not. Also, to become a committer, you have to be part of the Elastic company itself.
Node Discovery
Another major difference between those two great products is node discovery. When the cluster is initially formed, when a new node joins, or when something bad happens to a node, something in the cluster has to decide, based on the given criteria, what should be done. This is one of the responsibilities of so-called node discovery. Elasticsearch uses its own discovery implementation called Zen that, for full fault tolerance (i.e., not being affected by network splits), requires three dedicated master nodes. Solr uses Apache Zookeeper for discovery and leader election. This requires an external Zookeeper ensemble, which for a fault-tolerant and fully available SolrCloud cluster needs at least three Zookeeper instances.
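The "three" in both cases follows from majority voting: a cluster can only make decisions safely while a strict majority of its voting nodes can still talk to each other. The sketch below is plain arithmetic, not code from either project. In Zen-based Elasticsearch releases this majority was typically configured through the `discovery.zen.minimum_master_nodes` setting.

```python
def quorum(voting_nodes: int) -> int:
    """Smallest strict majority of the voting nodes."""
    return voting_nodes // 2 + 1


def tolerated_failures(voting_nodes: int) -> int:
    """How many voting nodes can be lost while a majority still survives."""
    return voting_nodes - quorum(voting_nodes)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(
            f"{n} voting nodes: quorum={quorum(n)}, "
            f"tolerated failures={tolerated_failures(n)}"
        )
```

With one or two voting nodes no failure can be tolerated, which is why three master-eligible nodes (or three Zookeeper instances) is the smallest setup that survives the loss of a single node.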
Shard Placement
Generally speaking, Elasticsearch is very dynamic as far as the placement of indices and the shards they are built of is concerned. It can move shards around the cluster when a certain event happens, for example, when a new node joins or a node is removed from the cluster. We can control where shards should and shouldn’t be placed by using awareness tags, and we can tell Elasticsearch to move shards on demand using an API call. Solr, on the other hand, is a bit more static. When a Solr node joins or leaves the cluster, Solr doesn’t do anything on its own; it is up to us to rebalance the data. Of course, we can move shards, but it involves several steps: we need to create a replica, wait for it to synchronize the data, and then remove the one that we no longer need. One thing lets us automate this a bit: removing or replacing a node in SolrCloud using the Collections API, which is a quick way of removing all shards from a node or replicating them to another node. This still requires a manual API call, though; it is not done automatically.
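To illustrate the difference in effort, here is a minimal sketch that only builds the requests and does not send them. It assumes the Elasticsearch cluster reroute API (`POST /_cluster/reroute`) and the Solr Collections API actions `ADDREPLICA` and `DELETEREPLICA`. The index, collection, node, and replica names are hypothetical, and the helper functions are ours, not part of any client library.

```python
from urllib.parse import urlencode


def es_move_shard(index: str, shard: int, from_node: str, to_node: str) -> dict:
    """Body for POST /_cluster/reroute that moves one shard between nodes."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


def solr_move_replica_steps(
    collection: str, shard: str, target_node: str, old_replica: str
) -> list:
    """Collections API calls for the manual add, wait, delete sequence."""
    base = "/solr/admin/collections?"
    add = urlencode(
        {
            "action": "ADDREPLICA",
            "collection": collection,
            "shard": shard,
            "node": target_node,
        }
    )
    delete = urlencode(
        {
            "action": "DELETEREPLICA",
            "collection": collection,
            "shard": shard,
            "replica": old_replica,
        }
    )
    # Between the two calls we still have to wait until the new replica is active.
    return [base + add, base + delete]


if __name__ == "__main__":
    print(es_move_shard("logs", 0, "node-1", "node-2"))
    for url in solr_move_replica_steps(
        "products", "shard1", "host2:8983_solr", "core_node3"
    ):
        print(url)
```

Elasticsearch needs a single command, while in Solr we issue two calls and check the cluster state in between (for example with the `CLUSTERSTATUS` action) before deleting the old replica.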
Caches
Yet another big difference lies in the architecture of the two search engines. Without going deep into how the caches work in both products, we will point out just the major difference between them. Let’s start with what a segment is. A segment is a piece of a Lucene index that is built of various files, is mostly immutable, and contains data. When you index data, Lucene produces segments and can also merge multiple smaller, existing segments into larger ones in a process called segment merging. The caches in Solr are global: there is a single cache instance of a given type per shard, covering all its segments. When a single segment changes, the whole cache needs to be invalidated and refreshed. That takes time and consumes hardware resources. In Elasticsearch, caches are per segment, which means that if only a single segment changes, only a small portion of the cached data needs to be invalidated and refreshed. We will get to the pros and cons of this approach soon.
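The effect of the two invalidation strategies can be shown with a toy model. This is not how either engine implements its caches; the classes and segment names below are invented purely to count how much cached data has to be rebuilt after one segment changes.

```python
class GlobalCache:
    """One cache for the whole shard: any segment change drops everything."""

    def __init__(self, segments):
        self.entries = {seg: f"cached data for {seg}" for seg in segments}

    def segment_changed(self, segment):
        dropped = len(self.entries)
        self.entries.clear()  # the whole cache must be rebuilt
        return dropped


class PerSegmentCache:
    """One cache entry per segment: only the changed segment is dropped."""

    def __init__(self, segments):
        self.entries = {seg: f"cached data for {seg}" for seg in segments}

    def segment_changed(self, segment):
        # Entries of untouched segments stay warm.
        return 1 if self.entries.pop(segment, None) is not None else 0


if __name__ == "__main__":
    segments = ["seg_1", "seg_2", "seg_3", "seg_4"]
    print("global:", GlobalCache(segments).segment_changed("seg_4"))
    print("per segment:", PerSegmentCache(segments).segment_changed("seg_4"))
```

With frequent indexing, new segments appear all the time, so the global model keeps rebuilding everything, while the per-segment model only pays for what actually changed.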
Analytics Engine
Solr is large and has a lot of data analysis capabilities. We can start with good old facets, the first implementation that allowed users to slice and dice the data to understand it. Then came JSON facets with similar features, but faster and less memory demanding, and finally the stream-based expressions called streaming expressions, which can combine data from multiple sources (like SQL, Solr, facets) and decorate them using various expressions (sort, extract, count significant terms, etc.). Elasticsearch provides a powerful aggregations engine that can not only do one-level data analysis, like most of the Solr legacy facets, but can also nest data analysis (e.g., calculate the average price for each product category in each shop division) and supports analysis on top of aggregation results, which leads to functionality like moving average calculation. Finally, though marked as experimental, Elasticsearch provides support for matrix aggregations, which can compute statistics over a set of fields.
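As a rough illustration of the nested analysis mentioned above (average price for each product category in each shop division), here is a sketch of the request bodies, built as Python dictionaries and printed as JSON. The field names `division`, `category`, and `price` are assumptions made for the example. The first body uses Elasticsearch aggregations; the second shows how similar nesting could be expressed with Solr's JSON Facet API.

```python
import json

# Elasticsearch: terms aggregation per division, nested terms per category,
# and an avg metric on the price field at the innermost level.
es_request = {
    "size": 0,
    "aggs": {
        "divisions": {
            "terms": {"field": "division"},
            "aggs": {
                "categories": {
                    "terms": {"field": "category"},
                    "aggs": {"avg_price": {"avg": {"field": "price"}}},
                }
            },
        }
    },
}

# Solr JSON Facet API: terms facets can carry sub-facets in a "facet" block.
solr_request = {
    "query": "*:*",
    "limit": 0,
    "facet": {
        "divisions": {
            "type": "terms",
            "field": "division",
            "facet": {
                "categories": {
                    "type": "terms",
                    "field": "category",
                    "facet": {"avg_price": "avg(price)"},
                }
            },
        }
    },
}

if __name__ == "__main__":
    print(json.dumps(es_request, indent=2))
    print(json.dumps(solr_request, indent=2))
```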
Optimized Query Execution
When dealing with time-based data, range queries are very common and can become a bottleneck because of the amount of data they need to process to find the matching documents. In recent releases of Elasticsearch, for fields that have doc values enabled (like numeric fields), Elasticsearch is able to choose whether to iterate over all the documents or only match a particular set of documents. With this logic inside the search engine, Elasticsearch can provide very efficient range queries without any data modifications. Hopefully, we will see similar functionality in Solr as well.
Search Speed
Some time ago we did a few comparisons of Solr and Elasticsearch, and the results were pretty clear. Solr is awesome when it comes to static data, for example in e-commerce, because of its caches and the ability to use uninverted readers for faceting and sorting. Elasticsearch is great in rapidly changing environments, like log analysis use cases. If you want to learn more, check out the video of two of our engineers, Radu and Rafał, giving the Side-by-side with Elasticsearch and Solr Part 2: Performance and scalability talk at Berlin Buzzwords 2015.
Analysis Engine Performance and Precision
When you have mostly static data and need full precision for data analysis and blazingly fast performance, you should look at Solr. In the tests we did for some conference talks (like the mentioned Side-by-side with Elasticsearch and Solr Part 2: Performance and scalability talk at Berlin Buzzwords 2015), we saw that on static data Solr was awesome. What’s more, facets in Solr are precise and do not lose precision, which is not always true of Elasticsearch. In certain edge cases, you may find that the results of Elasticsearch aggregations are not precise because of how the data is placed in the shards.
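One way such imprecision can arise is when every shard reports only its own top terms and the coordinating node merges those partial lists. The toy simulation below assumes exactly that model; the shard contents are invented, and the functions are ours rather than part of Elasticsearch.

```python
from collections import Counter

# Hypothetical per-shard term counts. The term "c" is spread unevenly.
shards = [
    Counter({"a": 10, "b": 9, "c": 8}),
    Counter({"c": 10, "d": 9, "a": 1}),
]


def exact_counts(shards):
    """Counts we would get by looking at every term on every shard."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return total


def merged_top_n(shards, per_shard_size):
    """Counts obtained by merging only each shard's top terms."""
    merged = Counter()
    for shard in shards:
        merged.update(dict(shard.most_common(per_shard_size)))
    return merged


if __name__ == "__main__":
    print("exact:           ", dict(exact_counts(shards)))
    print("top 2 per shard: ", dict(merged_top_n(shards, 2)))
    print("top 3 per shard: ", dict(merged_top_n(shards, 3)))
```

With only two terms requested from each shard, the first shard never reports its 8 documents for `c`, so the merged count for `c` is 10 instead of the true 18. Asking each shard for more terms removes the error in this example, at the cost of more work per request.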
Full-Text Search Features
The richness of full-text search features, and of the ones close to full-text searching, is enormous when looking into the Solr code base. Our Solr training classes are chock-full of this stuff! It starts with a wide selection of request parsers, continues through various suggester implementations, and extends to the ability to correct user spelling mistakes using spell checkers and to extensive, highly configurable highlighting support. In Elasticsearch we have a dedicated suggesters API, which hides the implementation details from the user, giving us an easier way of implementing suggestions at the cost of reduced flexibility, and of course highlighting, which is less configurable than highlighting in Solr (though both are based on Lucene highlighting functionality).
DevOps Friendliness
If you were to ask a DevOps person what (s)he loves about Elasticsearch, the answer would be the API, manageability, and ease of installation. When it comes to troubleshooting, it is easy to get information about the state of Elasticsearch, from disk usage, through memory and garbage collection statistics, to Elasticsearch internals like caches, buffers, and thread pool utilization. Solr is not there yet: you can get some information from it via JMX MBeans and from the new Solr Metrics API, but this means there are a few places one must look at, and not everything is in there, though it’s getting there.
Non-Flat Data Handling
Do you have non-flat data, with lots of nested objects inside other nested objects, and you don’t want to flatten it, but just index your beautiful MongoDB JSON objects and have them ready for full-text searching? Elasticsearch will be a perfect tool for that with its support for objects, nested documents, and parent-child relationships. Solr may not be the best fit here, but remember that it also supports parent-child and nested documents, both when indexing XML documents and when indexing JSON. Also, there is one more very important thing: Solr supports query-time joins inside and across different collections, so you are not limited to index-time parent-child handling.
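The two approaches can be sketched as request fragments. The example assumes a hypothetical blog index where `comments` is mapped as a `nested` field in Elasticsearch, and a hypothetical Solr setup with posts and comments stored separately and linked through a `post_id` field. Only the `nested` query and the `{!join}` query parser come from the engines themselves; all field and collection names are made up.

```python
from urllib.parse import urlencode

# Elasticsearch: both conditions must match inside the same nested comment.
es_nested_query = {
    "query": {
        "nested": {
            "path": "comments",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"comments.author": "alice"}},
                        {"range": {"comments.stars": {"gte": 4}}},
                    ]
                }
            },
        }
    }
}

# Solr: query-time join from the comments collection to the posts being searched.
solr_join_params = urlencode(
    {"q": "{!join from=post_id to=id fromIndex=comments}author:alice"}
)

if __name__ == "__main__":
    print(es_nested_query)
    print("/solr/posts/select?" + solr_join_params)
```

The Elasticsearch version depends on how the data was indexed, while the Solr join is resolved at query time, which is the flexibility the paragraph above refers to.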
Query DSL
Let’s say it out loud: the query language of Elasticsearch is really great... if you love JSON. It lets you structure the query using JSON, so it will be well structured and give you control over the whole logic. You can mix different kinds of queries to write very sophisticated matching logic. Of course, full-text search is not everything, and you can include aggregations, results collapsing, and so on. Basically, everything that you need from your data can be expressed in the query language. Solr, on the other hand, is still using URI search, at least in its most widely used API (there is also a limited JSON API and an XML query parser available). All the parameters go into the URI, which can lead to long and complicated queries. Both approaches have their pros and cons, and novice users tend to need help with queries in both search engines.
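To compare the two styles side by side, here is a sketch of one roughly equivalent search expressed both ways: a full-text match on a title, a price filter, and a breakdown by category. The collection name `products` and the field names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Elasticsearch: one JSON document carries the query, the filter, and the aggregation.
es_body = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "search engine"}}],
            "filter": [{"range": {"price": {"gte": 10, "lte": 100}}}],
        }
    },
    "aggs": {"by_category": {"terms": {"field": "category"}}},
    "size": 10,
}

# Solr: the same intent as URL parameters on the select handler.
solr_params = [
    ("q", "title:(search engine)"),
    ("fq", "price:[10 TO 100]"),
    ("facet", "true"),
    ("facet.field", "category"),
    ("rows", "10"),
]
solr_url = "/solr/products/select?" + urlencode(solr_params)

if __name__ == "__main__":
    print(json.dumps(es_body, indent=2))
    print(solr_url)
```

The JSON body is longer but keeps its structure as the query grows, while the Solr URL is compact for simple cases and becomes harder to read once more parameters and local params are added.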
Index/Collection Leader Control
While Elasticsearch is dynamic in nature when it comes to shard placement around the cluster, it doesn’t give us much control over which shards will take the role of primaries and which will be replicas. It is beyond our control. In Solr, you have that control, which is a very good thing when you consider that during indexing the leaders do more work, because they forward the data to all their replicas. With the ability to rebalance the leaders or explicitly say where they should be placed, we can balance the load across the cluster by providing exact information about where the leader shards should be.
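In Solr this control is exposed through the Collections API. The sketch below builds the two calls involved, assuming the `ADDREPLICAPROP` action with the `preferredLeader` property followed by `REBALANCELEADERS`. The collection, shard, and replica names are hypothetical, and the helper function is ours.

```python
from urllib.parse import urlencode


def preferred_leader_calls(collection: str, shard: str, replica: str) -> list:
    """Mark one replica as preferred leader, then ask Solr to rebalance leaders."""
    base = "/solr/admin/collections?"
    mark = urlencode(
        {
            "action": "ADDREPLICAPROP",
            "collection": collection,
            "shard": shard,
            "replica": replica,
            "property": "preferredLeader",
            "property.value": "true",
        }
    )
    rebalance = urlencode({"action": "REBALANCELEADERS", "collection": collection})
    return [base + mark, base + rebalance]


if __name__ == "__main__":
    for url in preferred_leader_calls("logs", "shard1", "core_node2"):
        print(url)
```

Repeating the first call for one replica per shard, spread over different nodes, and then rebalancing is a simple way to even out the indexing load.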
Machine Learning
A trending topic about which you will hear even more in the coming months and years: machine learning. In Solr it comes for free in the form of a contrib module and on top of the streaming aggregations framework. With the additional libraries in the contrib module, you can use machine-learned ranking models and feature extraction on top of Solr, while the machine learning based on streaming aggregations is focused on text classification using logistic regression. On the other hand, we have Elasticsearch and its X-Pack commercial plugin, which comes with a plugin for Kibana that supports machine learning algorithms focused on anomaly detection and outlier detection in time series data.
Ecosystem
When it comes to the ecosystem, the tools that come with Solr are nice, but they feel modest. We have a Kibana port called Banana, which went its own way, and tools like the Apache Zeppelin integration, which allows running SQL on top of Apache Solr. Of course, there are other tools that can read data from Solr, send data to Solr, or use Solr as a data source, like Flume, for example. Most of the tools are developed and supported by a wide variety of enthusiasts. The ecosystem around Elasticsearch, by comparison, is very modern and well organized. You have a new version of Kibana with new features popping up every month. If you don’t like Kibana, you have Grafana, which is now a product of its own providing a wide variety of features, and you have a long list of data shippers and tools that can use Elasticsearch as a data source. Finally, those products are backed not only by enthusiasts but also by large commercial entities.
This is obviously not an exhaustive list of Solr and Elasticsearch differences. We could go on for several blog posts and make a book out of it, but hopefully, the above list gave you an idea of what to expect from one and the other.
Published at DZone with permission of Rafal Kuc. See the original article here.