“The reports of my death are greatly exaggerated.”
— Mark Twain (albeit misquoted)
Hey, clickbait title aside, I get it: Elasticsearch has been growing. Kudos to the project for tapping into a new set of search users and use cases like logging, where it is making inroads against the likes of Splunk in the IT logging market. However, there is another open source, Lucene-based search engine out there that is quite mature, more widely deployed, and still growing, granted without a huge marketing budget behind it: Apache Solr. Despite what others would have you believe, Solr is quite alive and well, thank you very much. And I’m not just saying that because I make a living off of Solr (which I’m happy to declare up front), but because the facts support it.
For instance, in the Google Trends arena (see below or try the query yourself), Solr continues to hold a steady recurring level of interest even while Elasticsearch has grown. Dissection of these trends (which are admittedly easy to game, so I’ve tried to keep them simple) shows Elasticsearch is strongest in Europe and Russia, while Solr is strongest in the US, China, India, Brazil, and Australia. On the DB-Engines ranking site, which factors in Google trends and other job/social metrics, you’ll see both Elasticsearch and Solr are top 15 projects, beating out a number of other databases like HBase and Hive. Solr’s mailing list is quite active (~280 messages per week compared to ~170 per week for Elasticsearch) and it continues to show strong download numbers via Maven repository statistics. Solr as a codebase continues to innovate (which I’ll cover below) as well as provide regular, stable releases. Finally, Lucene/Solr Revolution, the conference my company puts on every year, continues to set record attendance numbers.
Rather than getting into an “us vs. them” argument, however, I am going to focus the rest of this article on why you should consider Solr, especially if you haven’t looked at Solr 6.
1. Real Time, Massive Read, and Write Scalability
If you haven’t looked at Solr since its early days (pre-Solr 4), you might be surprised to see that Solr supports large-scale, distributed indexing, search, and aggregation/statistics operations, enabling it to handle applications large and small.
Solr also supports real-time updates and can handle millions of writes per second. For instance, at Lucene/Solr Revolution, Salesforce shared that they have over 500 billion complex — not just logs — documents in Solr and are doing 7 billion updates per day with sub-100-millisecond query latency. Based on a number of other talks at Revolution (Bloomberg, Microsoft, Wal-Mart, et al.) as well as knowledge of my company’s clientele, Salesforce isn’t alone in those numbers.
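To make the near-real-time story concrete, here is a minimal sketch of an update batch in Solr’s JSON format. The `commitWithin` parameter asks Solr to make documents searchable within the given number of milliseconds; the collection and field names (`mycoll`, `title_t`) are placeholders, not anything from this article.

```python
# Sketch: Solr accepts JSON document batches at its /update handler.
# The commitWithin URL parameter bounds how long until the documents
# become visible to searches. Collection/field names are hypothetical.
import json

def update_body(docs):
    """Serialize a batch of documents for Solr's JSON update format."""
    return json.dumps(docs)

docs = [{"id": "1", "title_t": "hello"}, {"id": "2", "title_t": "world"}]
body = update_body(docs)
# e.g. POST body to http://localhost:8983/solr/mycoll/update?commitWithin=1000
# with header Content-Type: application/json
```

In practice you would batch many documents per request; the soft-commit interval, not the HTTP call, is what governs search visibility.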
2. SQL and Streaming Expressions/Aggregations
Streaming expressions and aggregations provide the basis for running traditional data warehouse workloads on a search engine, with the added enhancement of basing those workloads on much more complex matching and ranking criteria than most data warehouses are capable of handling. For instance, you might define a query that runs a complex match across text, spatial, and numeric data, rolls up on location, buckets into ranges by numeric data, and then streams out the results to Tableau. Even better, you may simply hook up your favorite JDBC/ODBC client directly to Solr and write good old fashioned SQL queries to run against Solr. That functionality is still early in its support of the SQL specification, but it promises to bring the power of search-driven analytics to more people instead of trying to retrain them on new concepts.
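As a sketch of what this looks like over HTTP (beyond JDBC), Solr 6 exposes a `/sql` handler that accepts a SQL statement as the `stmt` parameter. The collection and field names here (`products`, `region`, `price`) are hypothetical.

```python
# Sketch: building a request for Solr 6's /sql handler.
# Collection and field names are made up for illustration.

def solr_sql_request(collection, stmt):
    """Return the handler path and form payload for a Solr SQL query."""
    url = f"/solr/{collection}/sql"
    payload = {"stmt": stmt}
    return url, payload

url, payload = solr_sql_request(
    "products",
    "SELECT region, avg(price) FROM products GROUP BY region",
)
# Send with any HTTP client, e.g.:
# requests.post("http://localhost:8983" + url, data=payload)
```

Under the hood, Solr compiles statements like this into the same streaming expressions discussed above, so the aggregation work is distributed across the cluster.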
3. Security Out of the Box
For many open source based product companies, security is something you pay extra for. With Solr, security is built in, integrating with systems like Kerberos, SSL, and LDAP to not only secure the system, but also the content inside of it. Add in support for graph queries (more on graph later) and you can represent very complex ACL-based (Access Control List) queries in highly dynamic environments.
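For a simple flavor of ACL filtering at query time, a common pattern is to index the groups allowed to see each document in a multi-valued field and attach a filter query restricting results to the requesting user’s groups. The field name `acl` and the groups below are hypothetical; the `{!terms}` query parser is a standard Solr parser.

```python
# Sketch: restricting results to documents the user's groups may see,
# assuming a multi-valued "acl" field in the schema (hypothetical).

def acl_filtered_params(user_query, groups):
    # {!terms f=acl} matches documents whose acl field contains
    # any of the comma-separated values.
    return {
        "q": user_query,
        "fq": "{!terms f=acl}" + ",".join(groups),
    }

params = acl_filtered_params("laptop", ["engineering", "sales"])
# → {"q": "laptop", "fq": "{!terms f=acl}engineering,sales"}
```

Because `fq` results are cached independently of the main query, the same user’s ACL filter is cheap to reapply across searches.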
4. Losing Data is for Losers
In no small part due to its longstanding role in mission-critical applications (think eCommerce on Black Friday), when Solr moved from a master-replica model to a fully distributed sharding model in Solr 4, the Solr community made a conscious decision to prioritize consistency and accuracy of results, in contrast to some other distributed approaches. In doing so, the community chose Apache Zookeeper (or as I like to call it, “Apache Give Me My Nights and Weekends Back”) to help with things like distributed coordination of resources and fault tolerance.
While this added some overhead to Solr in getting started compared to other multicast/“zero conf” solutions, it more than makes up for it on the backend in terms of making sure that data loss is minimized and that the cluster can tell when it is happy or not — all things that keep many DevOps engineers up at night when supporting other distributed systems. This is, of course, not to say that Solr is perfect here, but it has proven to stand up well compared to others in this arena. If you really like to geek out on distributed system testing, be sure to check out Solr’s fork of Kyle Kingsbury’s (aka @aphyr, the destroyer of distributed systems) work on Jepsen.
5. Cross Data Center Replication Support
Continuing on the theme of “getting your nights and weekends back”, Solr has recently added support for active-passive Cross Data Center Replication (CDCR), enabling real-world applications to synchronize indexing operations across data centers located in different parts of the world without the need for third-party systems like Apache Kafka. In Solr’s case, “passive” is often milliseconds or seconds (as opposed to minutes and hours) for low-to-medium ingest rates (which is commonly the case for content like eCommerce product catalogs but not the case for logging) and the community is working on a number of improvements to performance. Physics plays a role as well, so if CDCR is what you are after, be sure to invest in the network, too.
6. Graph Support
As I mentioned earlier, and in an effort to be fully buzzword compliant, Solr has added support for graph operations over the data that allow you to query and work with complex relationships between data and then output to formats like GraphML. Graph operations can be used to model recommendation systems (“people who bought this, bought that”), access controls in securing content, and other complex relationships where traditional search systems typically fail (“find me all customers whose supplier is AcmeCo and their contract language contains version 1 of our indemnification verbiage”). And since Solr’s graph capabilities are built on the streaming expressions capability discussed earlier, you can combine all of these features together to build very complex solutions to data analysis problems. Keep in mind, Solr’s support for graph is still up-and-coming compared to long-lived graph systems like Neo4j, but the fact that it exists at all may mean you can simplify your application by removing yet another moving part.
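As a small sketch of the traversal syntax, graph walks in Solr 6 are expressed with the `gatherNodes` streaming expression. The collection and field names below (`emails`, `from`, `to`) are illustrative, not from this article.

```python
# Sketch: a one-hop graph traversal via Solr's gatherNodes streaming
# expression, which walks from a root node along one field to gather
# connected nodes. Collection/field names are hypothetical.

def who_did_they_email(root_address):
    # Walk from the sender (matched in the "from" field) and gather
    # every address they sent to.
    return f'gatherNodes(emails, walk="{root_address}->from", gather="to")'

expr = who_did_they_email("alice@example.com")
# POST the expression to /solr/emails/stream as the "expr" parameter.
```

Traversals can be nested (a `gatherNodes` inside another `gatherNodes`) to walk multiple hops, which is how the “friends of friends” style queries are built.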
7. Solr Plays Nice With Big Data Toolchains
Many modern real-world applications consist of two parts: a query engine to serve up the best known state of the data of interest and an offline engine designed to update and enhance that best known state. For instance, when you query Google, you get results based on the last time Google crawled/accessed the data in question. It could be two minutes ago or two weeks ago. For us mere mortals, this typically means we use tools like Apache Hadoop, Apache Spark, and various other “Hadoop Stack” capabilities to handle the offline crunching of data. Solr plays nicely in this stack in a number of ways:
1. Users can store Solr’s data in HDFS.
2. Solr integrates nicely with Hadoop’s authentication approaches (typically Kerberos).
3. Solr leverages Zookeeper much like Apache HBase does, simplifying fault tolerance infrastructure.
4. The Spark-Solr project works hard to make sure Solr does what it is good at and Spark does what it is good at (this combo forms the basis of my company’s product, Lucidworks Fusion).
5. Hive, Pig, Storm, and MapReduce are also all represented with open source repositories.
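To give a feel for the Spark side of this, the Spark-Solr project exposes Solr collections through Spark’s DataFrame reader via the `zkhost`, `collection`, and `query` options. The sketch below just builds that option map; the ZooKeeper addresses and collection name are placeholders.

```python
# Sketch: assembling the option map the spark-solr DataFrame reader
# expects. Hostnames and the collection name are hypothetical.

def solr_reader_options(zkhost, collection, query="*:*"):
    """Options for spark.read.format("solr") as used by spark-solr."""
    return {"zkhost": zkhost, "collection": collection, "query": query}

opts = solr_reader_options("zk1:2181,zk2:2181/solr", "logs")
# In a live Spark session you would then do:
# df = spark.read.format("solr").options(**opts).load()
```

The point of the split is the one made above: Spark handles the heavy offline crunching, while Solr handles matching and serving, and the connector pushes queries down to Solr rather than shipping whole collections around.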
8. First Class Documentation and Support
Solr has long maintained an extensive reference guide that covers functional and operational aspects of Solr for every version. Whether you are just getting started or looking for which knobs to tune for performance, the reference guide likely covers it. If it doesn’t, there are a large number of books written on Solr as well as a healthy mailing list and Stack Overflow Q&A community. Finally, while I am biased to one in particular, there are a number of consultants and vendors who can provide professional support.
9. Kibana and Logstash Both Work with Solr
Open Source FTW! Logstash and Kibana, the popular tools for indexing and visualizing log and event data, were originally written to work with Elasticsearch, but both work with Solr as well. Logstash support is provided via a Solr plugin, and Kibana has been forked via the Banana project.
10. Solr and Machine Learning
For consumer and other high user volume sites, the core ranking function provided by Lucene is often not enough to achieve best in class relevance results. Many top search systems these days also incorporate machine learning algorithms that examine how users interact with content in search and then re-rank future searches to account for what users are most likely to care about. In search speak, this is called Learning to Rank (LTR), and Solr is actively adding capabilities to make LTR out of the box functionality. Combine this with the Spark integration mentioned earlier and you can have a smart, fast search system using best in class techniques for ranking.
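From the query side, opting into a deployed LTR model is done with a rerank query (`rq`), which rescores only the top documents from the initial match. This is a sketch; the model name `myModel` is hypothetical and assumes a model has already been trained and uploaded to Solr’s LTR plugin.

```python
# Sketch: query parameters that ask Solr's LTR plugin to re-rank the
# top N results of a query with a previously deployed model.
# The model name is a placeholder.

def ltr_rerank_params(user_query, model, rerank_docs=100):
    return {
        "q": user_query,
        "rq": f"{{!ltr model={model} reRankDocs={rerank_docs}}}",
    }

params = ltr_rerank_params("running shoes", "myModel")
# → rq = "{!ltr model=myModel reRankDocs=100}"
```

Reranking only the top documents keeps the expensive learned model off the long tail of matches, so the latency cost stays bounded as the index grows.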
Where to Next?
All of the above are technical reasons you should consider Solr, but there is also one really important intangible benefit of Solr: it is an Apache Software Foundation (ASF) project, which means it’s not just open source, but open development. Solr is backed by a committee of maintainers who work for a variety of different companies, big and small. This benefits the code as well as the development process and ensures the project is viable long past the contribution of any one person or company. To give Solr a try, check out the quickstart. Happy searching!