By Trey Grainger and Timothy Potter, authors of Solr in Action
Apache Solr, like its non-relational brethren, is optimized for a unique class of problems. Specifically, Solr is a scalable, ready-to-deploy enterprise search engine that's optimized to search large volumes of text-centric data and return results sorted by relevance. In this article, based on chapter 1 of Solr in Action, the authors explain Solr's features.
Solr has a modern, well-designed architecture that's scalable and fault tolerant. Although these are important aspects to consider if you've already decided to use Solr, you still might not be convinced that Solr is the right choice for your needs. In the next section, we describe the benefits of Solr from the perspective of different stakeholders, such as the software architect, system administrator, and CEO.
In this section, we hope to provide you with some key information to help you decide if Solr is the right technology for your organization. Let's begin by addressing why Solr is attractive to software architects.
Solr for the software architect
When evaluating new technology, software architects must consider a number of factors, but chief among those are stability, scalability, and fault-tolerance. Solr scores high marks in all three categories.
In terms of stability, Solr is a mature technology supported by a vibrant community and seasoned committers. One thing that shocks users new to Solr and Lucene is that it isn't unheard of to deploy from source code pulled directly from the trunk rather than waiting for an official release. We won't advise you either way on whether this is acceptable for your organization. We only point this out because it's a testament to the depth and breadth of automated testing in Lucene and Solr. Put simply, if you have a nightly build off trunk where all the automated tests pass, then you can be reasonably confident that the core functionality is solid.
As an architect, you're probably most curious about the limitations of Solr's approach to scalability and fault tolerance. First, you should realize that the sharding and replication features in Solr have been rewritten in Solr 4 to be robust and easier to manage. The new approach to scaling is called SolrCloud. Under the covers, SolrCloud uses Apache Zookeeper to distribute configuration across a cluster of Solr nodes and to keep track of cluster state. Here are some highlights of the new SolrCloud features in Solr:
- Centralized configuration
- Distributed indexing with no Single Point of Failure (SPoF)
- Automated fail-over to a new shard leader
- Queries can be sent to any node in a cluster to trigger a full distributed search across all shards with fail-over and load-balancing support built-in
But this isn't to say that Solr scaling doesn't have room for improvement. SolrCloud is lacking in two areas. First, not all features, such as joins, work in distributed mode. Second, the number of shards for an index is a fixed value that can't be changed without reindexing all of the documents. Even so, Solr scaling has come a long way in the past few years.
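To see why a fixed shard count can force a full reindex, consider a deliberately simplified routing scheme in which a document's shard is derived from a hash of its ID. This is only an illustrative sketch and does not reproduce SolrCloud's actual router: the `shard_for` and `moved_documents` helpers, the use of CRC32, and the `listing-N` IDs are all assumptions made for the example.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document ID (illustrative only)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def moved_documents(doc_ids, old_shards, new_shards):
    """Return the IDs whose shard assignment changes when the shard count changes."""
    return [
        doc_id
        for doc_id in doc_ids
        if shard_for(doc_id, old_shards) != shard_for(doc_id, new_shards)
    ]


if __name__ == "__main__":
    ids = ["listing-%d" % n for n in range(1000)]  # hypothetical document IDs
    moved = moved_documents(ids, old_shards=2, new_shards=3)
    print("%d of %d documents would land on a different shard" % (len(moved), len(ids)))
```

Under this kind of scheme, changing the shard count changes where many existing documents belong, which is why the data has to be redistributed by reindexing.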
Solr for the system administrator
As a system administrator, high among your priorities in adopting a new technology like Solr is whether it fits into your existing infrastructure. The easy answer is yes, it does. As Solr is Java based, it runs on any OS platform that has a J2SE 6.x/7.x JVM. Out of the box, Solr embeds Jetty, the open source Java servlet engine provided by Eclipse. Otherwise, Solr is a standard Java web application that deploys easily to any Java web application server like JBoss and Oracle AS.
All access to Solr can be done via HTTP and Solr is designed to work with caching HTTP reverse proxies like Squid and Varnish. Solr also works with JMX so you can hook it up to your favorite monitoring application, such as Nagios.
Lastly, Solr provides a nice administration console for checking configuration settings, viewing statistics, issuing test queries, and monitoring the health of SolrCloud. Figure 1 provides a screen shot of the Solr 4 administration console.
Figure 1 Screen shot of Solr 4 administration console where you can send test queries, ping the server, view configuration settings, and see how your shards and replicas are distributed in a cluster.
Solr for the CEO
Although it's unlikely that a CEO will be reading this book, here are some key talking points about Solr in case your CEO stops you in the hall. First, executive types like to know that an investment in a technology today is going to pay off in the long term. With Solr, you can emphasize that many companies are still running on Solr 1.4, which was released in 2009; this means Solr has a successful track record and is constantly being improved.
Also, CEOs like technologies that are predictable. As you'll see in the next chapter, Solr "just works" and you can have it up and running in minutes. Another concern is what happens if the Solr guy walks out the door: will business come to a halt? It's true that Solr is complex technology, but having a vibrant community behind it means you have help when you need it. And you have access to the source code, which means if something is broken and you need a fix, you can do it yourself. Many commercial service providers can also help you plan, implement, and maintain your Solr installation, and many of them offer training courses for Solr.
This may be one for the CFO, but Solr doesn't require much initial investment to get started. Without knowing the size and scale of your environment, we're confident in saying that you can start up a Solr server in a few minutes and be indexing documents quickly. A modest server running in the cloud can handle millions of documents and many queries with sub-second response times.
Finally, let's do a quick rundown of Solr's main features organized around the following categories:
- User experience
- Data modeling
- New features in Solr 4
Next, we'll talk about how Solr helps make your users happy.
User experience features
Solr provides a number of important features that help you deliver a search solution that's easy to use, intuitive, and powerful. But you should note that Solr only exposes a REST-like HTTP API and doesn't provide search-related UI components in any language or framework. You'll have to roll up your sleeves and develop your own search UI components that take advantage of some of the following user experience features:
- Pagination and sorting
- Spell checking
- Hit highlighting
- Geo-spatial search
Pagination and sorting
Rather than returning all matching documents, Solr is optimized to serve paginated requests where only the top N documents are returned on the first page. If users don't find what they're looking for on the first page, then you can request subsequent pages using simple API request parameters. Pagination helps with two key outcomes: results are returned more quickly because each request only returns a small subset of the entire search results, and you can track how many queries result in requests for more pages, which may be an indication of a problem in relevance scoring.
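As a rough sketch of what such paginated requests can look like, the following helper builds a select URL using Solr's `start`, `rows`, and `sort` query parameters. The host, port, and `collection1` core name in `SOLR_SELECT_URL`, as well as the `page_url` helper itself, are assumptions for illustration; adjust them to your installation.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; adjust host, port, and core name to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def page_url(query, page, page_size=10, sort="score desc"):
    """Build a select URL for a 1-based page number using start/rows."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = [
        ("q", query),
        ("start", (page - 1) * page_size),  # offset of the first result to return
        ("rows", page_size),                # number of results per page
        ("sort", sort),
        ("wt", "json"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Third page of ten results for a keyword search.
    print(page_url("highlands", page=3))
```

Counting how often `page` is greater than 1 in your application logs is one simple way to track how many queries lead to requests for more pages.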
Faceting provides users with tools to refine their search criteria and discover more information by categorizing search results into sub-groups using facets. Search results from a basic keyword search may be organized into three facets: features, home style, and listing type. Solr faceting is one of the more popular and powerful features available in Solr.
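A minimal sketch of a facet request for the three facets mentioned above might look like the following. The `facet` and `facet.field` parameters are standard Solr request parameters, but the field names (`features`, `home_style`, `listing_type`) and both helper functions are hypothetical. The second helper assumes the facet counts arrive as one flat array of alternating values and counts, which is how Solr's JSON responses commonly list them.

```python
from urllib.parse import urlencode


def facet_query_string(query, facet_fields):
    """Build a query string that asks for facet counts on each given field."""
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


def pair_facet_counts(flat):
    """Turn ['ranch', 12, 'condo', 7] into {'ranch': 12, 'condo': 7}."""
    return dict(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    # Field names are hypothetical; they must exist in your schema.
    print(facet_query_string("highlands", ["features", "home_style", "listing_type"]))
    print(pair_facet_counts(["ranch", 12, "condo", 7]))
```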
Most users will expect your search application to "do the right thing" even if they provide incomplete information. Auto-suggest helps users by allowing them to see a list of suggested terms and phrases based on documents in your index. Solr's auto-suggest feature allows a user to start typing a few characters and receive a list of suggested queries as they type. This reduces the number of incorrect queries, particularly because many users may be searching from mobile devices with small keyboards.
Auto-suggest gives users examples of terms and phrases available in the index. As a user types "hig…" in a real estate search, Solr's auto-suggestion feature can return suggestions like "highlands neighborhood" or "highlands ranch."
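The idea of suggesting indexed terms from a typed prefix can be illustrated with a toy prefix lookup over a sorted term list. This is not how Solr implements auto-suggest; the `suggest` helper and the sample `TERMS` are assumptions made purely to show the behavior.

```python
from bisect import bisect_left

# Hypothetical terms and phrases taken from an index, kept in sorted order.
TERMS = sorted([
    "hardwood floors",
    "high ceilings",
    "highlands neighborhood",
    "highlands ranch",
    "hilltop",
])


def suggest(prefix, sorted_terms, limit=5):
    """Return up to `limit` terms that start with `prefix` (terms must be sorted)."""
    prefix = prefix.lower()
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order means no later term can match
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


if __name__ == "__main__":
    print(suggest("highl", TERMS))
```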
In the age of mobile devices and people on the go, spell-correction support is essential. Again, users expect to be able to type misspelled words into the search box and the search engine should handle it gracefully. Solr's spellchecker supports two basic modes:
- Auto-correct: Solr can make the spell correction automatically based on whether the misspelled term exists in the index.
- Did you mean: Solr can return a suggested query that might produce better results so that you can display a hint to your users, such as "Did you mean highlands?" if your user typed in "hilands."
Spell correction was revamped in Solr 4 to be easier to manage and maintain.
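The "did you mean" mode can be approximated with a toy example that compares the typed term against terms known to be in the index. This sketch uses Python's `difflib` string similarity rather than Solr's spellchecker, and the `did_you_mean` helper, the sample terms, and the 0.8 cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches


def did_you_mean(term, index_terms, cutoff=0.8):
    """Suggest the closest indexed term, or None if the term is indexed or nothing is close."""
    if term in index_terms:
        return None  # the term exists in the index, so no correction is needed
    matches = get_close_matches(term, index_terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    indexed = ["highlands", "ranch", "condo"]  # hypothetical index terms
    suggestion = did_you_mean("hilands", indexed)
    if suggestion:
        print("Did you mean %s?" % suggestion)
```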
When searching documents that have a significant amount of text, you can display specific sections of each document using Solr's hit highlighting feature. Most useful for longer format documents, hit highlighting helps users find relevant documents by highlighting sections of search results that match the user's query. Sections are generated dynamically based on their similarity to the query.
Geographical location is a first-class concept in Solr 4 in that it has built-in support for indexing latitude and longitude values as well as sorting or ranking documents by geographical distance. Solr can find and sort documents by distance from a geo-location (latitude and longitude). In the real estate example, matching listings are displayed on an interactive map where users can zoom in/out and move the map center point to find nearby listings using geo-spatial search.
Another exciting addition to Solr 4 is that you can index geographical shapes such as polygons, which allows you to find documents that intersect geographical regions. This might be useful for finding home listings in specific neighborhoods using a precise geographical representation of a neighborhood.
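To make the distance-based sorting described in this section concrete, here is a small sketch that orders listings by great-circle distance from a map center point using the haversine formula. It illustrates the concept only and is not Solr's implementation; the helper names, the sample listings, and their approximate coordinates are assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def sort_by_distance(listings, center):
    """Return listings ordered from nearest to farthest from center (lat, lon)."""
    return sorted(
        listings,
        key=lambda doc: haversine_km(center[0], center[1], doc["lat"], doc["lon"]),
    )


if __name__ == "__main__":
    center = (39.7392, -104.9903)  # approximate map center (Denver)
    listings = [  # hypothetical listings with approximate coordinates
        {"id": "colorado-springs", "lat": 38.8339, "lon": -104.8214},
        {"id": "boulder", "lat": 40.0150, "lon": -105.2705},
        {"id": "highlands-ranch", "lat": 39.5539, "lon": -104.9689},
    ]
    for doc in sort_by_distance(listings, center):
        print(doc["id"])
```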
Data modeling features
Solr is optimized to work with specific types of data. In this section, we provide an overview of key features that help you model data for search, including:
- Field collapsing/grouping
- Flexible query support
- Importing rich document formats like PDF and Word
- Importing data from relational databases
- Multilingual support
Although Solr requires a flat, denormalized document, Solr allows you to treat multiple documents as a group based on some common property shared by all documents in a group. Field grouping, also known as field collapsing, allows you to return unique groups instead of individual documents in the results.
The classic example of field collapsing is threaded email discussions where emails matching a specific query could be grouped under the original email message that started the conversation.
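The threaded email example can be sketched in a few lines of client-side code that collapse a relevance-ordered result list into one entry per thread. In Solr itself, grouping is requested with query parameters such as `group=true` and `group.field`; the `collapse_by_field` helper, the `thread_id` field, and the sample documents here are hypothetical and only illustrate the shape of the result.

```python
def collapse_by_field(docs, field, per_group=1):
    """Group docs (assumed already sorted by relevance) by a shared field value."""
    groups = {}
    for doc in docs:
        # dicts keep insertion order, so groups stay ranked by their best document
        groups.setdefault(doc[field], []).append(doc)
    return [
        {"groupValue": value, "docs": members[:per_group]}
        for value, members in groups.items()
    ]


if __name__ == "__main__":
    emails = [  # hypothetical search results, best match first
        {"id": "m3", "thread_id": "t1", "subject": "Re: open house"},
        {"id": "m7", "thread_id": "t2", "subject": "Price change"},
        {"id": "m1", "thread_id": "t1", "subject": "Open house"},
    ]
    for group in collapse_by_field(emails, "thread_id"):
        print(group["groupValue"], [d["id"] for d in group["docs"]])
```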
Powerful and flexible query support
Solr provides a number of powerful query features including:
- Conditional logic using AND, OR, NOT
- Wildcard matching
- Range queries for dates and numbers
- Phrase queries with slop to allow for some distance between terms
- Fuzzy string matching
- Regular expression matching
- Function queries
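The query features listed above can be illustrated with one example query string per feature, written in the Lucene/Solr query syntax. All field names (`city`, `style`, `status`, `neighborhood`, `listed`, `price`, `description`, `sqft`) are hypothetical and would need to exist in your schema; the `encode_query` helper is likewise just an assumption for the sketch.

```python
from urllib.parse import urlencode

# One example per feature; every field name here is hypothetical.
EXAMPLE_QUERIES = {
    "boolean": "city:denver AND (style:ranch OR style:bungalow) NOT status:sold",
    "wildcard": "neighborhood:high*",
    "date_range": "listed:[2013-01-01T00:00:00Z TO NOW]",
    "number_range": "price:[200000 TO 350000]",
    "phrase_slop": 'description:"mountain view"~3',  # terms within 3 positions
    "fuzzy": "neighborhood:hilands~",
    "regex": "neighborhood:/high(lands|point)/",
    "function": "{!func}div(price,sqft)",            # score by price per square foot
}


def encode_query(query):
    """URL-encode a query string as the q parameter of a request."""
    return urlencode({"q": query})


if __name__ == "__main__":
    for name, query in EXAMPLE_QUERIES.items():
        print("%-12s %s" % (name, encode_query(query)))
```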
In SQL, you use a JOIN to create a relation by pulling data from two or more tables together using a common property such as a foreign key. But in Solr, joins are more like sub-queries in SQL in that you don't build documents by joining data from other documents. For example, with Solr joins, you can return child documents of parents that match your search criteria. One example where Solr joins are useful would be grouping all retweets of a Twitter message into a single group.
Document clustering allows you to identify groups of documents that are similar, based on the terms present in each document. This is helpful to avoid returning many documents containing the same information in search results. For example, if your search engine is based on news articles pulled from multiple RSS feeds, then it's likely that you'll have many documents for the same news story. Rather than returning multiple results for the same story, you can use clustering to pick a single representative story.
Importing common document formats like PDF and Word
In some cases, you may want to take a bunch of existing documents in common formats like PDF and Microsoft Word and make them searchable. With Solr this is easy because it integrates with the Apache Tika project that supports most popular document formats.
Importing data from relational databases
If the data you want to search with Solr is in a relational database, then you can configure Solr to create documents using a SQL query.
Solr and Lucene have a long history of working with multiple languages. Solr has language detection built-in and provides language-specific text analysis solutions for many languages.
New features in Solr 4
Let's look at a few of the exciting new features in Solr 4. In general, version 4 is a huge milestone for the Apache Solr community as it addresses many of the major pain-points discovered by real users over the past several years. We selected a few of the main features to highlight:
- Near-real-time search
- Atomic updates with optimistic concurrency
- Real-time get
- Write durability using a transaction log
- Easy sharding and replication using Zookeeper
Solr's Near-Real-Time (NRT) search feature supports applications that have a high velocity of documents that need to be searchable within seconds of being added to the index. With NRT, you can use Solr to search rapidly changing content sources such as breaking news and social networks.
Atomic updates with optimistic concurrency
The atomic update feature allows a client application to add, update, delete, and increment fields on an existing document without having to resend the entire document. For example, if the price of a home in our example real estate application changes, then we can send an atomic update to Solr to change the price field specifically.
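As a sketch of what such a request body can look like, the helper below builds a JSON atomic update using the `set`, `inc`, and `add` modifiers. The `atomic_update` function, the `id` unique key field, and the `price`, `views`, and `tags` field names are assumptions for illustration; the payload would typically be posted to Solr's update handler.

```python
import json


def atomic_update(doc_id, set_fields=None, inc_fields=None, add_fields=None):
    """Build a JSON body that changes only the named fields of one document."""
    doc = {"id": doc_id}  # assumes the unique key field is named "id"
    for name, value in (set_fields or {}).items():
        doc[name] = {"set": value}   # replace the field's value
    for name, value in (inc_fields or {}).items():
        doc[name] = {"inc": value}   # increment a numeric field
    for name, value in (add_fields or {}).items():
        doc[name] = {"add": value}   # append to a multi-valued field
    return json.dumps([doc])


if __name__ == "__main__":
    # Change only the price of a hypothetical listing.
    print(atomic_update("listing-42", set_fields={"price": 325000}))
```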
You might be wondering what happens if two different users attempt to change the same document concurrently. In this case, Solr guards against incompatible updates using optimistic concurrency. In a nutshell, Solr uses a special version field named _version_ to enforce safe update semantics for documents. In the case of two different users trying to update the same document concurrently, the user that submits updates last will have a stale version field so their update will fail.
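The version check can be illustrated with a toy in-memory store. This is a simplified model rather than Solr's implementation: the `VersionedStore` and `VersionConflict` names are hypothetical, and the store uses a simple counter as the version, whereas the values Solr assigns to `_version_` should be treated as opaque.

```python
class VersionConflict(Exception):
    """Raised when an update carries a stale version."""


class VersionedStore:
    """Toy in-memory store that mimics a version check on update."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, fields):
        """Store a new document and return its initial version."""
        self._docs[doc_id] = {"fields": dict(fields), "version": 1}
        return 1

    def get(self, doc_id):
        """Return a copy of the fields plus the current version."""
        entry = self._docs[doc_id]
        return dict(entry["fields"]), entry["version"]

    def update(self, doc_id, changes, expected_version):
        """Apply changes only if the caller saw the latest version."""
        entry = self._docs[doc_id]
        if entry["version"] != expected_version:
            raise VersionConflict("stale version for %s" % doc_id)
        entry["fields"].update(changes)
        entry["version"] += 1
        return entry["version"]


if __name__ == "__main__":
    store = VersionedStore()
    store.add("listing-42", {"price": 300000})
    _, seen_by_a = store.get("listing-42")
    _, seen_by_b = store.get("listing-42")
    store.update("listing-42", {"price": 310000}, seen_by_a)  # first writer wins
    try:
        store.update("listing-42", {"price": 290000}, seen_by_b)
    except VersionConflict as err:
        print("second update rejected:", err)
```

In this model, the client whose update is rejected would re-read the document, pick up the new version, and decide whether to retry.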
Solr is a NoSQL technology. Solr's real-time get feature definitely fits within the NoSQL approach by allowing you to retrieve the latest version of a document using its unique identifier regardless of whether that document has been committed to the index. This is similar to using a key-value store like Cassandra to retrieve data using a row key.
Prior to Solr 4, a document wasn't retrievable until it was committed to the Lucene index. With the real-time get feature in Solr 4, you can safely decouple the need to retrieve a document by its unique ID from the commit process. This can be useful if you need to update an existing document after it's sent to Solr without having to do a commit first.
When a document is sent to Solr for indexing, it's written to a transaction log to prevent data loss in the event of server failure. Solr's transaction log sits between the client application and the Lucene index. It also plays a role in servicing real-time get requests as documents are retrievable by their unique identifier regardless of whether they're committed to Lucene.
The transaction log allows Solr to decouple update durability from update visibility. This means that documents can be on durable storage but aren't visible in search results yet. This gives your application control over when to commit documents to make them visible in search results without risking data loss if a server fails before you commit.
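The separation between durability and visibility, and the way real-time get reads uncommitted documents, can be modeled with a small toy class. This is a conceptual sketch only: the `ToyIndex` class is hypothetical, and plain in-memory structures stand in for the on-disk transaction log and the Lucene index.

```python
class ToyIndex:
    """Toy model of update durability versus search visibility."""

    def __init__(self):
        self._tlog = []        # stands in for the durable log of every update
        self._pending = {}     # latest uncommitted version of each document
        self._searchable = {}  # documents visible to search

    def add(self, doc):
        """Record the update in the log first; it is not yet searchable."""
        self._tlog.append(dict(doc))
        self._pending[doc["id"]] = dict(doc)

    def commit(self):
        """Make all pending documents visible in search results."""
        self._searchable.update(self._pending)
        self._pending.clear()

    def search_ids(self):
        """Return the IDs of documents visible to search."""
        return sorted(self._searchable)

    def realtime_get(self, doc_id):
        """Return the latest version of a document, committed or not."""
        if doc_id in self._pending:
            return self._pending[doc_id]
        return self._searchable.get(doc_id)


if __name__ == "__main__":
    index = ToyIndex()
    index.add({"id": "listing-42", "price": 300000})
    print(index.search_ids())                 # not visible before commit
    print(index.realtime_get("listing-42"))   # but retrievable by ID
    index.commit()
    print(index.search_ids())
```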
Easy sharding and replication with Zookeeper
If you're new to Solr, then you may not be aware that scaling previous versions of Solr was a cumbersome process at best. With SolrCloud, scaling is simple and automated because Solr uses Apache Zookeeper to distribute configuration and manage shard leaders and replicas. The Apache website (zookeeper.apache.org) describes Zookeeper as a "centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services."
In Solr, Zookeeper is responsible for assigning shard leaders and replicas and keeps track of which servers are available to service requests. SolrCloud bundles Zookeeper so you don't need to do any additional configuration or setup to get started with SolrCloud.
Solr is optimized to handle data that's text-centric, read-dominant, document-oriented, and has a flexible schema. Search engines like Solr aren't general-purpose data storage and processing solutions but are intended to power keyword search, ranked retrieval, and information discovery. Solr builds upon Lucene to add declarative index configuration and web services based on HTTP, XML, and JSON. Solr 4 can be scaled in two dimensions, using sharding and replication, to support millions of documents and high query traffic. Solr 4 has no single points of failure. In this article, we touched on some of Solr's main features.