Guerilla Search with Solr - How to run a 3 million document search on a $15/month machine.
Gwittr is a Twitter search and stats site that provides an extended search of your tweets and their linked web pages, as well as profiling statistics. This article highlights the challenges and options involved in running a medium-to-large-scale search (over 3 million documents) on very cheap (< $15/month) machines. The information came out of building it, and boils down to a few points:
- Throwing the problem at the cloud is neither cheap nor necessarily the solution.
- Avoid overpaying for unnecessary storage space.
- Understand your document fields definition and optimise for your search needs.
- Craft your queries with love and care.
- Master your commit strategy.
Throw the Problem at the cloud?
In this age of Cloud computing and EaaS (Everything as a Service), it’s very tempting for companies whose products require search features to just use hosted search services. A few cents a second sounds trivial, but as the system scales it can easily add up to monthly bills in the hundreds or thousands of dollars.
A way to avoid those costs is to run your own Solr installation on vanilla hardware or virtual boxes. Not only will it save you a great deal of money, but you will also gain valuable search engine skills and knowledge that you can use to limit your spending if you move to another search platform.
At Gwittr, we run plain hand-rolled Solr instances on very affordable boxes, and we can still output fairly advanced stats about our data without any significant sluggishness. Here are a few principles we follow.
Search is not storage
A search engine like Solr is not a database store. It gives you indices on steroids. If you forget that and treat your search indices as your primary storage, then you risk:
- Data loss. Although Solr does implement some data integrity techniques, persistence is not a strong property of those systems.
- Storage costs. For streaming data search like Gwittr, the storage dedicated to search will grow rapidly. If you’re using SaaS, that means you will end up paying significant money for storing your data in Solr as well as in your primary storage.
- Loss of agility. Re-indexing to support new features is inevitable. If you don’t plan for this then you will lose a lot of your release agility.
Optimization #1: Consider your search indices to be a disposable, easily rebuildable resource. Your application will definitely have to re-index everything from time to time when you introduce new features.
Make all the fields in your schema non-stored by default. It’s perfectly fine to use features like faceting on non-stored fields. The main valid reason to store a document field in Solr is the highlighting feature, as Solr needs the original text of your document to output highlighted snippets. You will also want to store a couple more things, such as your document identifier(s), as you will probably need them in your application code to link your search results back to your primary storage.
Solr also provides quite an extensive set of field indexing options that will help you reduce your index footprint even further.
Browsing VS Searching.
Although Solr, Lucene and the rest of the family are marketed as ‘Search Engines’, it’s probably more correct to say that they are very good browsing engines (with faceting being a strong selling point) with excellent full text search capabilities compared to what you would get from an open source database system. If you look at how your user experience is designed (and at how web crawlers will see your site), and unless you are Google, you’ll probably find that most of the time your users click around on your navigational features (facets, similar documents, and so on) after making their first keyword search. At least that’s the case on Gwittr, where visitors can see all results and drill down through them without entering any search keyword.
Optimization #2: For your “Browsing” related queries, it’s always better to use Solr’s filters instead of stuffing everything in your “q” parameter. Solr filtered document sets are cached and they stay away from any relevance scoring computations, so using them for your browsing queries will save you valuable I/O and CPU cycles.
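As a rough illustration of Optimization #2, here is a minimal Python sketch that builds the request parameters for a browsing query. It keeps the keywords (or a match-all `*:*`) in `q` and puts each navigational constraint in its own `fq`, so Solr can cache each filter separately. The helper name and the `lang` and `author` field names are hypothetical, not taken from Gwittr’s schema.

```python
from urllib.parse import urlencode

def browse_params(keywords=None, filters=()):
    """Build Solr query parameters that keep browsing constraints in fq."""
    # With no keywords we are purely browsing, so match every document.
    params = [("q", keywords if keywords else "*:*")]
    for field, value in filters:
        # Quote the value so spaces and punctuation stay inside one term.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        # One fq per constraint: each filter is cached on its own.
        params.append(("fq", f'{field}:"{escaped}"'))
    return params

# Hypothetical field names, for illustration only.
params = browse_params(filters=[("lang", "en"), ("author", "jdoe")])
query_string = urlencode(params)
print(query_string)
```

Splitting the constraints into several `fq` parameters rather than one combined expression means a filter such as `lang:"en"` can be reused by every other query that shares it.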
Also, search engines are not meant to display pages of results very deep in your matching set: the further you go, the more temporary memory is required and the slower it gets. For instance, Google stops showing results after roughly the first thousand, or even earlier.
Optimization #3: Implement pagination limits in your application.
Optimization #4: Request only the fields you need to display your results, thus minimizing I/O and bandwidth.
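Optimizations #3 and #4 can be sketched together in a few lines of Python. The cap of 50 pages, the page size and the `id` and `title` field names are assumptions made for the example; `start`, `rows` and `fl` are the standard Solr parameters for paging and field selection.

```python
MAX_PAGE = 50    # hypothetical cap; tune it to your own product
PAGE_SIZE = 20   # hypothetical page size

def page_params(page, fields=("id", "title")):
    """Return Solr paging parameters, refusing to go past MAX_PAGE."""
    if page < 1:
        page = 1
    if page > MAX_PAGE:
        # Deep pages are expensive for Solr, so the application says no.
        raise ValueError(f"page {page} is beyond the limit of {MAX_PAGE}")
    return {
        "start": (page - 1) * PAGE_SIZE,  # offset of the first result
        "rows": PAGE_SIZE,                # number of results to return
        "fl": ",".join(fields),           # only the fields we will display
    }

print(page_params(3))
```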
Solr Commit is not RDBMS Commit.
With databases, we use transactions and commits concurrently all the time, because it’s the right way to enforce data integrity when an update involves more than one row or table. A commit means “our view of the data has reached a consistent state, please propagate this state change to the rest of the world”. In Solr, “committing” has very different semantics.
As you most probably know by now, there is no such thing as “Update”, “Data integrity foreign keys” or “Multiple tables” in Solr. At heart, Solr/Lucene just manages an ever-growing collection of documents in their indexed form. Every time you add, update or delete a collection of documents, Solr adds a new “Segment” (a bunch of files) to its data directory. Eventually the number of segments grows large. There is a mechanism to counteract that (segment merging), but that’s not the point here.
In Solr, all the search queries are handled by a Searcher object. A Searcher is built on top of the collection of segments the index is composed of. What a commit means in this context is simply: “Please Solr, build a new Searcher that includes the fresh new segments, and atomically replace the current Searcher with it”.
Don’t step on my toes.
Optimization #5: Avoid committing to Solr concurrently at all costs, as you would just keep building new Searchers only to throw them away a second later. In fact, building Searchers concurrently is so bad that Solr’s configuration has an explicit setting (maxWarmingSearchers) to hard cap their number. The default is 2. So if you commit concurrently, you will most likely get nice exception stack traces complaining about too many open searchers.
Optimization #6: Monitor the time it takes to build a new Searcher. Optimising Solr’s responsiveness to new or updated documents (hipsters call that “Real Time”) boils down to minimising the time it takes to build a new Searcher object. Here is a tip: monitor your Solr logs, grepping for “event=newSearcher”, and look at the QTime (Query Time) of those lines. Your goal is to make this time as short as reasonably possible (we will see later why ‘reasonably’ is important here): the faster it is to build a new Searcher, the more often you can do it, and the more responsive your search becomes to inserts, updates and deletes.
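The grepping described in Optimization #6 can be automated with a short Python sketch like the one below. It assumes each relevant log line contains both the text `event=newSearcher` and a `QTime=<milliseconds>` field; the sample lines are simplified and made up for the example, so check them against what your own Solr version actually logs.

```python
import re

QTIME = re.compile(r"\bQTime=(\d+)")

def new_searcher_times(lines):
    """Yield QTime values (ms) from log lines mentioning event=newSearcher."""
    for line in lines:
        if "event=newSearcher" in line:
            match = QTIME.search(line)
            if match:
                yield int(match.group(1))

# Made-up, simplified log lines for illustration only.
sample = [
    "INFO [core1] path=null params={event=newSearcher&q=*:*} status=0 QTime=842",
    "INFO [core1] path=/select params={q=solr} status=0 QTime=3",
    "INFO [core1] path=null params={event=newSearcher&q=lang:en} status=0 QTime=1310",
]

times = list(new_searcher_times(sample))
print("slowest:", max(times), "ms; average:", sum(times) / len(times), "ms")
```

In practice you would pass an open log file instead of `sample`, since a file object also iterates line by line.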
There are two main strategies for issuing commits in Solr. The first one, and probably the one you should try first, is to let Solr do it at regular intervals. It’s called Autocommit, and it’s great as it relieves your application from managing commits. In fact, if you use Autocommit, it becomes a very bad idea to let your application issue commits itself. Remember the cap on overlapping Searchers. It applies to Autocommit Searchers too, so make your Autocommit interval longer than your Searcher building time. One argument against auto-committing at regular intervals is that when there are no updates to your index, regularly building new Searchers is just a waste of CPU. That points us to the second strategy:
Optimization #7: Let your application commit as needed, when needed. Just keep in mind that concurrent commits are a bad idea and implement a global locking mechanism. Then you should be just fine.
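One possible shape for the locking in Optimization #7 is sketched below in Python. It is only a sketch under stated assumptions: `CommitGate` is a hypothetical helper, `do_commit` stands for whatever function actually sends the commit to Solr, and the lock only protects threads inside a single process. If several processes or machines can commit, the same idea needs a lock they all share (a lock file or a database lock, for instance).

```python
import threading

class CommitGate:
    """Run at most one commit at a time and skip commits made redundant."""

    def __init__(self, do_commit):
        self._do_commit = do_commit            # callable that sends the commit to Solr
        self._commit_lock = threading.Lock()   # held while a commit is in flight
        self._ticket_lock = threading.Lock()   # protects the request counter
        self._requested = 0                    # number of commit requests so far
        self._committed = 0                    # highest request covered by a commit

    def commit(self):
        """Return True if this call ran a commit, False if another one covered it."""
        with self._ticket_lock:
            self._requested += 1
            ticket = self._requested
        with self._commit_lock:
            if self._committed >= ticket:
                # A commit that started after our request already included our changes.
                return False
            with self._ticket_lock:
                covering = self._requested     # every request made before we start
            self._do_commit()
            self._committed = covering
            return True

# Usage with a stand-in commit function.
sent = []
gate = CommitGate(lambda: sent.append("commit"))
gate.commit()
print(len(sent))
```

Callers that arrive while a commit is running simply wait for it, then either run their own or return immediately if a later commit already covered them, so Solr never sees two commits overlapping from this process.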
Blowing hot and cold.
Now you might think “oh well, how slow can it be to build a new Searcher with just one more segment? Surely Solr is written well enough for this to be very fast”. You are right. It is very fast.
The only problem is that the first few queries on this Searcher will be very slow. And this is bad. In a high volume search context, a few sluggish queries are all it takes to bring your product to its knees, as resource starvation kicks in across your application layers. The reason these first queries are slow is that a fresh Searcher’s caches are not populated with anything useful. In Solr terms, this is called a ‘Cold searcher’. Solr allows you to use cold Searchers, but fortunately only when no other Searcher is registered. That means it happens only on freshly started instances of Solr. For all the other cases, Solr provides some mechanisms to warm the Searchers up so they are nice and hot when they are promoted to a request-serving role.
Optimization #8: There are two sets of settings influencing the warming up of a new Searcher, and you should use a combination of both.
- One is to set Solr to issue queries against the warming Searcher. One idea for these queries is to sample a few typical queries from your live application and make them a bit more general, for instance by removing a filter. It is important to include most of the facets you will be using in your application. You can also issue a few keyword queries, as this would load the full text indices in memory if there is enough space.
- Another way to warm new Searchers is to set-up autowarming on Caches. Cache autowarming is simply a way to reuse values from old caches to pre-populate values in your warming Searcher’s caches.
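The first bullet’s idea of sampling live queries and making them more general can be sketched in Python as follows. The function name, the dictionary layout and the field names are all hypothetical, and choosing to drop the last filter is just one simple way to broaden a query. The resulting list would then be copied into the warming queries of your Solr configuration.

```python
def generalise(sampled_queries, limit=5):
    """Turn sampled live queries into a short list of broader warming queries."""
    seen, warming = set(), []
    for query in sampled_queries:
        broader = dict(query)
        filters = list(broader.get("fq", []))
        if filters:
            filters.pop()  # drop one filter to widen the matching set
        broader["fq"] = filters
        # Keep the facets: warming them is the main point of these queries.
        facets = tuple(broader.get("facet.field", []))
        key = (broader.get("q"), tuple(filters), facets)
        if key not in seen:
            seen.add(key)
            warming.append(broader)
        if len(warming) == limit:
            break
    return warming

# Hypothetical sample of live queries.
sampled = [
    {"q": "*:*", "fq": ["lang:en", "author:jdoe"], "facet.field": ["lang", "author"]},
    {"q": "*:*", "fq": ["lang:en", "author:asmith"], "facet.field": ["lang", "author"]},
    {"q": "solr", "fq": [], "facet.field": ["lang"]},
]

for warm_query in generalise(sampled):
    print(warm_query)
```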
The key to warming Searchers is finding the right balance between the time it takes to build a new Searcher (remember it can be almost instant - but dangerous) and the amount of slowdown you can afford when your application hits a freshly registered Searcher. Finding this sweet spot requires experimenting, as it all depends on what your application layer requires and is able to stomach.
With a deep enough knowledge of a search product, and some fun experimenting, it’s perfectly possible to squeeze a lot of performance from cheap hardware, avoiding the costs involved in relying only on SaaS. Also, knowing the inner workings of a system is a great way to make the right decisions about the settings and usage strategies to apply when you move to a SaaS platform. SaaS is a great way to avoid all the headaches associated with scaling and replication. But don’t ignore these services’ internals altogether, or you will be exposing yourself to underperformance and overspending.
About the author:
Jerome Eteve is a full stack senior web application developer based in London. Over his career he has reviewed seminal books about Solr and implemented custom search solutions in a variety of high volume products.
Richard Donovan is an integration architect whose work includes search, big data and complex business processes.