Apache Solr powers enterprise search on sites from Ebay to Zappos. It also powers Carsabi, but when we reached 1.8M listings per month (passing Autotrader & Cars.com) our basic installation began to run about as fast as an octogenarian in congealing cement. I’d like to share the basics of Solr optimization, as well as some data on real world gains.
Very briefly, our stack has gone through a few iterations which may be sufficient for your corpus volume – no sense in over-engineering. Postgres tables had to be denormalized at 100k vehicles, and we switched to WebSolr’s extremely convenient Solr solution at 300k – their Heroku plugin will create an installation in minutes for just $20/month. This worked very well until about 1M listings, at which point even their beefiest plan was returning results with >800ms latency.
Hardware: Bigger Is Better. A Lot Better
Our previous Solr-as-a-Service had been hosted on an Amazon EC2 Large instance and returned in 800ms. Fortunately, we had spare capacity on an EC2 Cluster Compute Eight Extra Large, which we use for our webcrawler, and just moving to this machine dropped our query time to 282ms – a speed increase of 2.84x. Notice this corresponds to the processor speed increase of 2.75x between a Large and CC8XL, not the 22x gain in total compute units. Memory appears to be equally irrelevant with both the Large and CC8XL easily keeping our 3GB index in RAM. However, do make sure to give Solr sufficient memory by adjusting the JVM heap size via the -Xmx option.
Software: Shard that Sh*t
282ms is pretty good, but I wanted better - Solr was still responsible for over 50% of our user latency. Google was consulted with surprising results: even if you have just one server, you should still shard your workload. This struck me as odd. Surely Solr is multi-threaded, so why the difference? A quick look at top told me this wasn't the case: Solr's CPU usage never went above 100% (on one core) even though our server had 16 physical cores. A couple hours later, and now with our index spread over 8 shards, our query time was down to 43ms! Also, top showed Solr's CPU usage at 483%, so it was clearly using multiple cores.
I ran some benchmarks, and the following results show solid gains up to 8 shards. If you’re interested in how to shard your own index, I’ve published a brief summary here. The test was run on a representative query from our workload using sorting on two dimensions, a geo bounding box, four numeric range filters, one datetime range filter, and a categorical range filter. I don't know if the fact that it levels off after 8 is due to our specific dataset, or the fact that our server has dual 8-core processors – however, if you have any insight please shoot me a line! *edit* or join in the discussion on HN