Over a million developers have joined DZone.

Using system disk cache for speeding up the indexing with SOLR

· Big Data Zone

Learn how you can maximize big data in the cloud with Apache Hadoop. Download this eBook now. Brought to you in partnership with Hortonworks.

Benchmarking is rather hard subject of software development, especially in a sand-boxed development environments, like JVM with "uncontrolled" garbage collection. Still, there are tasks, that are more IO heavy, like indexing xml files into Apache Solr and this is where you can control more on the system level to do better benchmarking.

So what about batch indexing? There are ways to speed it up purely on SOLR side.

This post shows a possible remedy to speeding up indexing purely on the system level, and assumes linux as the system.

The benchmarking setup that I had is the following:

Apache SOLR 4.3.1
Ubuntu with 16G RAM
Committing via softCommit feature

What I set up to do is to play around the system disk cache. One of the recommendations of speeding up the search is to cat the index files into the cache, using the command:

cat an_index_file > /dev/null

Then the index is read from the disk cache buffers and is faster than reading it cold.

What about bulk indexing xml files into Solr? We could cat the xml files to be indexed into the disk cache and possibly speed up the indexing. The following figures are not exactly statistically significant, nor was the test done on a large amount of xml files, but the figures do show the trend:

With warmed up disk cache:
real  1m27.604s
user  0m2.220s
sys  0m2.860s

After dropping the file cache:
echo 3 | sudo tee /proc/sys/vm/drop_caches

real  1m30.285s
user  0m2.148s
sys  0m3.700s

Again, hot cache:
real  1m27.924s
user  0m2.264s
sys  0m3.068s

Again, after dropping the file cache:
echo 3 | sudo tee /proc/sys/vm/drop_caches

real  1m32.791s
user  0m2.204s
sys  0m3.104s

The figures above are pretty clear, that having the files cached speeds the indexing up by about 3-5 seconds for just 420 xml files.

Coupled with ways of increasing the throughput on the SOLR side this approach could win some more seconds / minutes / hours in the batch indexing.

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

Topics:

Published at DZone with permission of Dmitry Kan, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}