Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Using system disk cache for speeding up the indexing with SOLR

DZone's Guide to

Using system disk cache for speeding up the indexing with SOLR

· Big Data Zone
Free Resource

Learn best practices according to DataOps. Download the free O'Reilly eBook on building a modern Big Data platform.

Benchmarking is rather hard subject of software development, especially in a sand-boxed development environments, like JVM with "uncontrolled" garbage collection. Still, there are tasks, that are more IO heavy, like indexing xml files into Apache Solr and this is where you can control more on the system level to do better benchmarking.

So what about batch indexing? There are ways to speed it up purely on SOLR side.

This post shows a possible remedy to speeding up indexing purely on the system level, and assumes linux as the system.

The benchmarking setup that I had is the following:

Apache SOLR 4.3.1
Ubuntu with 16G RAM
Committing via softCommit feature

What I set up to do is to play around the system disk cache. One of the recommendations of speeding up the search is to cat the index files into the cache, using the command:

cat an_index_file > /dev/null

Then the index is read from the disk cache buffers and is faster than reading it cold.

What about bulk indexing xml files into Solr? We could cat the xml files to be indexed into the disk cache and possibly speed up the indexing. The following figures are not exactly statistically significant, nor was the test done on a large amount of xml files, but the figures do show the trend:

With warmed up disk cache:
real  1m27.604s
user  0m2.220s
sys  0m2.860s

After dropping the file cache:
echo 3 | sudo tee /proc/sys/vm/drop_caches

real  1m30.285s
user  0m2.148s
sys  0m3.700s

Again, hot cache:
real  1m27.924s
user  0m2.264s
sys  0m3.068s

Again, after dropping the file cache:
echo 3 | sudo tee /proc/sys/vm/drop_caches

real  1m32.791s
user  0m2.204s
sys  0m3.104s

The figures above are pretty clear, that having the files cached speeds the indexing up by about 3-5 seconds for just 420 xml files.

Coupled with ways of increasing the throughput on the SOLR side this approach could win some more seconds / minutes / hours in the batch indexing.

Find the perfect platform for a scalable self-service model to manage Big Data workloads in the Cloud. Download the free O'Reilly eBook to learn more.

Topics:

Published at DZone with permission of Dmitry Kan, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}