
Using system disk cache for speeding up the indexing with SOLR

Benchmarking is a rather hard subject in software development, especially in sandboxed environments such as the JVM, where garbage collection is not under your direct control. Still, some tasks are more I/O-heavy, such as indexing XML files into Apache Solr, and there you have more control at the system level and can benchmark more reliably.

So what about batch indexing? There are ways to speed it up purely on the Solr side.

This post shows a possible way to speed up indexing purely at the system level, and assumes Linux as the operating system.

The benchmarking setup I used was the following:

Apache SOLR 4.3.1
Ubuntu with 16G RAM
Committing via softCommit feature

What I set out to do was to play around with the system disk cache. One recommendation for speeding up search is to cat the index files into the cache, using the command:

cat an_index_file > /dev/null

The index is then read from the disk cache buffers, which is faster than reading it cold from disk.
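For an index made of many files, the same trick can be wrapped in a small helper. This is only a sketch: warm_cache is a name I made up, and the directory you pass to it is whatever holds your index files.

```shell
#!/bin/sh
# warm_cache DIR (hypothetical helper): read every regular file under DIR
# once and discard the data, so the kernel keeps the pages in its cache.
warm_cache() {
    dir=$1
    # "-exec cat {} +" passes many files to each cat invocation,
    # which avoids starting one process per file.
    find "$dir" -type f -exec cat {} + > /dev/null
}

# Usage (the path is a placeholder, not taken from the benchmark):
#   warm_cache /path/to/solr/index
```

The helper prints nothing. Whether the pages stay cached depends on how much free memory the machine has compared to the size of the files.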

What about bulk indexing XML files into Solr? We could cat the XML files to be indexed into the disk cache and possibly speed up the indexing. The following figures are not statistically significant, nor was the test done on a large number of XML files, but they do show the trend:

With warmed up disk cache:
real  1m27.604s
user  0m2.220s
sys  0m2.860s

After dropping the file cache:
echo 3 | sudo tee /proc/sys/vm/drop_caches

real  1m30.285s
user  0m2.148s
sys  0m3.700s

Again, hot cache:
real  1m27.924s
user  0m2.264s
sys  0m3.068s

Again, after dropping the file cache:
echo 3 | sudo tee /proc/sys/vm/drop_caches

real  1m32.791s
user  0m2.204s
sys  0m3.104s
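To repeat such cold and hot runs, something along these lines could be used. It is a sketch under assumptions: drop_cache and cached_kb are helper names I invented, the sync call is my addition (the kernel only drops clean pages, so flushing dirty ones first makes the cold run colder), and index_all.sh and /path/to/xml stand in for the actual indexing job and its input, which are not shown in this post.

```shell
#!/bin/sh
# drop_cache (hypothetical helper): flush dirty pages to disk, then ask the
# kernel to drop the page cache, dentries and inodes. Needs root via sudo.
drop_cache() {
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
}

# cached_kb (hypothetical helper): print the current page cache size in kB,
# as reported on the "Cached:" line of /proc/meminfo.
cached_kb() {
    awk '$1 == "Cached:" { print $2 }' /proc/meminfo
}

# Usage (index_all.sh and /path/to/xml are placeholders):
#   drop_cache; cached_kb                # the value should shrink
#   time ./index_all.sh                  # cold run
#   cat /path/to/xml/*.xml > /dev/null   # warm the cache
#   cached_kb                            # the value should grow
#   time ./index_all.sh                  # hot run
```

Checking cached_kb before each run is a cheap way to confirm that the cache really is in the state you think you are measuring.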

The figures above show fairly consistently that having the files cached speeds up the indexing by about 3-5 seconds (roughly 3-5% of the run time) for just 420 XML files.
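The arithmetic behind that estimate can be checked with a few lines of shell and awk. The four timings are the "real" values measured above; the helper names to_seconds and mean_seconds are mine.

```shell
#!/bin/sh
# to_seconds converts the "XmY.YYYs" format printed by `time` into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# mean_seconds averages any number of such timings.
mean_seconds() {
    for t in "$@"; do
        to_seconds "$t"
    done | awk '{ sum += $1 } END { printf "%.3f\n", sum / NR }'
}

hot=$(mean_seconds 1m27.604s 1m27.924s)    # runs with a warm cache
cold=$(mean_seconds 1m30.285s 1m32.791s)   # runs after dropping the cache

echo "$hot $cold" | awk '{
    printf "hot %.3fs, cold %.3fs, saved %.3fs (%.1f%%)\n",
           $1, $2, $2 - $1, ($2 - $1) / $2 * 100
}'
# prints: hot 87.764s, cold 91.538s, saved 3.774s (4.1%)
```

With only two runs per condition this is an average, not a statistical test, so it should be read as a trend.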

Coupled with ways of increasing throughput on the Solr side, this approach could save some more seconds, minutes or hours in batch indexing.

Published at DZone with permission of Dmitry Kan, DZone MVB. See the original article here.
