
Using SolrCloud to Clone Your Search Index


The search engine GPSN has a fairly static dataset of Chinese patents, delivered as roughly 5 TB worth of very large Zip files. The data for each patent is split among various Zip files for English text, Chinese text, and image files, and each type has its own vagaries of how it is stored. When debugging issues in matching up all the data files, we wanted to be able to query for a file and find out which Zip archives it was stored in. We ended up using Solr, which saved us from configuring some sort of more traditional database, since this was the only use case, though at times I wished I had something a bit more expressive to work with!
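As a sketch of that lookup (the hostname, field names, and document ID here are assumptions, not the actual GPSN schema), a query might look like:

```shell
# Hypothetical host, field names, and filename; the real schema may differ.
SOLR="http://localhost:8983/solr/data_audit"
QUERY_URL="${SOLR}/select?q=filename:CN101234567.xml&fl=filename,zip_archive&wt=json"
echo "$QUERY_URL"
# curl "$QUERY_URL"   # run against the live index
```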

Well, today I finally got to grips with putting together a script that would go over all the stored data and check it for various error conditions we have identified over the course of the project. Initially, the data_audit index was both what I was querying for information about where a patent might have gone and what I was updating with the associated metadata. Commits are set up to happen every 15 seconds, and under decent (but not silly) amounts of load, queries were easily taking 15 to 20 seconds because of the constant commit activity: the caches never had a chance to fill up.
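A commit interval like that would look roughly like this in solrconfig.xml (an illustrative fragment, not the project's actual config); with openSearcher enabled, each commit opens a new searcher and throws the warmed caches away:

```xml
<!-- Illustrative solrconfig.xml fragment, not the project's actual config. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>          <!-- hard commit every 15 seconds -->
    <openSearcher>true</openSearcher> <!-- each commit opens a new searcher,
                                           invalidating the caches -->
  </autoCommit>
</updateHandler>
```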

So, I refactored my audit code to have two properties: a solrDataWriter, which would be my primary Solr index, and a solrDataReader, which would be a clone of my data_audit index.

I didn’t want to go through the legwork of standing up lots of Solr instances, but fortunately my colleague @omnifroodle had been working over the past two months on some Amazon CloudFormation scripts for SolrCloud. While still raw, they are available at http://github.com/o19s/cfn-solr.

I ran the CloudFormation template and stood up three SolrCloud servers. I then put my data_audit conf directory into ZooKeeper and created a single shard with two replicas. I was thinking of it like RAID levels: in my case, I only wanted read performance and minimal setup work, and I am not worried about writes being propagated.
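Those two steps can be sketched with zkcli.sh and the Collections API (hostnames and paths here are placeholders; zkcli.sh's location varies by Solr version):

```shell
# Placeholder hostnames and paths; adjust for your cluster.
ZK_HOST="zk1.example.com:2181"
SOLR_HOST="ec2-solr.example.com:8983"

CREATE_URL="http://${SOLR_HOST}/solr/admin/collections?action=CREATE&name=data_audit_raid2&numShards=1&replicationFactor=2&collection.configName=data_audit"
echo "$CREATE_URL"

# Upload the conf directory to ZooKeeper, then create the collection:
# ./zkcli.sh -zkhost "$ZK_HOST" -cmd upconfig -confdir ./data_audit/conf -confname data_audit
# curl "$CREATE_URL"
```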


My “RAID” of shards

Next was how to get the data over. Well, replication has always been one of my favorite tools, so I used curl and copied the data over:

curl "http://ec2-68-202-9-80.compute-1.amazonaws.com:8983/solr/data_audit_raid2_shard1_replica1/replication?command=fetchindex&masterUrl=http://ec2-23-21-214-112.compute-1.amazonaws.com/solr/data_audit"

Now, in SolrCloud-land, replication is actually meant for moving bulk amounts of data from the leader shard to the replicas when things get out of sync, or for bringing up a new replica, but invoking it manually seemed to work as well. The GUI doesn’t show all the details, so browse the replication handler directly to monitor progress:

http://ec2-68-202-9-80.compute-1.amazonaws.com:8983/solr/data_audit_raid2_shard1_replica1/replication?command=details
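To poll that endpoint from a script (the hostname below is a placeholder, not the cluster above), something like this works; while a fetchindex is in flight, the details response reports that replication is running:

```shell
# Placeholder hostname; point this at your replica.
DETAILS_URL="http://ec2-replica.example.com:8983/solr/data_audit_raid2_shard1_replica1/replication?command=details&wt=json"
echo "$DETAILS_URL"
# Poll every few seconds until the fetchindex finishes:
# while curl -s "$DETAILS_URL" | grep -q '"isReplicating":"true"'; do sleep 5; done
```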

I did hope that if I replicated to the leader shard, it would in turn replicate to all the follower shards, but no joy.



Published at DZone with permission of Eric Pugh, DZone MVB. See the original article here.
