Our reader (greetings!) reported a problem with the cooperation of DIH and sharding mechanism. The Solr project wiki, in my opinion, discusses the solution to this issue, but makes it a little hard to get.
What is sharding?
Sharding means the division of data into several parts and the storage and processing of the data independently. The additional logic within the application allows you to select the appropriate part of data and/or pooling results from various sources. In the case of DIH and sharding we have to deal with the following case:
- sharding on the side of the data source – this means multiple locations/tables with different parts of the data set
- sharding on the SOLR side – that is, dividing the data from a source on many independent instances of SOLR
- both of these simultaneously
In our case we have one set of data and we want to create a lot of sets (called shards) on the SOLR side.
When to use sharding?
A very important question: why we use the sharding mechanism ? In my opinion sharding happens to be abused too often and thus generates lots of additional complications and limitations. The main reason to use sharding is the large volume of data that make SOLR indicies not fall within one machine. If it does not – it often means that sharding is redundant. Another reason is performance. But sharding can help here only if other optimizations fail and the queries are so complicated that the same addintional cost of sharding (forward queries to the individual Shards and combining their answers) is less than the profit performance that can be achieved.
Let’s assume that we need sharding. In the example below, I used data from the MusicBrainz creating a simple postgresql table:
The table contains 825,661 records. I stress here that both the structure and amount of data is so small that the practical usefulness of using sharding here is negligible.
For the tests we use three instances of SOLR. All instances are identical, the difference is related only to the number of ports (8983, 7872, 6761) – Tests will be performed on one physical machine.
Definition at schema.xml:
<fields> <field name="id" type="string" indexed="true" stored="true" required="true" /> <field name="album" type="text" indexed="true" stored="true" multiValued="true"/> <field name="author" type="text" indexed="true" stored="true" multiValued="true"/> </fields> <uniqueKey>id</uniqueKey> <defaultSearchField>album</defaultSearchField>
Definition of DIH in solrconfig.xml:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"> <lst name="defaults"> <str name="config">db-data-config.xml</str> </lst> </requestHandler>
And the file DIH db-data-config.xml:
<dataConfig> <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/shardtest" user="solr" password="secret" /> <document> <entity name="album" query="SELECT * from albums"> <field column="id" name="id" /> <field column="name" name="album" /> <field column="author" name="author" /> </entity> </document> </dataConfig>
At this point, each instance is unable to complete the data import.
So let’s setup sharding
Our goal is to modify the configuration such that each instance of DIH index only “their” part of the data. The easiest way to do this is by modifying the query retrieving data to the one like this:
SELECT * from albums where id % NUMBER_OF_INSTANCES = INSTANCE_NUMBER
- NUMBER_OF_INSTANCES – the number of Solr servers that store the number of unique parts of the data set
- INSTANCE_NUMBER – instance number (starting from zero)
such query does not guarantee exactly and perfectly equal distribution but satisfies two necessary conditions:
- the record will always go to a specific and always the same instance
- single record will always go to only one instance
so the db-data-config.xml on each machine is different now and looks like this:
- SELECT * from albums where id % 3 = 0
- SELECT * from albums where id % 3 = 1
- SELECT * from albums where id % 3 = 2
How it works
After starting up each of the Solr instances we run the following query on each of them:
When DIH command ends we send the following command:
We should get the following responses:
- Added/Updated: 275220 documents.
- Added/Updated: 275221 documents.
- Added/Updated: 275220 documents.
Performing a simple insert operation, we see that in all instances we
have a total of 825,661 documents, as much as there should be
Make another request – ask for all document. Using sharding we can send the following query to any instance: