Handling Shards in SolrCloud
Handling Shards in SolrCloud
There are lots of possibilities when it comes to shard creation. In this article, learn how to handle replica creation in SolrCloud.
Join the DZone community and get the full member experience.Join For Free
Learn how to migrate and modernize stateless applications and run them in a Kubernetes cluster.
One of the things that you learn when attending Sematext Solr training is how to scale Solr. We discuss various topics regarding leader shards and their replicas – things like when to go for more leaders, when to go for more replicas, and when to go for both. We discuss what you can do with them, how to control their creation, and how to work with them. I would like to focus on one of the mentioned aspects: handling replica creation in SolrCloud. Note that while this is not limited to Solr on Docker deployment; if you are considering running Solr in Docker containers, you will want to pay attention to this.
When creating a collection in SolrCloud, we can adjust the creation command. Some of the parameters are mandatory, but some of them have defaults and can be overwritten. The two main parameters we are interested in are the number of shards and the replication factor. The former tells Solr how to divide the collection – how many distinct pieces (shards) our collection will be split into. For example, if we say that we want to have four shards, Solr will divide the collection into four pieces, with each piece having about 25% of the documents. The replication factor, on the other hand, dictates the number of physical copies that each shard will have. So, when the replication factor is set to one, only leader shards will be created; if the replication factor is set to two, each leader will have one replica; if replication factor is set to three, each leader will have two replicas; and so on.
By default, Solr will put one shard of a collection on a given node. If we want to have more shards than the number of nodes we have in the SolrCloud cluster, we need to adjust the behavior, which we also can do by using Collections API. Of course, Solr will try to spread the shards evenly around the cluster, but we can also adjust that behavior by telling Solr on which nodes shards should be created.
Let’s look at this next. For the purpose of showing you how all of this works, I’ll use Docker and Docker Compose. I’ll launch four containers with SolrCloud and one container with ZooKeeper. I’ll use the following docker-compose.yml file:
version: "2" services: solr1: image: solr:6.2.1 ports: - "8983:8983" links: - zookeeper command: bash -c '/opt/solr/bin/solr start -f -z zookeeper:2181' solr2: image: solr:6.2.1 links: - zookeeper - solr1 command: bash -c '/opt/solr/bin/solr start -f -z zookeeper:2181' solr3: image: solr:6.2.1 links: - zookeeper - solr1 - solr2 command: bash -c '/opt/solr/bin/solr start -f -z zookeeper:2181' solr4: image: solr:6.2.1 links: - zookeeper - solr1 - solr2 - solr3 command: bash -c '/opt/solr/bin/solr start -f -z zookeeper:2181' zookeeper: image: jplock/zookeeper:3.4.8 ports: - "2181:2181" - "2888:2888" - "3888:3888"
Start all containers by running:
$ docker-compose up
The command needs to be run in the same directory where the docker-compose.yml file is located. After it finishes, we should see the following nodes in our SolrCloud cluster:
In order to create a collection, we will need to upload a configuration. I’ll use one of the configurations provided by default with Solr and will upload them to ZooKeeper with this command:
$ docker exec -it --user solr 56a02fb8d997 bin/solr zk -upconfig -n example -z zookeeper:2181 -d server/solr/configsets/data_driven_schema_configs/conf
The 56a02fb8d997 is the identifier of the first Solr container. BTW, if you are not familiar with Solr and Docker together, we encourage you to look at the slides of our Lucene Revolution 2016 talk, How to Run Solr on Docker and Why. The video should also be available soon.
We can finally move to collection creation. Let’s try to create a collection built of four primary shards. That is easy, we just run:
$ curl 'localhost:8983/solr/admin/collections?action=CREATE&name=primariesonly&numShards=4&replicationFactor=1&collection.configName=example'
The collection view, once the command is executed should look as follows:
We can, of course, create collection divided into fewer leader shards, but with replicas:
$ curl 'localhost:8983/solr/admin/collections?action=CREATE&name=primariesandreplicas&numShards=2&replicationFactor=2&collection.configName=example'
We'll have more shards than we have nodes:
$ curl 'localhost:8983/solr/admin/collections?action=CREATE&name=primariesandreplicasmore&numShards=4&replicationFactor=2&maxShardsPerNode=2&collection.configName=example'
The key point here is this: once you've created a collection, you can add more replicas, but the number of leader shards will stay the same. This statement is true even if we are using the compositeId router and if we don’t split shards. So, how can we add more replicas? Let’s find out.
Adding Replicas Manually
The first idea is to add replicas manually by specifying which collection and shard we are interested in and where the replica should be created. Let’s look at the collection create with the following command:
$ curl 'localhost:8983/solr/admin/collections?action=CREATE&name=manualreplication&numShards=2&replicationFactor=1&collection.configName=example'
To add a replica, we need to say on which node the new replica should be placed and which shard it should replicate. Let’s try creating a replica for shard1 and place it on 172.19.0.3. To do that, we run the following:
$ curl 'localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=manualreplication&shard=shard1&node=172.19.0.3:8983_solr'
After executing the above command and waiting for the recovery process to finish, we would see the following view of the collection:
Of course, you may ask where to get the node name. You can retrieve it from the /live_nodes node in ZooKeeper:
Please remember that specifying the node name is not mandatory. We may actually let Solr choose the node for us. We still have to manually add replicas when we add machines, though. Can this be changed? Automated?
Automatically Adding Replicas
It is possible to automatically create new replicas when SolrCloud is working on top of a shared file system. We do that by adding the autoAddReplicas=true to the collection creation command, just like this:
$ curl 'localhost:8983/solr/admin/collections?action=CREATE&name=autoadd&numShards=2&replicationFactor=1&autoAddReplicas=true'
On a shared file system, Solr will automatically create replicas. Unfortunately, it is not possible to automatically spread Solr collection when new nodes are added to the cluster when Solr is not running on a shared file system.
Controlling Shard Placement
We’ve mentioned earlier that we can assign shards when creating the collection. We can do it in a semi-manual fashion or using rules. The semi-manual way is very simple. For example, let’s create a collection that is placed on nodes 172.19.0.3 and 172.19.0.4 only. We can do that by running the following:
$ curl 'localhost:8983/solr/admin/collections?action=CREATE&name=manualplace&numShards=2&replicationFactor=1&createNodeSet=172.19.0.3:8983_solr,172.19.0.4:8983_solr'
The resulting collection would look as follows:
We can do more than this. We can define rules that will be used by Solr to place shards in our SolrCloud cluster. There are three possible conditions for rules, which should be met for the replica to be assigned to a given node. Those conditions are:
- Shard, the name of the shard or wild card; tells Solr to which shards the rule should be applied; if not provided, it will be used for all shards.
- Replica, a * wild card or number from zero to infinity.
- Tag, an attribute of a node, which should match in a rule.
There are also rule operators, which include:
- Equal, for example tag:xyz, which means that tag property must be equal to xyz.
- > (greater than), which means that the value of the property must be higher than the provided value.
- < (less than), which means that the value of the property must be lower than the provided value.
- !, meaning not equal, which means that the value of the property must be different than the provided value.
We will see how to use these rules and operators shortly.
To fully use the above rules and operators, we need to learn about snitches. Snitches are values coming from plugins that implement Snitch interface. There are a few of those provided by Solr out of the box:
- Cores, the number of cores in a node.
- Freedisk, the disk space available on a node (in GB).
- Host, the name of the host on which the node works.
- Port, the port of the node.
- Node, the node name.
- Role, the role of the node; during the writing of this article, the only possible value here is "overseer."
- ip_1, ip_2, ip_3, ip_4, part of IP address (for example, in 172.19.0.2, the ip_1 is 2, ip_2 is 0, ip_3 is 19 and ip_4 is 172).
- sysprop.PROPERTY_NAME, the property name provided by -Dkey=value during node startup.
We configure snitches when we create the collection.
For example, to create a collection and only place shards on nodes that have at least 30GB of disk space, and have IP that starts with 172 and ends with 4, we would use the following collection create command:
$ curl 'localhost:8983/solr/admin/collections?action=CREATE&name=rulescollection&numShards=4&replicationFactor=1&maxShardsPerNode=4&rule=shard:*,replica:*,ip_1:4&rule=shard:*,replica:*,ip_4:172&rule=shard:*,replica:*,freedisk:>30'
The created collection would look as follows:
The rules are persisted in ZooKeeper and will be applied to every shard. That means that during replica placement and shard splitting, Solr will use the stored rules. Also, we can modify the rules by using collection modify command.
As you can see, we have lots of possibilities when it comes to shard creation, from simple control to sophisticated rules based on out of the box values and ability to provide our own. For me, rules are the missing functionality in Solr that I was looking for when SolrCloud was still young. I hope it will be known to more and more users.
Published at DZone with permission of Rafał Kuć , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.