Using Apache Solr in Production
Deploy Solr in production and handle Solr collections.
Solr is a search engine built on top of Apache Lucene. Apache Lucene uses an inverted index to store documents (data) and exposes search and indexing functionality through a Java API. However, to use features such as full-text search with Lucene alone, you would need to write Java code.
Solr builds on Lucene's search capabilities: it offers more functionality out of the box and is designed for scalability. Solr comes loaded with features like pagination, sorting, faceting, auto-suggest, and spell check. Solr has also used a trie structure for numeric and date data types, e.g. there is a normal int field and a tint field, which represents the trie int field. (Trie fields are deprecated as of Solr 7 in favor of point fields such as pint.)
Solr is fast at searching and analyzing text, thanks to its inverted index structure. If your application requires extensive text searching, Solr is a good choice. Several companies like Netflix, Verizon, AT&T, and Qualcomm use Solr as their search engine. Even Amazon CloudSearch, a search service by AWS, uses Solr internally.
This article walks through one way to deploy Solr in production and to create Solr collections. If you are just starting with Solr, you should start by building a Solr core. A core is a single index on a single-node Solr server, with no shards or replicas, while a collection consists of shards and their replicas, each of which is a core.
In a distributed search, a collection is a logical index across multiple servers. The part of each server that runs a collection is called a core. So, in a non-distributed search, a core and a collection are the same because there is only one server.
In production, you need a collection rather than a single Solr core, because one core cannot hold production data unless you scale vertically. Apache ZooKeeper coordinates the cluster: it lets the Solr servers share configuration and cluster state.
There are two ways you can set this up:
- Run multiple Solr servers and use the ZooKeeper on one of them.
- Run ZooKeeper on a separate server, with all the Solr servers connecting to it.
We'll go through the second approach. The first approach is similar, but the second is more scalable.
Spin up three servers and install Solr on two of them; the third will run ZooKeeper. (Note: you can run any number of Solr servers; we use two in our example.) To install Solr, you need to install Java first; then download the desired version and untar it.
Download: wget http://archive.apache.org/dist/lucene/solr/8.1.0/solr-8.1.0.tgz
Untar: tar -zxvf solr-8.1.0.tgz
You can start Solr from the /home/ubuntu/solr-8.1.0 folder with bin/solr start, or from the bin folder with ./solr start. This starts Solr on port 8983, and you can test it in the browser.
Replicate the exact same steps to install Solr on your second server.
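The install steps above can be collected into a small script so that both Solr servers are set up the same way. This is only a sketch: the helper names solr_tarball_url and install_solr are made up for illustration, and it assumes wget, tar, and Java are already available on the server.

```shell
# Hypothetical helpers; only the download URL pattern comes from the article.
SOLR_VERSION="8.1.0"

# Build the archive URL for a given Solr version.
solr_tarball_url() {
  printf 'http://archive.apache.org/dist/lucene/solr/%s/solr-%s.tgz\n' "$1" "$1"
}

# Download, unpack, and start Solr in the current directory.
install_solr() {
  version="$1"
  java -version || { echo "Install Java first" >&2; return 1; }
  wget "$(solr_tarball_url "$version")" || return 1
  tar -zxvf "solr-$version.tgz" || return 1
  "solr-$version/bin/solr" start
}

# Nothing runs until you call the function yourself on each Solr server:
#   install_solr "$SOLR_VERSION"
```

Keeping the version in one variable avoids a mismatch between the downloaded tarball and the folder name used in later commands.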
Also, remember to list the IP address and name of each server in /etc/hosts:
<IPv4 public IP of solr-node-1> solr-node-1
<IPv4 public IP of solr-node-2> solr-node-2
<IPv4 public IP of zookeeper-node> zookeeper-node
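As an illustration of the /etc/hosts entries, the sketch below prints one line per server. The hosts_entries helper is hypothetical, and the 192.0.2.x addresses are documentation placeholders that you would replace with your servers' real public IPs.

```shell
# Print "<ip> <hostname>" lines from pairs of arguments.
hosts_entries() {
  while [ "$#" -ge 2 ]; do
    printf '%s %s\n' "$1" "$2"
    shift 2
  done
}

# Placeholder addresses (assumption), hostnames as used in this article.
hosts_entries 192.0.2.11 solr-node-1 192.0.2.12 solr-node-2 192.0.2.13 zookeeper-node

# To apply on a server, append the output to /etc/hosts, for example:
#   hosts_entries 192.0.2.11 solr-node-1 ... | sudo tee -a /etc/hosts
```

The same three lines should go on every server so that each node can resolve the others by name.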
Now, the third server requires only ZooKeeper, to which you will push the configsets. Download the ZooKeeper 3.4.9 tarball and untar it.
Untar: tar -zxvf zookeeper-3.4.9.tar.gz
If you like, you can add the path to ZooKeeper to the bashrc file.
Next, in the conf folder inside zookeeper-3.4.9, there is a sample configuration file that ships with ZooKeeper: zoo_sample.cfg. Copy this file within the same folder and name the copy zoo.cfg. The configuration file contains parameters such as dataDir, which specifies the directory that stores snapshots of the in-memory database and the transaction logs, and maxClientCnxns, which limits the maximum number of client connections.
Open the zoo.cfg file, uncomment autopurge.snapRetainCount=3 and autopurge.purgeInterval=1, and set dataDir=data.
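If you would rather script the zoo.cfg edits than make them by hand, a sed-based sketch could look like the following. The make_zoo_cfg helper is hypothetical, and it assumes the autopurge lines appear in the sample file commented out with a leading # and no space.

```shell
# $1 = path to zoo_sample.cfg, $2 = path of the zoo.cfg to write
make_zoo_cfg() {
  # Uncomment the two autopurge settings and point dataDir at "data".
  sed -e 's/^#\(autopurge\.snapRetainCount=\)/\1/' \
      -e 's/^#\(autopurge\.purgeInterval=\)/\1/' \
      -e 's|^dataDir=.*|dataDir=data|' \
      "$1" > "$2"
}

# Usage, from inside the zookeeper-3.4.9 folder:
#   make_zoo_cfg conf/zoo_sample.cfg conf/zoo.cfg
```

A relative dataDir is likely resolved against the directory ZooKeeper is started from, so an absolute path may be the safer choice on a real server.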
Next, start ZooKeeper.
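The article does not show the start command; ZooKeeper ships with a zkServer.sh script for this. A minimal sketch, assuming the install directory is passed in by the caller (the start_zookeeper wrapper is a made-up name):

```shell
# $1 = ZooKeeper install directory, e.g. ./zookeeper-3.4.9 (assumption)
start_zookeeper() {
  # Start the server, then print its status if the start succeeded.
  "$1/bin/zkServer.sh" start && "$1/bin/zkServer.sh" status
}

# Usage on the ZooKeeper server:
#   start_zookeeper ./zookeeper-3.4.9
```

Checking the status right after starting makes a bad zoo.cfg visible immediately rather than later, when Solr fails to connect.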
Creating A Configset
Configsets are essentially the blueprint of the data to be stored. They are stored at server/solr/configsets.
You can create your own configset and use it to store your data. Change the managed-schema file content to customize the config.
- You can modify the <field> tag to denote the data fields to be stored in one document.
- You can define the type or create a new type by defining it with the <fieldType> tag.
- The id field is compulsory, so you cannot delete it.
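To make the points above concrete, here is a sketch of what the field definitions in managed-schema might look like, printed from a shell heredoc so it can be pasted into the file. Apart from id, the field names are hypothetical, and the types string, text_general, and pint are assumed to be defined in your configset (they are in Solr 8's default one).

```shell
# Print example <field> definitions for managed-schema.
print_schema_fields() {
  cat <<'EOF'
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="age" type="pint" indexed="true" stored="true"/>
<uniqueKey>id</uniqueKey>
EOF
}

print_schema_fields
```

The uniqueKey element is what ties the id field to document identity, which is why that field has to stay.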
There are many other things you can use in Solr, such as dynamic fields and copy fields. Explaining each of them is beyond the scope of this article; for more information, see the official Solr documentation.
Now that you've created a config and run chmod -R 777 on the config folder, push the config to ZooKeeper.
bin/solr zk upconfig -n config_folder_name -d /solr-8.1.0/server/solr/configsets/config_folder_name/ -z zookeeper-node:2181
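It can help to confirm that the upload worked before moving on. The sketch below wraps the upconfig call and then lists the /configs node in ZooKeeper with bin/solr zk ls. The push_configset wrapper and its argument order are assumptions made for illustration.

```shell
# $1 = Solr install dir, $2 = configset name, $3 = ZooKeeper host:port
push_configset() {
  # Upload the configset, then list what ZooKeeper now holds under /configs.
  "$1/bin/solr" zk upconfig -n "$2" -d "$1/server/solr/configsets/$2/" -z "$3" &&
    "$1/bin/solr" zk ls /configs -z "$3"
}

# Usage on a Solr server:
#   push_configset /home/ubuntu/solr-8.1.0 config_folder_name zookeeper-node:2181
```

If the configset name does not show up in the listing, the later steps will not be able to find it either.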
After pushing the config, start SolrCloud on each Solr server. For SolrCloud setup details, refer to the official Solr documentation.
Connecting to ZooKeeper
To start a Solr node in cloud mode and connect it to ZooKeeper:
bin/solr start -cloud -s example/cloud/node1/solr/ -c -p 8983 -h solr-node-1 -z zookeeper-node:2181
Solr stores the inverted index at example/cloud/node1/solr/, so you need to pass that path (the -s option) when starting the node. Once a collection exists, SolrCloud distributes its shards and replicas over the two Solr servers, using the cluster state that ZooKeeper keeps. When you add a document, Solr hashes its id to decide which shard receives it, so you do not have to route documents yourself.
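The article does not show how the collection itself gets created. One way, sketched here under assumptions, is the Collections API CREATE action. The collection name user matches the query example later in this article; the shard and replica counts, and maxShardsPerNode=2 (needed in Solr 8 to fit four cores on two nodes), are illustrative choices rather than the author's settings.

```shell
# $1 = any Solr node, $2 = collection name, $3 = configset name in ZooKeeper
create_collection_url() {
  printf 'http://%s:8983/solr/admin/collections?action=CREATE&name=%s&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=%s\n' "$1" "$2" "$3"
}

# Print the URL so it can be inspected first.
create_collection_url solr-node-1 user config_folder_name

# To actually create the collection (requires the running cluster):
#   curl "$(create_collection_url solr-node-1 user config_folder_name)"
```

The request can go to either Solr node, since both read the same cluster state from ZooKeeper.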
To add data, send a POST request to http://<IP>:8983/solr/<collection_name>/update?commit=true.
The IP can be that of any server, as the data is automatically distributed among the shards.
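As a sketch of such a POST, the helpers below send a JSON array of documents with curl. The function names and the sample documents are hypothetical, and the field names must exist in your schema.

```shell
# Build the update URL for a node and collection.
update_url() {
  printf 'http://%s:8983/solr/%s/update?commit=true\n' "$1" "$2"
}

# $1 = node, $2 = collection, $3 = JSON array of documents
index_docs() {
  curl -s -X POST -H 'Content-Type: application/json' \
    --data-binary "$3" "$(update_url "$1" "$2")"
}

# Usage (requires the running cluster):
#   index_docs solr-node-1 user '[{"id":"1","name":"Alice"},{"id":"2","name":"Bob"}]'
```

commit=true makes the documents searchable right away, which is convenient for testing but can be costly under heavy indexing load.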
To get data from Solr, query http://<IP>:8983/solr/user/select?q=<searchString>, where user is the collection name used in this example.
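A query can be sent from the command line as well. In this sketch, curl's -G and --data-urlencode options take care of URL-encoding the search string. The helper names and the name field are assumptions for illustration.

```shell
# Build the select URL for a node and collection.
select_url() {
  printf 'http://%s:8983/solr/%s/select\n' "$1" "$2"
}

# $1 = node, $2 = collection, $3 = query string, e.g. 'name:Alice'
search() {
  curl -s -G --data-urlencode "q=$3" "$(select_url "$1" "$2")"
}

# Usage (requires the running cluster):
#   search solr-node-1 user 'name:Alice'
#   search solr-node-2 user '*:*'
```

Either node can answer the query, because each one fans the request out to the shards it needs.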
Note: If you are using one of the Solr servers as the ZooKeeper host, all the above steps are the same; the only things you need to do differently are to replace the ZooKeeper IP with that Solr node's IP and to use port 9983 instead of 2181.
Here are a couple of common problems that may arise while setting up SolrCloud.
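The 9983 in the note is not arbitrary: Solr's embedded ZooKeeper listens on the Solr port plus 1000. A small sketch of that arithmetic, with a hypothetical helper name:

```shell
# Embedded ZooKeeper port = Solr port + 1000.
embedded_zk_port() {
  echo $(( $1 + 1000 ))
}

embedded_zk_port 8983   # prints 9983

# Starting the second node against the embedded ZooKeeper on solr-node-1
# (requires that node to be running in cloud mode already):
#   bin/solr start -cloud -p 8983 -h solr-node-2 -z "solr-node-1:$(embedded_zk_port 8983)"
```

If the first node runs on a different port, such as 7574, the ZooKeeper port shifts with it.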
After you have created SolrCloud and are connecting to ZooKeeper, you may see an error saying that port 8983 or 7574 is already in use.
Solution: fuser -k 8983/tcp
This finds the process using the port and kills it.
Another error you may see is that SolrCloud cannot find the newly created configset.
Solution: Run chmod 777 on the new configset. The more secure approach is to chown the folder to the Solr user.
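For the port problem, killing the process outright is abrupt. A gentler variation, offered only as a sketch, asks Solr to stop the node on that port with bin/solr stop and falls back to fuser if that fails. The free_solr_port wrapper is a made-up name.

```shell
# $1 = Solr install dir, $2 = port to free
free_solr_port() {
  # Try a clean shutdown first; kill whatever holds the port only if that fails.
  "$1/bin/solr" stop -p "$2" || fuser -k "$2/tcp"
}

# Usage on the affected server:
#   free_solr_port /home/ubuntu/solr-8.1.0 8983
```

A clean stop gives Solr a chance to close its index files properly, which matters more on a node that already holds data.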
Solr has a large community of experienced users and contributors and is more mature than its competitors. Solr faces competition from Elasticsearch, which is open source and is also built on Apache Lucene. Elasticsearch is considered to be better at searching dynamic data such as log data, while Solr handles static data better. In terms of scaling, Elasticsearch has better built-in scalability features, but with ZooKeeper and SolrCloud it's easy to scale with Solr too.
Author bio: Pulkit Kedia is a backend engineer with experience in cloud services, system design, and creating scalable backend systems. He loves to learn and integrate new backend technologies.
Published at DZone with permission of Pulkit Kedia, DZone MVB.