Elasticsearch Setup and Configuration
If you're looking to get up and running with Elasticsearch to allow users to search through data, then this is the article for you! Have a look!
What Is Elasticsearch?
Elasticsearch is a highly scalable, distributed, open-source full-text search and analytics engine. It lets you store, index, and search large volumes of data in near real time. Internally, it uses Apache Lucene for indexing and storing data. Below are a few use cases for it.
- Product search for e-commerce websites.
- Collecting application logs and transaction data and analyzing them for trends and anomalies.
- Indexing instance metrics (health, stats) and doing analytics, creating alerts for instance health at regular intervals.
- Analytics and business-intelligence applications.
Elasticsearch Basic Concepts
We will use a few terms while talking about Elasticsearch. Let's look at its basic building blocks.
Near Real-Time
Elasticsearch is near real-time: there is a short delay (latency) between the time a document is indexed and the time it becomes available for searching.
Cluster
A cluster is a collection of one or more nodes (servers) that together hold all your data and let you index and search across all of them.
Node
A node is a single server that is part of your cluster. It can store data and participate in indexing, searching, and overall cluster management. A node can come in four different flavors: master, HTTP, data, and coordinating/client.
Index
An index is a collection of documents with similar characteristics. It is identified by a name (all lowercase), and that name is used to index, search, update, and delete the documents in it.
Document
A document is a single unit of information that can be indexed.
Shards and Replicas
A single index can store billions of documents, which can take up terabytes of space. A single server may not have the capacity to store that much data or to search it fast enough. To solve this problem, Elasticsearch subdivides your index into multiple units called shards, which can be spread across the nodes of the cluster.
Replicas are copies of shards. Replication matters for two reasons: it provides high availability if a node or shard fails, and it lets you scale out search throughput, because searches can run on replicas too. By default, an index in Elasticsearch 6.x gets 5 primary shards and 1 replica of each, and both values can be configured when the index is created.
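As a rough sketch of how these two values can be supplied when an index is created, the snippet below builds the settings body and sends it with curl only if `ES_URL` is set. The index name `products`, the `ES_URL` variable, and the shard and replica counts are illustrative assumptions, not values from this article.

```shell
# Build the JSON settings body for a new index.
# $1 = number of primary shards, $2 = number of replicas per primary shard.
index_settings() {
  printf '{"settings":{"number_of_shards":%d,"number_of_replicas":%d}}\n' "$1" "$2"
}

body=$(index_settings 3 2)
printf '%s\n' "$body"

# Only talk to a cluster when ES_URL is set, e.g. ES_URL=http://localhost:9200
# "products" is a hypothetical index name.
if [ -n "${ES_URL:-}" ]; then
  curl -s -X PUT "$ES_URL/products" \
    -H 'Content-Type: application/json' \
    -d "$body"
fi
```

Keep in mind that the number of primary shards is fixed once the index exists, while the number of replicas can still be changed later.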
Elasticsearch requires Java to run. As of writing this article, Elasticsearch 6.2.X+ requires at least Java 8.
Installing Java 8
# Installing OpenJDK
sudo apt-get install openjdk-8-jdk

# Installing Oracle JDK
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y install oracle-java8-installer
Installing Elasticsearch with a tar file
curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.4.tar.gz
tar -xvf elasticsearch-6.2.4.tar.gz
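At this point the archive is only unpacked. Here is a minimal sketch of starting it, assuming the archive was extracted into the current directory. The `ES_HOME` variable and the `es.pid` file name are my own choices; `-d` runs Elasticsearch as a daemon and `-p` writes its process ID to a file.

```shell
# Assumed location of the extracted archive; override ES_HOME if it differs.
ES_HOME="${ES_HOME:-./elasticsearch-6.2.4}"

if [ -x "$ES_HOME/bin/elasticsearch" ]; then
  # -d: run in the background, -p: record the process ID in a file
  "$ES_HOME/bin/elasticsearch" -d -p "$ES_HOME/es.pid"
  echo "started, PID file: $ES_HOME/es.pid"
else
  echo "no Elasticsearch found under $ES_HOME"
fi
```

Run this as a regular user: Elasticsearch refuses to start as root. To stop the node again, send a signal to the recorded PID, for example `kill "$(cat "$ES_HOME/es.pid")"`.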
Installing Elasticsearch with a package manager
# Import the Elasticsearch public GPG key into apt
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

# The 6.x repository is served over HTTPS
sudo apt-get -y install apt-transport-https

# Create the Elasticsearch source list
echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-6.x.list

sudo apt-get update
sudo apt-get -y install elasticsearch
Configuring an Elasticsearch Cluster
Configuration file location if you have downloaded the tar file:
Configuration file location if you used a package manager to install Elasticsearch:
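The two paths did not survive in this copy of the article. As a hedged sketch: on a default 6.x install the main configuration file is usually `config/elasticsearch.yml` inside the extracted directory for the tar download, and `/etc/elasticsearch/elasticsearch.yml` for the Debian package. The snippet below checks both candidates and assumes the archive was extracted into the current directory.

```shell
# Candidate locations, assuming default layouts (not taken from this article).
tar_conf="./elasticsearch-6.2.4/config/elasticsearch.yml"
pkg_conf="/etc/elasticsearch/elasticsearch.yml"

for conf in "$tar_conf" "$pkg_conf"; do
  if [ -f "$conf" ]; then
    echo "found: $conf"
  else
    # The package directory may not be readable without sudo.
    echo "not found (or not readable): $conf"
  fi
done
```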
Use a descriptive name for the cluster. Elasticsearch nodes use this name to form and join a cluster.
To uniquely identify a node in the cluster:
Custom attributes for a node
Add a rack attribute to a node to logically group nodes that sit in the same data center or on the same physical machine:
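The setting lines for the three items above are missing in this copy. In Elasticsearch 6.x they would normally be `cluster.name`, `node.name`, and a custom `node.attr.*` attribute. The sketch below writes a sample fragment to a scratch file so it can be reviewed before being copied into `elasticsearch.yml`. All values and the scratch file path are placeholders of mine.

```shell
# Write a sample fragment to a scratch file; on a real node these lines
# belong in elasticsearch.yml. All values are placeholders.
sample="${TMPDIR:-/tmp}/elasticsearch.sample.yml"

cat > "$sample" <<'EOF'
cluster.name: my-search-cluster
node.name: node-1
node.attr.rack: rack-1
EOF

cat "$sample"
```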
A node will bind to this hostname or IP address and advertise this host to other nodes in the cluster.
network.host: [_VPN_HOST_, _local_]
Elasticsearch does not come with authentication and authorization, so you should never bind network.host to a public IP address.
Cluster finding settings
To find and join a cluster, a node needs to know the hostnames or IP addresses of at least a few other nodes. These are set in the configuration file.
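The name of the setting is missing in this copy. In Elasticsearch 6.x the list of known hosts is normally given through `discovery.zen.ping.unicast.hosts` (7.x and later use `discovery.seed_hosts` instead). The helper below is a hypothetical convenience function that prints such a line from a list of hosts. The addresses are placeholders.

```shell
# Print a discovery line for elasticsearch.yml from a list of hosts.
unicast_hosts_line() {
  list=""
  for host in "$@"; do
    # Separate entries with a comma once the list is non-empty.
    if [ -n "$list" ]; then list="$list, "; fi
    list="$list\"$host\""
  done
  printf '%s\n' "discovery.zen.ping.unicast.hosts: [$list]"
}

# 10.0.0.11 and 10.0.0.12 are placeholder addresses.
unicast_hosts_line 10.0.0.11 10.0.0.12
```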
Changing the HTTP port
You can configure the port number on which Elasticsearch is accessible over HTTP in the configuration file.
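The setting itself is missing in this copy. In Elasticsearch it is normally `http.port`, which defaults to 9200. A small sketch, where the port 9201 and the `ES_HTTP_PORT` and `ES_CHECK` variables are assumptions for illustration:

```shell
# 9201 is an arbitrary example; 9200 is the default HTTP port.
port="${ES_HTTP_PORT:-9201}"

# Line to place in elasticsearch.yml
echo "http.port: $port"

# Optional check against a running local node; skipped unless ES_CHECK=1.
if [ "${ES_CHECK:-0}" = "1" ]; then
  curl -s "http://localhost:$port/"
fi
```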
Configuring JVM Options (Optional for Local/Test)
Tune the JVM options to match your hardware. It is advisable to allocate half of the server's available memory to the Elasticsearch heap and leave the rest to the operating system's file system cache, which Lucene relies on.
# For example, if your server has 8 GB of RAM, set the following properties:
-Xms4g
-Xmx4g
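The half-of-RAM rule can be sketched as a small calculation. The function name `heap_mb` is hypothetical. The 31 GB cap is an assumption that is not from this article: it follows the common Elastic guidance to keep the heap below roughly 32 GB so the JVM can keep using compressed object pointers.

```shell
# Suggest a heap size in MB: half of total RAM, capped at 31 GB.
# $1 = total RAM of the server in MB.
heap_mb() {
  half=$(( $1 / 2 ))
  cap=$(( 31 * 1024 ))
  if [ "$half" -gt "$cap" ]; then half=$cap; fi
  echo "$half"
}

# 8192 MB matches the 8 GB example above.
heap=$(heap_mb 8192)
printf '%s\n' "-Xms${heap}m" "-Xmx${heap}m"
```

Setting `-Xms` and `-Xmx` to the same value avoids heap resizing at runtime.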
Also, to avoid the performance hit of the heap being swapped out to disk, let Elasticsearch lock its memory with the bootstrap.memory_lock: true property.
By default, Elasticsearch uses the concurrent mark-sweep (CMS) garbage collector. You can switch to G1GC with the following settings:
-XX:-UseParNewGC
-XX:-UseConcMarkSweepGC
-XX:+UseCondCardMark
-XX:MaxGCPauseMillis=200
-XX:+UseG1GC
-XX:GCPauseIntervalMillis=1000
-XX:InitiatingHeapOccupancyPercent=35
sudo service elasticsearch restart
Tada! Elasticsearch is up and running on your local machine.
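One way to confirm that the node really is up is to ask it for the cluster health. The sketch below extracts the `status` field from a `_cluster/health` response. The function name `health_status`, the `ES_URL` variable, and the sample JSON are assumptions for illustration, and the sample is hand-written rather than captured from a real cluster.

```shell
# Extract the "status" field from a _cluster/health JSON response on stdin.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Hand-written, trimmed sample; not output captured from a real cluster.
sample='{"cluster_name":"my-search-cluster","status":"green","number_of_nodes":1}'
printf '%s\n' "$sample" | health_status

# Query a live node only when ES_URL is set, e.g. ES_URL=http://localhost:9200
if [ -n "${ES_URL:-}" ]; then
  curl -s "$ES_URL/_cluster/health" | health_status
fi
```

The status is one of green, yellow, or red. On a single node, yellow is expected as soon as an index with replicas exists, because a replica cannot be allocated on the same node as its primary shard.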
To have a production-grade setup, I would recommend visiting following articles.
Published at DZone with permission of Gaurav Rai Mazra, DZone MVB.