Elasticsearch Cluster with Vagrant and Virtualbox

This tutorial shows how to set up a cluster of virtual machines, with Virtualbox providing the VMs and Vagrant scripting their setup.

A simple way to simulate a distributed storage and compute environment is with Virtualbox as the provider of VMs ('Virtual Machines') and Vagrant as the front-end scripting engine to configure, start, and stop those VMs. The goal for this post is to build a clustered virtual appliance offering Elasticsearch as a service that can be consumed and controlled by a host machine. The artifacts used in this article can be downloaded from GitHub.

1. Background

Backend capacity scaling in the face of increasing front-end demand has generally been addressed by replacing weaker servers with more powerful ones in terms of CPU, RAM, and disk: so-called 'vertical scaling'. This is as opposed to 'horizontal scaling', where more servers are simply added to the mix to handle the extra demand. Intuitively, the latter model is appealing as it sounds like less work! In traditional RDBMS-centric applications there was little choice, and vertical scaling actually made sense because it is difficult to do joins across large distributed data tables. But vertical scaling has its limits and, more importantly, becomes very expensive well before hitting those limits. NoSQL databases that skimp on relations (the 'R' of RDBMS) to allow for simpler horizontal scaling have become the go-to datastores for applications that need to scale large, as in Facebook/Google large.

The reader is referred to Hadoop: The Definitive Guide, where Tom White goes over these scale issues in depth. Applications running on distributed storage and CPU have to deal with their own issues: keeping a CPU busy on the data that is 'local' to it, making sure that cluster members are aware of one another and know who has what piece of the data, and perhaps electing a leader/master as needed for coordination, writes, and so on. The implementation details vary across systems. We are not going to delve into all that here; our goals for this post are more pragmatic:

    1. Develop a means to run a virtual cluster of a few nodes ('guests'), where the guests for now are carved out of my laptop by Virtualbox. Later we will extend the same means to run services on a cluster of nodes provided by AWS.
    2. Install a distributed data store on this cluster of guests. We use Elasticsearch for now, so we can go through the mechanics.
    3. Confirm that this 'virtual Elasticsearch appliance' offers a completely controllable service from the host.

2. VirtualBox

We use Oracle's Virtualbox as the provider of guest virtual hosts. Virtualbox is free to use, runs very well on my Linux laptop (64-bit Ubuntu 15.04 with an 8-core i7 2.2GHz CPU and 16GB RAM), and has extensive documentation on how to control the various aspects of the hosts to be created. There are also prebuilt images of any number of open source Linux distributions that you can simply drop in for the guest OS. It offers a variety of networking options (sometimes daunting, as I found out) to expand or limit the accessibility and capability of the guests. For our purposes, we prefer a 'host-only', 'private' network with the following criteria.

  • The guests and hosts should be able to talk to each other. We want the guests to form a cluster and work together to enable a service. The host should be able to control & consume the services offered by the cluster of guests.
  • The guests should be able to access the internet. This is so they can download any OS updates, software packages they need in order to run whatever application.
  • The guests cannot be accessed from outside. This is just a made up requirement at this time as I do not want to expose the service to the outside. The host is the consumer of the service and it may roll that into its own service that it can offer to the outside if it so desires.
  • Lastly, for ease of use and portability, each guest should have an IP address & name 'assigned' at the time of its creation.

Installing Virtualbox and creating VMs of various kinds is quite straightforward. Based on a prebuilt image that I downloaded, I could set up a single VM just the way I wanted. I used NAT for adapter1 and host-only for adapter2, and activated the host-only interface on the VM. I wanted to clone it and build other guests, but I had trouble getting the networking right in a reliable and repeatable fashion. Networking was never my strong suit, and after playing with the networking options via both the GUI and the command line, I gave up trying to master it. I am sure the networking gurus out there can do it, so it is certainly not a limitation of Virtualbox but a limitation on my part.

But more to the point, I did not want to be logging into each guest to set things up or, worse, changing settings for each guest via the GUI that VirtualBox offers. That would definitely not scale, would be a pain to reproduce, and is error-prone. I wanted a turn-key solution of sorts wherein I could script out all aspects of the VM cluster creation upfront, and simply run it to have that cluster created with all the tools installed, started and raring to go.

Vagrant allows one to do that easily, as I happily found out. Basically, its developers have already figured out the exact sequence of 'vboxmanage' commands (and their options!) needed to set up a cluster specified by some high-level requirements, which is exactly what I was trying to do. Plus, as the cluster setup with Vagrant is file-based, we can version it and share it (it is small compared to an OVA file) to have the cluster reproduced exactly elsewhere. Maybe I am biased because of the issues I had with the networking setup, but the reader is referred to discussions similar to Why Vagrant? or Why should I use Vagrant instead of just VirtualBox? The real appeal of Vagrant in the end for me was that it can work with other VM providers such as AWS and VMware via plugins, so the same config files and scripts can be reused largely by changing the provider name. Carving resources out of my laptop to build VMs is fine here for getting the mechanics down, but it is not going to give a performant cluster!

3. Vagrant

Having spent a lot of words trying to get here, we jump right in with no further ado. We prepare a text file by the name 'Vagrantfile' with high-level details on the cluster we are going to build. Running vagrant init at the command prompt will generate a sample file that can be edited to our liking. Here is what our file looks like to meet the requirements laid out in section 2.

 Vagrantfile 

# -*- mode: ruby -*-
# vi: set ft=ruby :
nguests = 2
box = "hashicorp/precise64"
memory = 8256/nguests # memory per box in MB
cpuCap = 100/nguests
ipAddressStart = '192.168.1.5'
Vagrant.configure("2") do |config|
  (1..nguests).each do |i|
    hostname = 'guest' + i.to_s
    ipAddress = ipAddressStart + i.to_s
    config.vm.define hostname do |hostconfig|
      hostconfig.vm.box = box
      hostconfig.vm.hostname = hostname
      hostconfig.vm.network :private_network, ip: ipAddress
      hostconfig.vm.provider :virtualbox do |vb|
        vb.customize ["modifyvm", :id, "--cpuexecutioncap", cpuCap, "--memory", memory.to_s]
      end
      hostconfig.vm.provision :shell, path: "scripts/bootstrap.sh", args: [nguests, i, memory, ipAddressStart]
    end
  end
end

It is a Ruby script but one does not need to know a lot of Ruby, which I did not. Here is a quick rundown on what it does.

  • We want to set up a 2 node cluster (Line #3).
  • We choose hashicorp/precise64 as the OS image on each guest. Vagrant downloads the image if it is not already in its local box cache, and the VMs built from it land in VirtualBox's 'Default Machine Folder'. (Line #4)
  • My laptop has 16 GB of RAM and I want to leave about 8 GB for the host at all times. The rest (8256 MB) is divided equally among the guests. Likewise, each guest is limited to a fraction of the CPU. (Lines #5, #6)
  • We loop over each guest:
    • We set its image (#13) and name (#14).
    • We choose the 'private_network' mode and set the IP address (#15). This gives us the network model we wanted in section 2.
    • Line #19 is about provisioning the VM with tools and apps. This is extremely powerful and handy: we can automate bringing up each member of the cluster with just the apps we want that guest to be responsible for. There is no need to ssh to each guest and go through separate installs, which is a great time saver! Besides simple shell scripts, Vagrant allows for other provisioning mechanisms such as Docker, Chef, Ansible, and Puppet. Here we use a shell script, 'bootstrap.sh', to which we pass the arguments we need to set up Elasticsearch.
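To make the arithmetic in the Vagrantfile concrete, here is a minimal shell sketch that prints the name, IP address, memory (MB), and CPU cap (percent) each guest would get. The function name guest_plan is made up for this illustration and is not part of Vagrant or of the article's repository; the default values are the ones from the Vagrantfile above.

```bash
#!/usr/bin/env bash
# Sketch: print the settings the Vagrantfile loop derives for each guest.
# guest_plan is a hypothetical helper, not part of Vagrant or the article's repo.
guest_plan() {
  nguests=$1
  total_mb=${2:-8256}            # RAM pool shared by all guests, in MB
  ip_start=${3:-192.168.1.5}     # the guest number is appended as text
  i=1
  while [ "$i" -le "$nguests" ]; do
    # Integer division, as in the Ruby code: 8256/2 = 4128 MB, 100/2 = 50 percent
    echo "guest$i ${ip_start}$i $((total_mb / nguests)) $((100 / nguests))"
    i=$((i + 1))
  done
}

guest_plan 2
```

For two guests this prints guest1 at 192.168.1.51 and guest2 at 192.168.1.52, each with 4128 MB and a 50 percent CPU cap. Because the guest number is appended to the address as text, the scheme only yields valid addresses for up to nine guests; a tenth would come out as 192.168.1.510.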

That is all for Vagrant, really. The rest is all good old shell scripting, at which we are old hands. Fabulous! Once the scripts are ready, we run vagrant up to have the cluster come up, do our work, and run vagrant halt to power the cluster down. Until we run vagrant destroy, the cluster will retain its apps/config/data, so we can run vagrant up anytime to use the cluster and its services.

4. Provisioning Elasticsearch

This is fairly straightforward. The key thing to know is that Vagrant automatically enables one shared directory between the host & guests. That is the directory where the file 'Vagrantfile' is located. On the guests, this directory is accessed as '/vagrant'. So if we have file 'a/b/c/some_file' at the location where 'Vagrantfile' is on the host, that 'some_file' can be accessed on the guest as '/vagrant/a/b/c/some_file'. We use this feature to share pre-downloaded packages we need to install on guests, and any scripts we want to run, post boot time. The bootstrap.sh  script is as follows.

#!/usr/bin/env bash

nguests=$1
guestNumber=$2
memory=$3
ipAddressStart=$4

# Install some utilities that we will need
apt-get -y install unzip
apt-get -y install curl

# Install java
mkdir -p /opt/software/java
cd /opt/software/java ; tar zxvf /vagrant/tools/jdk-8u65-linux-x64.tar.gz

# Install & Start up elasticsearch
/vagrant/scripts/elastic.sh $nguests $guestNumber $memory $ipAddressStart

We install some utilities we will need in lines #9 and #10, and install Java from the shared location in lines #13 and #14. Finally, in line #17, we run the script below to install Elasticsearch.

 elastic.sh 

#!/usr/bin/env bash

usage="Usage: elastic.sh nguests thisguest memory ipAddressStart. Need the number of guests in the cluster, this guest number, es-heap memory in MB like 2048m, and startingIp like 192.168.0.5 if clustered ... "
# Install Elastic,  Configure & Start
function setUnicastHosts() {
  local unicast_guests="discovery.zen.ping.unicast.hosts: ["
  for i in $(seq 1 $nguests); do
    unicast_guests+='"guest-es'$i
    unicast_guests+=':9310"'
    if [ "$i" -ne "$nguests" ]; then
      unicast_guests+=','
    fi
  done
  unicast_guests+=']'
  echo "$unicast_guests"
}
# Add to /etc/hosts for convenience & restart networking...
function setEtcHosts() {
  guest_list=""
  for i in $(seq 1 $nguests); do
          guest_list+=$ipAddressStart$i' guest-es'$i$'\n'
  done
  echo "$guest_list" > guests_to_be_added
  cat /etc/hosts guests_to_be_added > tmp ; mv tmp /etc/hosts
  /etc/init.d/networking restart
}
if [ "$#" -eq 4 ]; then
  nguests=$1
  thisguest=$2
  memory=$(expr $3 / 2)
  memory+="m"
  ES_HEAP_SIZE=$memory
  ipAddressStart=$4
  ES_HOME=/opt/software/elasticsearch/elasticsearch-1.7.2
  mkdir -p /opt/software/elasticsearch
  cd /opt/software/elasticsearch ; unzip /vagrant/tools/elasticsearch-1.7.2.zip
  cp /vagrant/elastic/start-node.sh $ES_HOME
  cp /vagrant/elastic/stop-node.sh $ES_HOME
  cp /vagrant/elastic/elasticsearch.yml $ES_HOME/config
  guest_name="guest-es"$thisguest
  node_name=$guest_name"-node1"
  unicast_guests=$(setUnicastHosts)
  if [ "$thisguest" -eq 1 ]; then
    mkdir -p $ES_HOME/plugins/kopf
    cd $ES_HOME/plugins/kopf ; tar zxvf /vagrant/elastic/kopf.tar.gz
  fi
  perl -0777 -pi -e "s|ES_HOME=/opt/elasticsearch|ES_HOME=$ES_HOME|" $ES_HOME/start-node.sh
  perl -0777 -pi -e "s/ES_HEAP_SIZE=2g/ES_HEAP_SIZE=$memory/" $ES_HOME/start-node.sh
  perl -0777 -pi -e "s/host_name=localhost/host_name=$guest_name/" $ES_HOME/start-node.sh
  perl -0777 -pi -e "s/host_name=localhost/host_name=$guest_name/" $ES_HOME/stop-node.sh
  perl -0777 -pi -e "s/node_name=node0/node_name=$node_name/" $ES_HOME/start-node.sh
  perl -0777 -pi -e "s/$/\n$unicast_guests/" $ES_HOME/config/elasticsearch.yml
else
  echo $usage
  exit 1
fi
setEtcHosts
$ES_HOME/start-node.sh
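The two provisioning scripts read several pre-downloaded files from the shared '/vagrant' folder, so a missing file only shows up partway through provisioning. A small check run on the host beforehand can catch that earlier. The sketch below is one way to do it; the helper name missing_artifacts is hypothetical, and the paths are simply the ones referenced in bootstrap.sh and elastic.sh, relative to the folder that holds the Vagrantfile.

```bash
#!/usr/bin/env bash
# Sketch: list files the provisioning scripts expect but that are missing.
# missing_artifacts is a hypothetical helper; paths are relative to the
# folder that contains the Vagrantfile.
missing_artifacts() {
  project_dir=${1:-.}
  rc=0
  for f in \
    scripts/bootstrap.sh \
    scripts/elastic.sh \
    tools/jdk-8u65-linux-x64.tar.gz \
    tools/elasticsearch-1.7.2.zip \
    elastic/start-node.sh \
    elastic/stop-node.sh \
    elastic/elasticsearch.yml \
    elastic/kopf.tar.gz
  do
    if [ ! -f "$project_dir/$f" ]; then
      echo "missing: $f"
      rc=1
    fi
  done
  return "$rc"
}

# Check the current folder; a non-zero return means something is missing.
missing_artifacts . || echo "fix the above before bringing the cluster up"
```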

An Elasticsearch node is a running instance of Elasticsearch, and a server can run multiple instances, resources permitting of course. All the nodes that are part of a cluster have the same ‘cluster.name’. Starting with some boilerplate configuration files that are shared between the host and guests, elastic.sh modifies them based on the arguments passed to each guest during provisioning. The file ‘config/elasticsearch.yml’ on every guest node is augmented with a list of all members of the cluster:

 discovery.zen.ping.unicast.hosts: ["guest-es1:9310","guest-es2:9310"]
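The per-guest rewriting of the start script can be replayed in isolation. The sketch below applies the same three substitutions that elastic.sh makes to start-node.sh, using sed in place of the perl one-liners. The function name render_start_node is hypothetical, and the three-line stand-in template is an assumption inferred from the search patterns in elastic.sh; the real start-node.sh template is not shown in this post.

```bash
#!/usr/bin/env bash
# Sketch: replay elastic.sh's substitutions on a stand-in template.
# render_start_node is a hypothetical helper; the template lines fed to it
# are assumptions inferred from the patterns used in elastic.sh.
render_start_node() {
  thisguest=$1
  guest_mb=$2                      # per-guest memory passed in by the Vagrantfile
  heap="$((guest_mb / 2))m"        # elastic.sh gives half of it to the ES heap
  sed -e "s/ES_HEAP_SIZE=2g/ES_HEAP_SIZE=$heap/" \
      -e "s/host_name=localhost/host_name=guest-es$thisguest/" \
      -e "s/node_name=node0/node_name=guest-es$thisguest-node1/"
}

printf 'ES_HEAP_SIZE=2g\nhost_name=localhost\nnode_name=node0\n' | render_start_node 2 4128
```

For guest 2 with 4128 MB, the heap comes out as 2064m, the host name as guest-es2, and the node name as guest-es2-node1.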

The function setEtcHosts  appends

  • 192.168.1.51 guest-es1

  • 192.168.1.52 guest-es2

to the ‘/etc/hosts’ file on each guest and restarts networking. The script start-node.sh, as prepared for ‘guest2’, runs the following command to start up the Elasticsearch node ‘guest-es2-node1’.

 start-node.sh 

/opt/software/elasticsearch/elasticsearch-1.7.2/bin/elasticsearch -d \
  -Des.cluster.name=es-dev \
  -Des.node.name=guest-es2-node1 \
  -Des.http.port=9210 \
  -Des.transport.tcp.port=9310 \
  -Des.path.data=/opt/software/elasticsearch/elasticsearch-1.7.2/data \
  -Des.path.logs=/opt/software/elasticsearch/elasticsearch-1.7.2/logs \
  -Des.path.plugins=/opt/software/elasticsearch/elasticsearch-1.7.2/plugins \
  -Des.path.conf=/opt/software/elasticsearch/elasticsearch-1.7.2/config \
  -Des.path.work=/opt/software/elasticsearch/elasticsearch-1.7.2/tmp \
  -Des.network.host=guest-es2 -Des.network.publish_host=guest-es2 \
  -p /opt/software/elasticsearch/elasticsearch-1.7.2/pid

where ‘es-dev’ is the name of the cluster we are building. The command on ‘guest1’ to start ‘guest-es1-node1’ would be identical to the above, except for replacing ‘es2’ with ‘es1’. 
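One caveat on the setEtcHosts step in elastic.sh: it concatenates the guest entries onto ‘/etc/hosts’ every time it runs, so provisioning a guest a second time (for example with vagrant provision) would add the same entries again. A minimal sketch of an append that is safe to repeat follows. The helper name append_once is hypothetical and not part of the article's scripts, and the demo writes to a temporary file rather than the real ‘/etc/hosts’.

```bash
#!/usr/bin/env bash
# Sketch: append a line to a file only if that exact line is not already there.
# append_once is a hypothetical helper, not part of the article's scripts.
append_once() {
  line=$1
  file=$2
  # -q quiet, -x match the whole line, -F treat the text as a fixed string
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

hosts_copy=$(mktemp)                 # stand-in for /etc/hosts
append_once '192.168.1.51 guest-es1' "$hosts_copy"
append_once '192.168.1.52 guest-es2' "$hosts_copy"
append_once '192.168.1.51 guest-es1' "$hosts_copy"   # repeat changes nothing
cat "$hosts_copy"
rm -f "$hosts_copy"
```

The file ends up with two lines no matter how many times the calls are repeated.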

We fire up our virtual Elasticsearch cluster simply by running vagrant up. Because we installed the ‘kopf’ plugin on ‘guest1’ during provisioning, we can verify from the host that the cluster is up, accessible, and ready to be put to work.
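Besides the kopf page (site plugins in Elasticsearch 1.x are normally served under /_plugin/, so presumably http://192.168.1.51:9210/_plugin/kopf here), the cluster can also be checked from the host's command line through the _cluster/health endpoint. The sketch below assumes guest1's address and HTTP port as configured in this post; the function names parse_status and cluster_status are hypothetical, and the JSON fed to parse_status at the end is a made-up sample rather than captured output.

```bash
#!/usr/bin/env bash
# Sketch: ask one node for cluster health from the host.
# parse_status and cluster_status are hypothetical helper names.

# Pull the value of "status" out of the JSON that _cluster/health returns.
parse_status() {
  sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p'
}

# Defaults are guest1's address and HTTP port as set up in this post.
cluster_status() {
  curl -s "http://${1:-192.168.1.51}:${2:-9210}/_cluster/health" | parse_status
}

# Works without a running cluster: parse a made-up sample response.
echo '{"cluster_name":"es-dev","status":"green","number_of_nodes":2}' | parse_status
```

With the cluster running, calling cluster_status with no arguments should report green, yellow, or red.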

We shut the cluster off by running vagrant halt. Whenever we are ready to work with it again from the host, we simply run vagrant up and the cluster will be back up. Success! We have put in place a mechanism to bring up Elasticsearch as a service, as needed, on a virtual cluster.

That is all for this post. In future posts, we will look at extending this to create appliances on AWS so we can do real work.

Topics:
devops ,elasticsearch ,vagrant ,virtualbox ,cluster ,host machine ,cloud

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
