
Cassandra Cluster Install on Ubuntu 18.04 for Big Data

In this post, I'm going to install a complete 'production ready' Apache Cassandra cluster of three nodes.

This post is an expansion on a previous post, Install Cassandra on Ubuntu 18.04, where I set up and configured a single Cassandra server. The problem is that it wasn't production ready and couldn't hold large volumes of data. I wanted to store big data using Cassandra, and this means data larger than 1 TB. So this cluster will consist of three nodes, each with 500 GB of space available for storage.

What Is Cassandra?

Cassandra is a distributed database that is highly available and can store massive amounts of data. It is used by large companies like Netflix, Apple, and eBay to manage their data with millions of requests a day.

Cassandra is very fault-tolerant. It can be scaled to hundreds or thousands of nodes where data is automatically replicated. Cassandra also supports replication across data centers, so with that configured your data stays safe even if you lose an entire data center. Best of all, Cassandra is decentralized, meaning that there is no single point of failure. Failed nodes can be replaced without any downtime.

SMACK Stack

Where does Apache Cassandra fit in our SMACK stack? Cassandra fills the role of long-term database storage. We use Spark to perform data analytics in memory, which is super fast, but we still need to store some long-term data on disk. This is where Cassandra comes in. Applications developed for our SMACK stack will use Cassandra to store data on disk.

Cassandra Cluster Install

To install this cluster, I have provisioned three servers in VMware. All three have 4 vCPUs, 4 GB of memory, and 500 GB of data storage.

We will be using this cluster for future big data articles where I will take server logs, send them to a pub/sub, perform some munging of the data, and save it to a table in Cassandra where I can run some simple CQL queries (Cassandra's SQL-like query language) to go through my logs.

To install the cluster, we need to get Cassandra installed on all three servers, so follow the next few sections on every server. We will then start the services and join each node to the cluster. We will finish up the post by creating a keyspace and table for our data.

Install Cassandra on Ubuntu 18.04

In order to install Cassandra on Ubuntu 18.04, we need to get some prerequisites out of the way first. Repeat these steps on each of the three servers. Make sure that your server is fully updated.

# apt update && apt upgrade -y

Next, we will need to ensure we have Java 8 and Python 2.7 installed.

# apt install openjdk-8-jdk -y
# apt install python -y
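
Since these steps are repeated on three servers, a quick check that the prerequisites actually landed on the PATH can save some head scratching later. This is only a minimal sketch; the function name require_cmds is made up for illustration and is not part of any package.

```shell
#!/bin/sh
# Report any prerequisite command that is missing from PATH.
# Returns 0 when every named command is found, 1 otherwise.
# require_cmds is a hypothetical helper name.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example for this post: require_cmds java python
```

This only confirms the commands exist. To confirm the versions as well, run java -version and python --version by hand.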

In the next section, we will begin installing Cassandra.

Download Cassandra

We will download the latest version of Cassandra which as of this writing is version 3.11.2.

Note: These versions change all the time, so if this link doesn't work, browse to http://apache.claz.org/cassandra, find the latest bin version, and copy the hyperlink for the wget command below:

# wget http://apache.claz.org/cassandra/3.11.2/apache-cassandra-3.11.2-bin.tar.gz

Untar the package and move to a better home:

# tar -xzvf apache-cassandra-3.11.2-bin.tar.gz
# mv apache-cassandra-3.11.2 /usr/local/cassandra

Next we will create a user for Cassandra to run as.

Creating a Cassandra User

Cassandra doesn't like to run as root, so we will need to create a cassandra user and group so that we can set the correct permissions. On Ubuntu, useradd on its own already creates a group with the same name, so we create the group first and assign it explicitly.

#  groupadd cassandra
#  useradd -g cassandra cassandra
#  chown root:cassandra -R /usr/local/cassandra/
#  chmod g+w -R /usr/local/cassandra/

Cassandra SystemD Service

Create a new file /etc/systemd/system/cassandra.service and add the following contents:

[Unit]
Description=Cassandra Database Service
After=network-online.target
Requires=network-online.target

[Service]
User=cassandra
Group=cassandra
ExecStart=/usr/local/cassandra/bin/cassandra -f

[Install]
WantedBy=multi-user.target

Notice that we configured the user and group as cassandra in the Service section. This tells SystemD to run our service as the user that we created earlier. Next we will reload SystemD and enable our new service to start on system boot.

# systemctl daemon-reload
# systemctl enable cassandra.service

We are not going to start our service yet because we need to configure it first and do some performance tuning, since we are using this in 'production'.

Configuring Cassandra

In order to connect to our Cassandra server, we will need to configure it. We will be changing settings in the main Cassandra configuration file located at /usr/local/cassandra/conf/cassandra.yaml. Open this file and make the following changes.

First, we will need to change the listening address for Cassandra. Set this to the IP address of your server. In my case this is 192.168.1.47.

listen_address: 192.168.1.47

Next, we need to set the RPC listen address so that remote connections from CQLSH will work. Again we set this to the IP address of our server.

rpc_address: 192.168.1.47

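While we are here, change the cluster name to something better. Use the same name on all three servers, because nodes only join a cluster whose name matches their own: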

cluster_name: 'AdminTome Cluster'

Lastly, we need to update the seeds parameter.

For this post we are simulating installing a Cassandra cluster in a single data center and a single rack. Seed nodes are the contact points that other nodes use to discover the cluster and bootstrap into it; they do not otherwise control the cluster. Typically, you will want at least one seed per rack per data center. You never want to have all nodes act as seeds.

For this cluster, the first node is our seed node, so we will put its IP address (192.168.1.47) for this value on all three of our servers.

seed_provider:
    # Addresses of hosts that are deemed contact points. 
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring.  You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "192.168.1.47"

Save the file and exit.
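
Because the same four values have to be set on every server, with only the node's own IP changing, the edits can be scripted. The following is a sketch only: configure_node is a hypothetical helper, and it assumes a stock cassandra.yaml in which cluster_name, listen_address, and rpc_address are uncommented and start at the beginning of a line. Try it on a copy of the file first.

```shell
#!/bin/sh
# Hypothetical helper: set the per-node values in a cassandra.yaml file.
# Usage: configure_node <yaml-file> <node-ip> <seed-ip> <cluster-name>
# Assumes the values contain no "|" or "&" characters (they are special to sed).
configure_node() {
    yaml="$1"; node_ip="$2"; seed_ip="$3"; cluster="$4"
    tmp="$(mktemp)"
    sed \
        -e "s|^cluster_name:.*|cluster_name: '${cluster}'|" \
        -e "s|^listen_address:.*|listen_address: ${node_ip}|" \
        -e "s|^rpc_address:.*|rpc_address: ${node_ip}|" \
        -e "s|^\([[:space:]]*- seeds:\).*|\1 \"${seed_ip}\"|" \
        "$yaml" > "$tmp" && cat "$tmp" > "$yaml"   # cat keeps the file's owner and mode
    rm -f "$tmp"
}
```

On the second node in this post, for example, the call would be configure_node /usr/local/cassandra/conf/cassandra.yaml 192.168.1.48 192.168.1.47 'AdminTome Cluster'. The seed IP stays the same on every node.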

Cassandra Performance Tuning

Now that we have our basic configuration completed, we need to configure some performance tuning parameters on our OS.

Disable Swap

The first thing we have to do is disable swap.

# swapoff -a

Also be sure to disable it in your /etc/fstab file so that swap stays off after a reboot. Comment out the line for the swap mount:

#/swap.img none swap sw 0 0
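
If you would rather not edit fstab by hand on three servers, the same change can be sketched as a small function. comment_out_swap is a hypothetical name, and the function takes the file path as an argument so that you can try it on a copy before pointing it at /etc/fstab.

```shell
#!/bin/sh
# Comment out every active swap entry in an fstab-style file.
# comment_out_swap is a hypothetical helper, not a standard tool.
comment_out_swap() {
    fstab="$1"
    tmp="$(mktemp)"
    # An active swap entry is an uncommented line whose third field
    # (the filesystem type) is "swap". Everything else passes through.
    awk '$1 !~ /^#/ && $3 == "swap" { print "#" $0; next } { print }' \
        "$fstab" > "$tmp" && cat "$tmp" > "$fstab"
    rm -f "$tmp"
}
```

Running it a second time changes nothing, because lines that are already commented out are skipped.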

Configure Limits

Now we need to set up our system limits so that Cassandra runs smoothly.

Create /etc/security/limits.d/cassandra.conf and add the following contents:

cassandra - memlock unlimited 
cassandra - nofile 100000 
cassandra - nproc 32768
cassandra - as unlimited

Save and exit.
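
One caveat worth checking on your own system: files under /etc/security/limits.d are applied by PAM to login sessions, and as far as I know a service started by systemd does not pass through PAM by default, so these limits may not reach the Cassandra service itself. A sketch of the systemd equivalent follows. The helper name write_limits_dropin and the drop-in file name limits.conf are my own choices; the Limit* settings are standard systemd directives that mirror the four lines above.

```shell
#!/bin/sh
# Write a systemd drop-in that mirrors the limits.d entries above.
# For the unit in this post the directory would be
# /etc/systemd/system/cassandra.service.d (assumed, conventional location).
write_limits_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/limits.conf" <<'EOF'
[Service]
LimitMEMLOCK=infinity
LimitNOFILE=100000
LimitNPROC=32768
LimitAS=infinity
EOF
}
```

After writing the drop-in, run systemctl daemon-reload. Once Cassandra is running, the limits the process really has can be read from /proc/<pid>/limits.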

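Configure Memory Map Limit

Cassandra memory-maps its data files, so we need to raise the kernel limit on the number of memory map areas a process may have. Open /etc/sysctl.conf and add this line at the bottom: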

vm.max_map_count = 131072

Save and exit.

And since we don't want to have to reboot, run this command to set it immediately:

# sysctl vm.max_map_count=131072
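
Appending to /etc/sysctl.conf by hand is fine once, but if you rerun your setup steps you can end up with duplicate entries. Here is a small sketch of an idempotent version. set_sysctl_entry is a hypothetical helper that takes the file path as its first argument so it can be tested on a copy.

```shell
#!/bin/sh
# Set "key = value" in a sysctl-style file: remove any existing uncommented
# entry for the key, then append the new one. Hypothetical helper.
# Note: dots in the key are treated as regex wildcards by grep, which is
# close enough for a sketch but not strictly exact.
set_sysctl_entry() {
    conf="$1"; key="$2"; value="$3"
    tmp="$(mktemp)"
    # grep -v exits non-zero when it prints nothing; that is not an error here.
    grep -v "^[[:space:]]*${key}[[:space:]]*=" "$conf" > "$tmp" || true
    echo "${key} = ${value}" >> "$tmp"
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}
```

For this post the call would be set_sysctl_entry /etc/sysctl.conf vm.max_map_count 131072, followed by the sysctl command above to apply the value without a reboot.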

We now have everything ready to start our server.

Starting Our Cassandra Cluster

We will be starting our Cassandra services one at a time, beginning with our first server, which is the seed server.

Now start your Cassandra service:

# systemctl start cassandra
# systemctl status cassandra

This will show that the service is started.

You can also follow the logs.

# journalctl -f -u cassandra.service

You are now able to connect to your Cassandra database remotely. Simply download the Cassandra package on your development system just like we did on the server, then connect with cqlsh:

bill@admintome:~/Downloads/cassandra$ bin/cqlsh cassandra.admintome.lab
Connected to AdminTome Cluster at cassandra.admintome.lab:9042.
[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> select cluster_name, listen_address from system.local;
cluster_name      | listen_address
------------------+----------------
AdminTome Cluster | 192.168.1.47
(1 rows)
cqlsh>

Check our cluster status. Run the following command from the command line.

root@cass1:~# /usr/local/cassandra/bin/nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  192.168.1.47  108.61 KiB  256          100.0%            5a82d2b5-3ad5-4c41-93cb-48c5b60deb27  rack1

We see our server is (U)p and its state is (N)ormal.

Now we can start the Cassandra service on the other two nodes, one at a time, waiting for each to show as UN before starting the next.

After the services are up, we can run the command again, and we should see that our cluster is fully up and operational.

root@cass1:~# /usr/local/cassandra/bin/nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  192.168.1.47  108.61 KiB  256          67.1%             5a82d2b5-3ad5-4c41-93cb-48c5b60deb27  rack1
UN  192.168.1.48  112.97 KiB  256          65.0%             eabb7fd9-18c2-4e44-bcd2-5fcc1be3a83e  rack1
UN  192.168.1.49  40.75 KiB  256          67.8%             ef978de6-f733-4863-9ad9-435c1504d9b8  rack1
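
Rather than rerunning nodetool status by hand while nodes join, you can count the UN lines in its output. The sketch below assumes the output format shown above, where the first column is the two-letter status and state code. The function names and the NODETOOL variable are my own, not part of Cassandra.

```shell
#!/bin/sh
# Count nodes reported as Up/Normal ("UN") in `nodetool status` output on stdin.
count_up_normal() {
    awk '$1 == "UN" { n++ } END { print n + 0 }'
}

# Poll every 10 seconds until at least <expected> nodes are UN.
# NODETOOL is a hypothetical override; it defaults to this post's install path.
wait_for_cluster() {
    expected="$1"
    nodetool="${NODETOOL:-/usr/local/cassandra/bin/nodetool}"
    until [ "$("$nodetool" status 2>/dev/null | count_up_normal)" -ge "$expected" ]; do
        sleep 10
    done
}
```

For this cluster, wait_for_cluster 3 returns once all three nodes are up and normal. Note that it has no timeout, so interrupt it if a node never joins.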

Now that the cluster is up let's create our table.

Create a keyspace and table

First we will create our keyspace, which is like a namespace in Cassandra and specifies a replication strategy for all tables in that namespace.

Start up CQLSH again and run this command:

CREATE KEYSPACE admintome WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};
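
It is worth pausing on what a replication factor of 3 means for storage on a three-node cluster: every node holds a copy of every row, so the amount of unique data the cluster can hold is roughly one node's worth. The arithmetic below is a rough estimate only. usable_capacity_gb is a hypothetical helper, and it ignores compaction headroom, snapshots, and compression.

```shell
#!/bin/sh
# Rough usable capacity in GB: raw capacity divided by the replication factor.
# Integer arithmetic; a back-of-the-envelope estimate, not a sizing tool.
usable_capacity_gb() {
    nodes="$1"; per_node_gb="$2"; rf="$3"
    echo $(( $nodes * $per_node_gb / $rf ))
}

usable_capacity_gb 3 500 3   # this post's cluster: prints 500
usable_capacity_gb 3 500 2   # same hardware with replication_factor 2: prints 750
```

To keep three copies of more than 1 TB of data, you would need more nodes or larger disks than the three 500 GB servers used here.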

Next, we will create our table in this keyspace:

CREATE TABLE admintome.logs ( id UUID, datetime text, source text, type text, log text, PRIMARY KEY (id) );

You can now run a CQL SELECT on the table and see that there is nothing there yet, but it proves that the table is created and ready to go.

cqlsh> create keyspace admintome with replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
cqlsh> create table admintome.logs ( id UUID, datetime text, source text, type text, log text, PRIMARY KEY (id) );
cqlsh> select * from admintome.logs;

 id | datetime | log | source | type
----+----------+-----+--------+------

(0 rows)
cqlsh>

There you have it. Our 'production' Cassandra cluster is up and ready to go.

Well, Actually It's Not

With a normal RDBMS, our table would work fine. But Cassandra is no ordinary database. It's a distributed database, so we have to keep that in mind when we model our table (I found all this out the hard way). For now, I will cover how to create the table the correct way. In a later article I will go over why this is the case.

DROP TABLE admintome.logs;
CREATE TABLE admintome.logs (
    log_source text,
    log_type text,
    log_id timeuuid,
    log text,
    log_datetime text,
    PRIMARY KEY ((log_source, log_type), log_id)
) WITH CLUSTERING ORDER BY (log_id DESC);

Now when we take our log data from Kafka and insert it into our table, the logs in each (log_source, log_type) partition are stored newest first, so we can read them back in time order.
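
To try the new table before any pipeline exists, you can insert a row by hand. The sketch below builds the INSERT statement as text. The function names cql_quote and log_insert_cql are made up for illustration, while now() is the CQL function that generates a timeuuid on the server. This is a manual test aid, not how the later articles load data.

```shell
#!/bin/sh
# Escape single quotes for a CQL string literal by doubling them.
cql_quote() {
    printf '%s' "$1" | sed "s/'/''/g"
}

# Print an INSERT for admintome.logs.
# Usage: log_insert_cql <source> <type> <message> <datetime-text>
log_insert_cql() {
    printf "INSERT INTO admintome.logs (log_source, log_type, log_id, log, log_datetime) VALUES ('%s', '%s', now(), '%s', '%s');\n" \
        "$(cql_quote "$1")" "$(cql_quote "$2")" "$(cql_quote "$3")" "$(cql_quote "$4")"
}
```

The statement can then be passed to cqlsh with its -e option, for example bin/cqlsh cassandra.admintome.lab -e "$(log_insert_cql web01 nginx 'GET / 200' '2018-06-01 12:00:00')". The values web01 and nginx are placeholders. Reading the rows back needs both partition key columns in the WHERE clause, as in SELECT * FROM admintome.logs WHERE log_source = 'web01' AND log_type = 'nginx';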

Topics:
big data, apache cassandra, cassandra cluster, ubuntu, tutorial

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
