Setting Up RecoverX: Cloud-Native Data Protection

This step-by-step guide walks you through using Datos IO's RecoverX to back up and protect your data, in this case a Cassandra database on Google Cloud Platform.

This tutorial shows you how to set up Datos IO RecoverX, a cloud-native data protection product, on Google Cloud Platform. Follow it to deploy and configure RecoverX to protect your Cassandra (Apache or DataStax) database cluster. The tutorial assumes that your Cassandra database is already deployed and fully operational.

You generally deploy RecoverX in the same project as the Cassandra database that you need to protect. If you deploy RecoverX in a different project, you need to allow SSH connections from all RecoverX nodes to the compute nodes on which Cassandra is deployed. RecoverX connects to the Cassandra nodes over SSH, uses standard Cassandra APIs to take snapshots, and streams the snapshots in parallel to Google Cloud Storage. After the data is copied, RecoverX processes it to create a single golden copy of the database that is cluster-consistent and has no replicas. The following diagram shows a representative deployment:

[Diagram: representative RecoverX deployment]


Objectives

  • Provision infrastructure to deploy RecoverX software.
  • Configure the Cassandra database.
  • Configure the RecoverX compute nodes.
  • Deploy RecoverX and connect from a remote location.


Costs

  • Datos IO RecoverX is scale-out software that is deployed in a clustered configuration on three Compute Engine instances of type n1-standard-8.
  • In addition, you need to provision a Cloud Storage bucket for storing the backup data. The storage capacity required depends on your database size, change rate, retention time, and other factors.
  • Use the Pricing Calculator to generate a cost estimate based on your projected usage.
  • In addition to the cloud platform infrastructure costs, the RecoverX software is licensed directly through Datos IO, based on the physical size, in terabytes, of the database that needs to be protected.
  • Contact info@datos.io with any pricing-related questions.

Before You Begin

  1. Select or create a Cloud Platform Console project.
  2. Enable billing for your project.

Creating a Compute Engine Instance and Cloud Storage Bucket

The compute nodes for Cassandra and the compute nodes for RecoverX both need read and write permissions on the Cloud Storage bucket that is used as secondary storage. Follow the steps below to create the RecoverX compute instances and provision secondary storage with the correct permissions:

  1. Configure an IAM role or ACL for a service account to allow read and write access to the Cloud Storage bucket, as listed in the Access Control Options. You can use the same service account that was used to create the Compute Engine instances for the Cassandra nodes. Use the role assignment: Editor.
  2. Create three Compute Engine instances of the type n1-standard-8 using this service account.
  3. Select CentOS 6.
  4. In the Firewall section, select Allow HTTPS traffic.
  5. Create an SSD disk (blank disk) with a capacity of at least 140 GB for each Compute Engine instance.
  6. Select an appropriate network.
  7. Run the following command to create a filesystem on the disk and mount it at /home. Replace [VOLUME_NAME] and [RECOVERX_NODE_NAME] with the appropriate values:
    gcloud compute ssh [RECOVERX_NODE_NAME] --command 'sudo mkfs -t ext4 [VOLUME_NAME]; sudo mkdir -p /home; sudo mount [VOLUME_NAME] /home'
  8. Create a Cloud Storage bucket using this service account.
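
The provisioning steps above can also be scripted. The following is a minimal sketch rather than an official Datos IO procedure: it only prints gcloud and gsutil commands so that you can review them before running anything. The zone, instance names, disk names, and bucket name are hypothetical placeholders, and the image (CentOS 6), firewall setting, network, and service account from the earlier steps still have to be added to match your project.

```shell
#!/bin/sh
# Sketch: print (not run) the commands that provision the RecoverX nodes.
# ZONE and every resource name below are assumptions for illustration only.
ZONE="us-central1-a"

print_node_commands() {
    node="$1"   # hypothetical instance name, for example recoverx-1
    # One n1-standard-8 instance; the tutorial requires three of these.
    echo "gcloud compute instances create ${node} --zone=${ZONE} --machine-type=n1-standard-8"
    # Blank SSD persistent disk of at least 140 GB for this node.
    echo "gcloud compute disks create ${node}-data --zone=${ZONE} --type=pd-ssd --size=140GB"
    echo "gcloud compute instances attach-disk ${node} --zone=${ZONE} --disk=${node}-data"
}

print_bucket_command() {
    # $1 is a hypothetical bucket name.
    echo "gsutil mb gs://$1"
}

for n in recoverx-1 recoverx-2 recoverx-3; do
    print_node_commands "$n"
done
print_bucket_command "my-recoverx-backups"
```

Printing the commands first keeps the sketch safe to run anywhere; once the output looks right for your project, the lines can be executed one at a time.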

Configuring the Cassandra Cluster

This is the database cluster that you want to protect using RecoverX.

Creating a Datos IO User on Each Cassandra Node

  1. Create a Datos IO user account, such as datos_db_user, on each Cassandra node. This account is used to run the commands that extract data from the source cluster, and it should have the same group ID (GID) as the Cassandra user. The following command assumes that the Cassandra user belongs to the cassandra group, and it adds datos_db_user to the same group. Replace [CASSANDRA_NODE_INSTANCE_NAME] with the name of your instance:
    gcloud compute ssh [CASSANDRA_NODE_INSTANCE_NAME] --command 'sudo useradd -g cassandra -m datos_db_user'
  2. Configure authentication for the user you created, using one of the following methods:
    • Username and password.
    • Username and SSH key with passphrase.
    • Username and SSH access key.
    You will need this information when you add the data source to the RecoverX environment.
  3. Give datos_db_user write permission to its home directory, /home/datos_db_user, on all Cassandra nodes. Replace [CASSANDRA_NODE_INSTANCE_NAME] with the name of your instance:
    gcloud compute ssh [CASSANDRA_NODE_INSTANCE_NAME] --command 'sudo chmod -R u+w /home/datos_db_user'
  4. Give the group read and execute permissions on the Cassandra data directory and its parent directory on all Cassandra nodes. Replace [CASSANDRA_NODE_INSTANCE_NAME] with the name of your instance:
    gcloud compute ssh [CASSANDRA_NODE_INSTANCE_NAME] --command 'sudo chmod -R g+rx /var/lib/cassandra; sudo chmod -R g+rx /var/lib/cassandra/data'
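
Because these steps must be repeated on every Cassandra node, a small loop can help. The sketch below is illustrative only: it prints the per-node commands instead of running them, and the node names are hypothetical. It assumes the default /var/lib/cassandra location used in the steps above.

```shell
#!/bin/sh
# Sketch: print the user-setup commands for each Cassandra node given as an argument.
# Nothing is executed; the node names used below are made up for illustration.
print_user_setup() {
    for node in "$@"; do
        printf "gcloud compute ssh %s --command 'sudo useradd -g cassandra -m datos_db_user'\n" "$node"
        printf "gcloud compute ssh %s --command 'sudo chmod -R u+w /home/datos_db_user'\n" "$node"
        # The recursive chmod on the parent directory also covers the data directory.
        printf "gcloud compute ssh %s --command 'sudo chmod -R g+rx /var/lib/cassandra'\n" "$node"
    done
}

print_user_setup cassandra-1 cassandra-2 cassandra-3
```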

Configuring Maximum SSH Sessions

On each node, edit /etc/ssh/sshd_config to set the sshd parameters MaxSessions to 500 and MaxStartups to "500:1:500". You can verify the values by running the following command, which should print the output shown:

/usr/sbin/sshd -T | grep -i maxs
maxsessions 500
maxstartups 500:1:500
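
The edit itself can be scripted. The following is a minimal sketch, not part of the RecoverX tooling: the hypothetical helper set_sshd_limits rewrites whichever file it is given, so it can be tried on a copy of sshd_config before touching the real one. It places the two settings at the top of the file, which relies on sshd using the first value it reads for each keyword.

```shell
#!/bin/sh
# Sketch: set MaxSessions and MaxStartups in an sshd_config-style file.
# The path is an argument so the function can be tested on a copy first.
set_sshd_limits() {
    conf="$1"
    # New settings go first in the rewritten file.
    printf 'MaxSessions 500\nMaxStartups 500:1:500\n' > "${conf}.new"
    # Copy every other line, dropping old (or commented-out) settings.
    grep -v -i -E '^[[:space:]]*#?[[:space:]]*(MaxSessions|MaxStartups)[[:space:]]' "$conf" >> "${conf}.new"
    # Overwrite in place so the original file's owner and mode are kept.
    cat "${conf}.new" > "$conf"
    rm -f "${conf}.new"
}
```

Run it as root against /etc/ssh/sshd_config only after checking the result on a copy, then restart sshd so that the new values take effect.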

Setting Up the Datos IO RecoverX Cluster

Setting Up Network Ports

Open the following ports:

Network                             Protocol:Port  Purpose
From External - To RecoverX         TCP:9090       Access Datos IO UI/API.
RecoverX nodes (network)            TCP:2888       Internal distributed software communication.
RecoverX nodes (network)            TCP:3888       Internal distributed software communication.
RecoverX nodes (network)            TCP:2181       Internal distributed software communication.
RecoverX nodes (network)            TCP:15039      RecoverX Metadata database.
RecoverX nodes (network)            TCP:5672       Internal messaging/communications (RabbitMQ).
RecoverX nodes (network)            SSH:22         For RecoverX nodes to communicate with each other.
From RecoverX - To Cassandra nodes  TCP:9042       Cassandra driver port.
From RecoverX - To Cassandra nodes  TCP:7199       Cassandra JMX port.
From RecoverX - To Cassandra nodes  SSH:22         For RecoverX to communicate with Cassandra database nodes.
Cassandra nodes (network)           TCP:7000       Cassandra storage port.
Cassandra nodes (network)           TCP:9160       Cassandra RPC port.
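
On Google Cloud Platform, one way to open these ports is with firewall rules. The sketch below is an assumption about how you might lay the rules out, not a Datos IO requirement: the network name, the rule names, and the recoverx and cassandra network tags are hypothetical, and [YOUR_IP_RANGE] stands for the address range you connect from. The script only prints the commands. The Cassandra-internal ports (7000 and 9160) are left out on the assumption that your existing cluster already allows them.

```shell
#!/bin/sh
# Sketch: print firewall-rule commands that correspond to the port list above.
# NETWORK, rule names, and tags are hypothetical; nothing is executed.
NETWORK="default"

print_firewall_rules() {
    # External access to the RecoverX UI/API.
    echo "gcloud compute firewall-rules create recoverx-ui --network=${NETWORK} --allow=tcp:9090 --source-ranges=[YOUR_IP_RANGE] --target-tags=recoverx"
    # Traffic between RecoverX nodes.
    echo "gcloud compute firewall-rules create recoverx-internal --network=${NETWORK} --allow=tcp:2888,tcp:3888,tcp:2181,tcp:15039,tcp:5672,tcp:22 --source-tags=recoverx --target-tags=recoverx"
    # Traffic from RecoverX nodes to Cassandra nodes.
    echo "gcloud compute firewall-rules create recoverx-to-cassandra --network=${NETWORK} --allow=tcp:9042,tcp:7199,tcp:22 --source-tags=recoverx --target-tags=cassandra"
}

print_firewall_rules
```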

Creating a Datos IO User on Each RecoverX Node

Create a Datos IO user account, such as datos_user, on each RecoverX node. This user should have the same group ID (GID) as datos_db_user. For example, if datos_db_user has GID 1001, datos_user must also have GID 1001.

  1. To get the GID, run:
    id datos_db_user
  2. Run the following command using the GID you retrieved:
    sudo groupadd -g [GID] cassandra
  3. Add the user:
    sudo useradd -g cassandra -m datos_user -d /home/datos_user

This user should have:

  • The home directory on the non-root volume that was previously created.
  • Passwordless SSH access to each RecoverX node in the cluster, including itself.
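
As a small convenience, id -g prints just the numeric GID, which avoids reading it out of the full id output. The helper names in this sketch are hypothetical, and the second function only prints the commands from the steps above with the GID filled in.

```shell
#!/bin/sh
# Sketch: helpers around the GID steps above; nothing here modifies the system.

# Print only the numeric primary GID of a user (run on a Cassandra node).
gid_of() {
    id -g "$1"
}

# Print the commands to run on a RecoverX node for the given GID.
print_recoverx_user_commands() {
    echo "sudo groupadd -g $1 cassandra"
    echo "sudo useradd -g cassandra -m datos_user -d /home/datos_user"
}

# 1001 is the example GID used in the text above.
print_recoverx_user_commands 1001
```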

Providing sudo Privileges

The datos_user on the RecoverX nodes must have sudo privileges for the following commands:

  • /sbin/chkconfig
  • /bin/cp

To add this privilege:

  1. Sign in as the root user.
  2. Use the visudo command to edit the configuration file for sudo access.
  3. Append the following line to the file:
    datos_user ALL=NOPASSWD: /sbin/chkconfig, /bin/cp

Configuring RecoverX Nodes

Make the following changes in the limits.conf file of all RecoverX nodes.

  1. Edit the nproc and nofile parameters in /etc/security/limits.conf to match the following:
    * hard nproc unlimited
    * soft nproc unlimited
    * hard nofile 64000
    * soft nofile 64000
    (The leading * is the limits.conf domain field and applies the limit to all users.)
  2. Edit the nproc parameter in /etc/security/limits.d/90-nproc.conf to match the following:
    * hard nproc unlimited
    * soft nproc unlimited
  3. Verify the changes above by running the following command:
    ulimit -a
  4. Make sure that the /tmp directory has at least 2 GB of free space on each RecoverX node.
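
The last two checks can be combined into a quick report. This is a minimal sketch with hypothetical function names; it only reads values and changes nothing. It reports the open-file limit and the /tmp free space, and leaves the nproc value to the ulimit -a output mentioned above.

```shell
#!/bin/sh
# Sketch: report the open-file limit and whether /tmp has at least 2 GB free.

# Available space on the filesystem holding /tmp, in kilobytes.
tmp_free_kb() {
    # With -P, df prints one line per filesystem; column 4 is the available space.
    df -Pk /tmp | awk 'NR==2 {print $4}'
}

check_node() {
    echo "open files (nofile): $(ulimit -n)"
    # 2 GB expressed in kilobytes: 2 * 1024 * 1024 = 2097152.
    if [ "$(tmp_free_kb)" -ge 2097152 ]; then
        echo "/tmp free space: OK"
    else
        echo "/tmp free space: less than 2 GB"
    fi
}

check_node
```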

Verifying Host Name Entry on RecoverX Nodes

Ensure that the short name and FQDN of the RecoverX node are included in its /etc/hosts file. For example, a RecoverX node with the hostname datosserver.dom.local and the IP address [IP_ADDRESS] would have a hosts file listing similar to the following:

cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
[IP_ADDRESS] datosserver datosserver.dom.local
::1 localhost6.localdomain6 localhost6
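
This check is easy to automate. The sketch below uses a hypothetical helper, hosts_has_names, which succeeds only if both names appear as host entries somewhere in the given file. It takes the file path as an argument so that it can be tried on a copy.

```shell
#!/bin/sh
# Sketch: succeed only if a hosts file lists both the short name and the FQDN.
hosts_has_names() {
    file="$1"; short="$2"; fqdn="$3"
    awk -v s="$short" -v f="$fqdn" '
        /^[[:space:]]*#/ { next }              # skip comment lines
        {
            # Field 1 is the address; the remaining fields are host names.
            for (i = 2; i <= NF; i++) {
                if ($i == s) found_s = 1
                if ($i == f) found_f = 1
            }
        }
        END { exit (found_s && found_f) ? 0 : 1 }
    ' "$file"
}

# Example call on a RecoverX node (not run here):
#   hosts_has_names /etc/hosts "$(hostname -s)" "$(hostname -f)" && echo "hosts file OK"
```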

Installing RecoverX Software

Follow the steps below to install RecoverX software. Be sure to use the datos_user name that you created earlier.

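  1. Copy the RecoverX compressed tarball to one of the compute nodes, then move it to the datos_user home directory. Replace [VERSION] and [RECOVERX_NODE_NAME] with the appropriate values:
    gcloud compute scp datos_[VERSION].tar.gz [RECOVERX_NODE_NAME]:~
    gcloud compute ssh [RECOVERX_NODE_NAME] --command 'sudo mv datos_[VERSION].tar.gz /home/datos_user; sudo chown datos_user /home/datos_user/datos_[VERSION].tar.gz'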
  2. Uncompress the tarball on the target node. A top-level directory called datos_[VERSION] should appear:
    tar -zxf datos_[VERSION].tar.gz
  3. Switch to the uncompressed Datos IO directory.
    cd datos_[VERSION]
  4. Create the target installation directory on all nodes where the RecoverX software will operate.
  5. Install the software in the target installation directory. Replace the [IP_ADDRESS#] values with the internal IP addresses of the instances:
    ./install_datos --ip-address [IP_ADDRESS1] [IP_ADDRESS2] [IP_ADDRESS3] --target-dir /home/datos_user/datosinstall
  6. Upon successful installation, a message similar to the following should appear:
    <timestamp> : INFO: Completed installation of datos software (version <version>) to location /home/datos_user/datosinstall

Accessing RecoverX Software

RecoverX has a consumer-grade graphical user interface accessible through a web-based console. To log into the console, follow these steps:

  1. Identify the primary RecoverX node by running the CLI command datos_status, located in the installation folder of any RecoverX node. The UI must be reached through the public IP address of the primary node.
  2. Use a web browser to connect to the console at the following URL. Replace [IP_ADDRESS] with the public IP address of the primary RecoverX node: https://[IP_ADDRESS]:9090/#/dashboard.
  3. At the login screen, enter the default username "admin" and the default password "admin". On successful login, the home page should appear.
  4. After logging in for the first time, change the password for the administrator account by clicking the Settings menu and choosing CHANGE PASSWORD.

Configuring RecoverX

After you have logged into the GUI, use the CONFIGURATION panels and follow the instructions on these panels to complete the configuration.

Cleaning Up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial...

Delete the Project

The easiest way to eliminate billing is to delete the project you created for the tutorial.

Note: If you are exploring multiple tutorials and quickstarts, reusing projects instead of deleting them prevents you from exceeding project quota limits.

To delete the project:

  1. In the Cloud Platform Console, go to the Projects page.
  2. Click the trash can icon to the right of the project name.

Delete Your Compute Engine Instances

To delete a Compute Engine instance:

  1. In the Cloud Platform Console, go to the VM Instances page.
  2. Click the checkbox next to the instance you want to delete.
  3. Click the Delete button at the top of the page to delete the instance.

Delete Your Cloud Storage Bucket

To delete a Cloud Storage bucket:

  1. In the Cloud Platform Console, go to the Cloud Storage browser.
  2. Click the checkbox next to the bucket you want to delete.
  3. Click the Delete button at the top of the page to delete the bucket.
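
The same cleanup can be done from the command line. The sketch below only prints the commands; the zone, instance names, and bucket name are hypothetical and must be replaced with the ones you created. Disks that were created separately may need to be deleted as well, for example with gcloud compute disks delete, depending on how they were attached.

```shell
#!/bin/sh
# Sketch: print (not run) cleanup commands. All names below are hypothetical.
ZONE="us-central1-a"

print_cleanup_commands() {
    bucket="$1"; shift
    # The remaining arguments are the instance names to delete.
    echo "gcloud compute instances delete $* --zone=${ZONE}"
    # rm -r on a bucket URL removes the objects and then the bucket itself.
    echo "gsutil rm -r gs://${bucket}"
}

print_cleanup_commands my-recoverx-backups recoverx-1 recoverx-2 recoverx-3
```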