
Hadoop in My Azure: Cloudera Distribution Hadoop on Azure


Here's your guide to CDH on Azure.


Recently I worked on a unique Hadoop project in which we deployed CDH (Cloudera Distribution for Hadoop) on Azure. The platform provides "Big Data as a Service" to the data scientists of a large organization, and at the time this combination had rarely been deployed anywhere. Sharing some of the knowledge and first-hand experience acquired on the project, this article aims to give you a gist of the steps required to install CDH on Azure.

Cloudera publishes a reference architecture document for Azure deployments online. You may want to refer to it alongside this article.

Reference Architecture

We used DS14-size CentOS 6.6 machines for all nodes in the cluster, and attached dedicated Premium Storage to each machine. The reason for using Premium Storage is that it provides the high I/O throughput (5,000+ IOPS per disk) needed by various data-science jobs. One major limitation of Premium Storage is that its disks cannot be added to the backup vault.

For more details about various machine types in Azure, please refer to Machine Types.

For more details about Premium storage, please refer to Premium Storage Details.

The architecture of the platform looks like this:

[Architecture diagram: CDH deployment on Azure]

Provisioning of Machines

The very first step is to set up a virtual network in Azure (the equivalent of a VPC) and configure it for aspects like Internet access, connectivity with other trusted networks, and access to other Azure services.

After that, we used the Azure command line to provision instances; the same can be done via the Azure Management Portal. Before provisioning machines, create an SSH key pair with which to log into the instances.
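A key pair can be created as sketched below. Note that the classic Azure CLI's `--ssh-cert` option expects an X.509 certificate rather than a raw public key, so we also generate a self-signed certificate from the private key; the filenames and the `CN` value here are illustrative assumptions, not fixed requirements.

```shell
# Generate an RSA key pair in PEM format (so openssl can read the private key).
ssh-keygen -t rsa -b 2048 -m PEM -f azure_rsa -N "" -q

# Pair the key with a self-signed certificate for the classic Azure CLI.
openssl req -x509 -new -days 365 -key azure_rsa \
    -out azure_cert.pem -subj "/CN=clouderaadmin"
```

The resulting `azure_cert.pem` is what you would pass to `--ssh-cert` during VM creation, while `azure_rsa` is the private key used for `ssh -i` later on.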

Install the Azure CLI (on Mac Yosemite):

brew install node
npm install -g azure-cli

Connect Account:

Log in to the portal and download the credentials (.publishsettings) file:

azure account download
azure account import <path-to-publishsettings-file>

Set the right account:

azure account list

There might be more than one account.

azure account set <account-name>

Define command line variables:

STEP 1: Create a storage account for each machine (through the Azure Portal):

export vmStorageAccountName=clouderaw1store
export vmStaticIP=XX.XX.XX.XX
export vmName=CLOUDERAW1

STEP 2: Create a container:

Find the Connection String in Azure Portal: Browse > Storage Account (classic) > clouderaw1store > Settings > Keys > Primary Connection String

azure storage container create --container "vhds" \
  --connection-string "<paste-connection-string-here>"

STEP 3: Create CentOS node:

azure vm create --vm-name ${vmName} \
  --virtual-network-name NETWORK1 \
  --blob-url <blob-url> \
  --static-ip ${vmStaticIP} \
  --userName clouderaadmin \
  --ssh <port-number> \
  --ssh-cert key.pub \
  --no-ssh-password \
  --vm-size Standard_DS14 \
  --availability-set WORKER_AVS \
  --connect cloudera-hadoop \
  "5112500ae3b842c8b9c604889f8753c3__OpenLogic-CentOS-66-20150706"

STEP 4: Attach disks – Multiple Disks:

azure vm disk attach-new --host-caching ReadOnly ${vmName} 512 https://${vmStorageAccountName}.blob.core.windows.net/vhds/${vmName}-application.vhd

azure vm disk attach-new --host-caching ReadOnly ${vmName} 1023 https://${vmStorageAccountName}.blob.core.windows.net/vhds/${vmName}-hadoop.vhd

azure vm disk attach-new --host-caching ReadOnly ${vmName} 1023 https://${vmStorageAccountName}.blob.core.windows.net/vhds/${vmName}-hadoop1.vhd

azure vm disk attach-new --host-caching ReadOnly ${vmName} 1023 https://${vmStorageAccountName}.blob.core.windows.net/vhds/${vmName}-hadoop2.vhd
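Since the three 1023 GB Hadoop-disk calls differ only in the VHD suffix, they can also be generated with a small loop. The sketch below only echoes each command for illustration; remove the leading `echo` to actually run them (the variable values match the earlier exports).

```shell
vmName=CLOUDERAW1
vmStorageAccountName=clouderaw1store

# Suffixes "", "1", "2" produce hadoop.vhd, hadoop1.vhd, hadoop2.vhd.
for suffix in "" 1 2; do
  echo azure vm disk attach-new --host-caching ReadOnly "${vmName}" 1023 \
    "https://${vmStorageAccountName}.blob.core.windows.net/vhds/${vmName}-hadoop${suffix}.vhd"
done
```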

STEP 5: Validate

Validate VM in Azure portal and perform SSH:

ssh -i <path-to-private-key> clouderaadmin@<ip-address> -p <port-number>

Each Azure instance has an OS disk (/dev/sda) whose purpose is to provide fast boot times; it should not be used for anything else. The second disk (/dev/sdb) is a temporary disk used for the Linux swap file. We can attach additional disks (up to 32 x 1 TB disks on DS14 machines) to each VM for storing log files, application data, Hadoop data, and so on; these appear as /dev/sdc, /dev/sdd, etc. We can check the read speed of a disk with hdparm, a Linux utility that quickly measures a hard drive's read speed.

sudo yum install hdparm

sudo hdparm -t /dev/sdc

Environment details:

• Gateway (stepping-stone) machine – 2 extra disks: 512 GB for applications and 1 TB for Hadoop data

• Worker machines – 4 extra disks: 512 GB for applications and 3 x 1 TB for Hadoop data

• Master machines – 2 extra disks: 512 GB for applications and 1 TB for the NameNode directory

Ansible and Cloudera Manager Installation

Once the machines were provisioned, we used an Ansible script to automate the following tasks for preparing the machines and setting up the environment:

• Disable SELinux

• Enable Swap on Azure CentOS

• Set Swappiness to 1

• Disable IPv6

• Partition Disks

• Format Disks

• Mount Disks

• Set up NTP, etc.
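The disk-related and tuning steps that Ansible automates for us can be sketched roughly as below. This is a dry-run illustration: `run()` only echoes each command, so replace it with direct execution on a real VM. The ext4 filesystem, the `/data/sdX` mount points, and the device list are assumptions for illustration, not the project's exact layout.

```shell
# Dry-run helper: print the command instead of executing it.
run() { echo "+ $*"; }

# Partition, format, and mount each attached data disk.
for dev in /dev/sdc /dev/sdd /dev/sde; do
  mnt="/data/$(basename "$dev")"
  run parted -s "$dev" mklabel gpt mkpart primary 0% 100%
  run mkfs.ext4 "${dev}1"
  run mkdir -p "$mnt"
  run mount "${dev}1" "$mnt"
done

# System tuning steps from the list above (runtime settings only;
# persisting them needs /etc/selinux/config and /etc/sysctl.conf edits).
run setenforce 0               # disable SELinux
run sysctl vm.swappiness=1     # set swappiness to 1
```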

For more details about installing Cloudera Manager via Ansible, please refer to CM via Ansible

Cloudera Manager 5 can be installed using the link below:


CDH can be installed by following the steps mentioned in below guide:


Another, easier way to install CDH on Azure is to use Azure templates.


The Azure CLI provided a convenient way to provision VMs, attach disks, and so on, while the Ansible scripts helped us prepare the machines for Hadoop quickly and in an automated way.

Microsoft and Cloudera are working hard to make this unique deployment combination a success. Hopefully, we will see more such Hadoop deployments on Azure in the near future.



Published at DZone with permission of
