
Build a Hadoop Cluster in AWS in Minutes

Check out this process that will let you get a Hadoop cluster up and running on AWS in two easy steps.



I use Apache Hadoop to process huge data loads. Setting up Hadoop on a cloud provider such as AWS involves spinning up a bunch of EC2 instances, configuring the nodes to talk to each other, installing software, editing the config files on the master and data nodes, and starting services.

This was a good use case for automation, since I wanted to solve the following problems:

  • How do I build the cluster in minutes (as opposed to hours, or even days, for a large number of data nodes)?
  • How do I save money? With AWS, I want to tear the cluster down when I'm not using it. If the setup is not automated, I have to spend extra time rebuilding it manually after every teardown.
  • Again, how do I save money? AWS provides managed services (such as Amazon EMR) to build a Hadoop cluster, but the choice of EC2 instance types is limited (for example, the t2.micro instance is not an option).

So, I decided to build a solution that would allow me to quickly set up a Hadoop cluster in AWS with any number of nodes in a matter of minutes (as opposed to days if I were to build it manually).

A fully tested, Python-based solution can be found here.

The solution can be summarized in two steps.

  1. Creating AWS Resources using CloudFormation.

  2. Provisioning Hadoop on EC2 resources.

Creating AWS Resources

AWS CloudFormation provides an easy way to create and manage a pool of AWS resources. Simply upload a JSON file that describes your AWS resources, and the CloudFormation stack (a collection of resources) is created. The snippet below is a subset of the full template used to create the Hadoop cluster.

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "AWS CloudFormation Template for Hadoop Cluster",
  "Parameters": {
  },

  "Resources": {
    "EC2Instance1": {
      "Type": "AWS::EC2::Instance",
      "Properties": {
        "InstanceType": "t2.micro",
        "SecurityGroups": [
          {
            "Ref": "HadoopSecurityGroup"
          }
        ],
        "KeyName": "testkey",
        "ImageId": "ami-a9d276c9",
        "Tags" : [
            {"Key" : "Name", "Value" : "namenode"},
        ]
      },
     },
    "HadoopSecurityGroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "GroupDescription": "Enable access for management (SSH), salt and hadoop. Port 22 is used for ssh, 4500-4600 is for salt master and minion, 8000-8100 is used by hdfs, 50000-51000 is used by hadoop processes",
        "SecurityGroupIngress": [
          {
            "IpProtocol": "tcp",
            "FromPort": "22",
            "ToPort": "22",
            "CidrIp" : "0.0.0.0/0"
          },
          {
            "IpProtocol": "tcp",
            "FromPort": "8000",
            "ToPort": "8100",
            "CidrIp" : "0.0.0.0/0"
          },
          {
            "IpProtocol": "tcp",
            "FromPort": "4500",
            "ToPort": "4600",
            "CidrIp" : "0.0.0.0/0"
          },
          {
            "IpProtocol": "tcp",
            "FromPort": "50000",
            "ToPort": "51000",
            "CidrIp" : "0.0.0.0/0"
          }
        ]
      }
    }
  }
}


Using CloudFormation in the AWS Management Console, a template like this can be uploaded to create a stack of Hadoop cluster resources.
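
The same step can also be scripted instead of clicking through the console. Below is a minimal sketch using boto3 (the current AWS SDK for Python; the linked project uses the older Boto library). The stack name, template file name, and region are illustrative assumptions, not values from the project.

import boto3

def create_hadoop_stack(stack_name="hadoop-cluster",
                        template_path="hadoop-cluster.json",
                        region="us-east-1"):
    """Create the CloudFormation stack from the JSON template above."""
    cfn = boto3.client("cloudformation", region_name=region)
    with open(template_path) as f:
        template_body = f.read()

    cfn.create_stack(StackName=stack_name, TemplateBody=template_body)

    # Block until the EC2 instances and security group are fully created.
    waiter = cfn.get_waiter("stack_create_complete")
    waiter.wait(StackName=stack_name)
    print("Stack %s created" % stack_name)

if __name__ == "__main__":
    create_hadoop_stack()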

Provisioning Hadoop on EC2 Resources

Once the AWS resources are created (typically a namenode, a secondary namenode, datanodes, and a security group), the next step is to install Hadoop and its dependencies and set up the master and slave processes. This automation can be fully accomplished using the following tools in a Python project.

  • SaltStack: Salt is used for configuration management. It is a distributed remote execution system used to apply commands on the Hadoop nodes.

  • Python libs: Boto is the Python-based SDK for all AWS operations. It is used to retrieve information about the AWS CloudFormation stack resources. Fabric provides a friendly way to encapsulate all operations into separate tasks and also provides a nice command-line interface to start execution (see the sketch after this list).
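
As a hedged illustration of the Boto part, the sketch below uses boto3 to map the stack's EC2 instances to their public IP addresses via the Name tag assigned in the CloudFormation template; the stack name and region are placeholders, not the project's actual values.

import boto3

def get_cluster_hosts(stack_name="hadoop-cluster", region="us-east-1"):
    """Return {node name: public IP} for the EC2 instances in the stack."""
    cfn = boto3.client("cloudformation", region_name=region)
    ec2 = boto3.resource("ec2", region_name=region)

    hosts = {}
    for res in cfn.describe_stack_resources(StackName=stack_name)["StackResources"]:
        if res["ResourceType"] != "AWS::EC2::Instance":
            continue
        instance = ec2.Instance(res["PhysicalResourceId"])
        # The "Name" tag (e.g. "namenode") identifies the node's role in the cluster.
        name = next((t["Value"] for t in instance.tags or [] if t["Key"] == "Name"),
                    res["LogicalResourceId"])
        hosts[name] = instance.public_ip_address
    return hosts

if __name__ == "__main__":
    print(get_cluster_hosts())

The provisioning tasks can then use a mapping like this as their host list when running commands against the nodes.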

The following diagram illustrates the various nodes involved in the cluster.

[Diagram: nodes in the Hadoop cluster]

And there you have it! Once that step completes, all of the following operations have been performed (a Fabric-style sketch follows the list):

  • Installed Salt

  • Granted access to Hadoop nodes

  • Installed Java and Hadoop

  • Deployed Hadoop Config and set up nodes as Hadoop masters and data nodes

  • Started Hadoop services
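
To make the flow concrete, here is a minimal, Fabric 1.x-style sketch of how those operations could be driven once Salt is bootstrapped. The task names, Salt state name, user, key file, and paths are assumptions for illustration only, not the names used in the linked project.

from fabric.api import env, sudo, task

env.user = "ec2-user"              # assumption: default user for the chosen AMI
env.key_filename = "testkey.pem"   # key pair referenced in the CloudFormation template

@task
def provision_hadoop():
    # Apply a Salt state (assumed here to be called "hadoop") on every minion
    # to install Java and Hadoop and push the master/slave config files.
    sudo("salt '*' state.sls hadoop")

@task
def start_services():
    # Start HDFS from the namenode; Salt targets it by its minion id.
    sudo("salt 'namenode' cmd.run '/usr/local/hadoop/sbin/start-dfs.sh' runas=hadoop")

Running something like fab -H <salt-master-ip> provision_hadoop start_services would then execute both tasks against the Salt master.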
