Tweaking Ganglia For Your Hadoop Cluster

It's very important to monitor all the machines in a cluster for OS health, bottlenecks, performance hits, and so on. Numerous tools spit out huge numbers of graphs and statistics, but for administering the cluster, only the prominent stats that seriously affect its performance or health should be surfaced.

Ganglia fits the bill

    Ganglia is a scalable, distributed monitoring system for high-performance computing systems, such as clusters and grids. It is based on a hierarchical design targeted at federations of clusters. It relies on a multicast-based listen/announce protocol to monitor state within clusters, and it uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and very high concurrency.
    The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on more than 500 clusters around the world. It has been used to link clusters across university campuses internationally and can scale to handle clusters with 2000 nodes.

Here is some basic Ganglia terminology to know before getting started:

1. Node: Generally, a single machine (typically with a 1-4 core processor) dedicated to performing a specific task.
2. Cluster: A cluster consists of a group of nodes.
3. Grid: A grid consists of a group of clusters.

Heading towards Ganglia...

    The Ganglia system is composed of two unique daemons, a PHP-based web frontend, and a few other small utility programs.

Ganglia Monitoring Daemon (gmond)

    Gmond is a multi-threaded daemon that runs on each cluster node you want to monitor. It's responsible for monitoring changes in the host state; announcing relevant changes; listening to the state of all other ganglia nodes via a unicast or multicast channel; and answering requests for an XML description of the cluster state. Each gmond transmits information in two different ways: unicasting/multicasting the host state in external data representation (XDR) format using UDP messages or sending XML over a TCP connection.
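A quick way to see the TCP side of this once gmond is running (set up below) is to pull the XML state and count the hosts it reports. This is only a hedged sketch: it assumes netcat (nc) is installed and that gmond is answering on its default TCP port, 8649.

# Dump the XML cluster state served over TCP; each monitored host appears as a <HOST ...> element.
$ nc Master 8649 | grep -c "<HOST "

The count should roughly match the number of nodes reporting into the cluster.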

Ganglia Meta Daemon (gmetad)

    Ganglia Meta Daemon ("gmetad") periodically polls a collection of child data sources, parses the collected XML, saves all numeric, volatile metrics to round-robin databases, and exports the aggregated XML over TCP sockets to clients. Data sources may be either "gmond" daemons, representing specific clusters, or other "gmetad" daemons, representing sets of clusters. Data sources use source IP addresses for access control and can be specified using multiple IP addresses for failover. The latter capability is natural for aggregating data from clusters because each "gmond" daemon contains the entire state of its particular cluster.
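As a hedged illustration of the failover capability mentioned above, a data_source line in gmetad.conf may list a polling interval and more than one address for the same cluster. The 15-second interval and the Slave1 fallback below are assumptions for illustration, not part of this tutorial's actual configuration (which appears later):

# gmetad.conf sketch: poll "HadoopCluster" every 15 seconds, trying Master first
# and falling back to Slave1 if Master is unreachable.
data_source "HadoopCluster" 15 Master:8649 Slave1:8649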

Ganglia PHP Web Frontend

    The Ganglia web frontend gives system administrators a view of the gathered information via real-time dynamic web pages.

Setting up Ganglia

    Assume the following:
- Cluster: "HadoopCluster"
- Nodes: "Master", "Slave1", "Slave2". (Only three nodes are considered for this example; more nodes/slaves can be configured the same way.)
- Grid: "Grid" consists of "HadoopCluster" for now.

    gmond must be installed on all the nodes, i.e., "Master", "Slave1", and "Slave2". gmetad and the web frontend will be on "Master", so that's where we'll view all the statistics. (A dedicated server could host the web frontend instead.)

Step 1: Install gmond on "Master", "Slave1", and "Slave2".
    Ganglia can be installed by downloading the tar.gz, extracting, configuring, making, and installing. But why reinvent the wheel? Let's install it from the repository instead.

OS: Ubuntu 11.04
Ganglia version 3.1.7
Hadoop version CDH3 hadoop-0.20.2

Update your repository packages:
$ sudo apt-get update


Installing the dependencies:
$ sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev

Installing gmond:
$ sudo apt-get install ganglia-monitor


Making changes in /etc/ganglia/gmond.conf: 
$ sudo vi /etc/ganglia/gmond.conf


/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {                   
  daemonize = yes             
  setuid = no
  user = ganglia             
  debug_level = 0              
  max_udp_msg_len = 1472       
  mute = no            
  deaf = no            
  host_dmax = 0 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no            
  send_metadata_interval = 0    
}

/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
 * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
 * NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "HadoopCluster"
  owner = "Master"
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  host = Master
  port = 8650
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
#  mcast_join = 239.2.11.71
  port = 8650
#  bind = 239.2.11.71
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
} ...

About the configuration changes:
    Check the globals once. In cluster {}, change the name from "unspecified" to "HadoopCluster" and change owner to "Master" (this can be your organization/admin name); latlong and url can be set according to your location, though there's no harm in leaving them unspecified.
    As noted above, gmond communicates via UDP messages or by sending XML over a TCP connection. Let's make this clear:

udp_send_channel {
  host = Master
  port = 8650
  ttl = 1
...

This means the receiving endpoint will be "Master", where Master is a hostname with an associated IP address. Add all the hostnames to /etc/hosts with their respective IPs. The port on which it accepts is 8650.
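For example, the /etc/hosts entries on every node might look like the following; the IP addresses are placeholders, so substitute your own:

# /etc/hosts (illustrative addresses only)
192.168.1.10   Master
192.168.1.11   Slave1
192.168.1.12   Slave2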
Since gmond is now configured on "Master", the UDP receive channel also listens on port 8650:

udp_recv_channel {
#  mcast_join = 239.2.11.71
  port = 8650
#  bind = 239.2.11.71
}

The XML description of the cluster state (Hadoop metrics, system metrics, etc.) is served on port 8649:

tcp_accept_channel {
  port = 8649
} ..


Starting Ganglia:
$ sudo /etc/init.d/ganglia-monitor start
$ telnet Master 8649
The output should contain the cluster state in XML format.

This means gmond is up. You can also check the process:
$ ps aux | grep gmond
It should show the gmond process running.

Installing gmond on the slave machines works the same way, with the same gmond.conf, as seen below:

/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes
  setuid = no
  user = ganglia
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  host_dmax = 0 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  send_metadata_interval = 0
}

/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
 * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
 * NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "HadoopCluster"
  owner = "Master"
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  host = Master
  port = 8650
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
#  mcast_join = 239.2.11.71
  port = 8650
#  bind = 239.2.11.71
}

/* You can specify as many tcp_accept_channels as you'd like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
}....


Starting Ganglia:
$ sudo /etc/init.d/ganglia-monitor start
$ telnet Slave1 8649
The output should contain the cluster state in XML format.

This means gmond is up. Check the process:
$ ps aux | grep gmond
It should show the gmond process running.


Step 2: Install gmetad on "Master":
$ sudo apt-get install ganglia-webfrontend


Installing the dependencies:
$ sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev librrd2-dev

Making the required changes in /etc/ganglia/gmetad.conf:
data_source "HadoopCluster"  Master
gridname "Grid"
setuid_username "ganglia"

The data_source line specifies the cluster name, "HadoopCluster", and "Master" as the sole node from which all metrics and statistics are collected.
The gridname is set to "Grid" for now.
gmetad runs as the "ganglia" user.
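If the grid later grows beyond one cluster, each cluster would get its own data_source line pointing at a collecting gmond in that cluster. The second line below is purely hypothetical and only illustrates how "Grid" would aggregate multiple clusters:

data_source "HadoopCluster"  Master
# Hypothetical second cluster in the same grid:
# data_source "AnotherCluster"  AnotherMaster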

Check for the /var/lib/ganglia directory. If it doesn't exist, create it:
$ sudo mkdir -p /var/lib/ganglia/rrds/
and then change its ownership:
$ sudo chown -R ganglia:ganglia /var/lib/ganglia/
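As an optional sanity check, you can confirm the ownership took effect:

$ ls -ld /var/lib/ganglia/rrds/
The owner and group shown should both be ganglia.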

Running gmetad:
$ sudo /etc/init.d/gmetad start

You can stop it with:
$ sudo /etc/init.d/gmetad stop
and run it once in debug mode to verify it works:
$ gmetad -d 1

Now, restart the daemon:
$ sudo /etc/init.d/gmetad restart


Step 3: Install the PHP web frontend's dependent packages on "Master":
$ sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev librrd2-dev

Check for the /var/www/ganglia directory and restart apache2:
$ sudo /etc/init.d/apache2 restart

Time to hit the web URL:
http://Master/ganglia
In general, http://<hostname>/ganglia/

You should be able to see some graphs.
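If you prefer to verify from the shell first (assuming curl is installed), a quick probe of the frontend should return an HTTP 200:

$ curl -sI http://Master/ganglia/ | head -1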


Step 4: Configure Hadoop metrics with Ganglia.
On Master (NameNode, JobTracker):
$ sudo vi /etc/hadoop-0.20/conf/hadoop-metrics.properties

# Configuration of the "dfs" context for null
#dfs.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "dfs" context for file
#dfs.class=org.apache.hadoop.metrics.file.FileContext
#dfs.period=10
#dfs.fileName=/tmp/dfsmetrics.log

# Configuration of the "dfs" context for ganglia
 dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 dfs.period=10
 dfs.servers=Master:8650

# Configuration of the "dfs" context for /metrics
#dfs.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext


# Configuration of the "mapred" context for null
#mapred.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "mapred" context for /metrics
mapred.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

# Configuration of the "mapred" context for file
#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log

# Configuration of the "mapred" context for ganglia
 mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 mapred.period=10
 mapred.servers=Master:8650


# Configuration of the "jvm" context for null
#jvm.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "jvm" context for /metrics
jvm.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

# Configuration of the "jvm" context for file
#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log

# Configuration of the "jvm" context for ganglia
 jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 jvm.period=10
 jvm.servers=Master:8650

# Configuration of the "rpc" context for null
#rpc.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "rpc" context for /metrics
rpc.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

# Configuration of the "rpc" context for file
#rpc.class=org.apache.hadoop.metrics.file.FileContext
#rpc.period=10
#rpc.fileName=/tmp/rpcmetrics.log

# Configuration of the "rpc" context for ganglia
 rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 rpc.period=10
 rpc.servers=Master:8650

Hadoop exposes all of its metrics to Ganglia through the GangliaContext31 class. Restart the Hadoop services on Master.
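On CDH3, the packaged init scripts can be used for this restart. The exact service names below are an assumption based on CDH3 packaging (the same convention as the tasktracker restart shown later); adjust them to whatever daemons actually run on your Master:

$ sudo service hadoop-0.20-namenode restart
$ sudo service hadoop-0.20-jobtracker restart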

Restart gmond and gmetad, then check again:
$ telnet Master 8649

The output will now include Hadoop's metrics in XML.
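As a rough sanity check (assuming netcat is available), you can count how many metrics gmond now reports; the number should jump noticeably once the Hadoop contexts start emitting:

$ nc Master 8649 | grep -c "METRIC NAME"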

On Slave1 (SecondaryNameNode, DataNode, TaskTracker):
$ sudo gedit /etc/hadoop-0.20/conf/hadoop-metrics.properties

# Configuration of the "dfs" context for file
#dfs.class=org.apache.hadoop.metrics.file.FileContext
#dfs.period=10
#dfs.fileName=/tmp/dfsmetrics.log

# Configuration of the "dfs" context for ganglia
 dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 dfs.period=10
 dfs.servers=Slave1:8650

# Configuration of the "dfs" context for /metrics
#dfs.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext


# Configuration of the "mapred" context for null
#mapred.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "mapred" context for /metrics
 mapred.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

# Configuration of the "mapred" context for file
#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log

# Configuration of the "mapred" context for ganglia
 mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 mapred.period=10
 mapred.servers=Slave1:8650

# Configuration of the "jvm" context for null
#jvm.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "jvm" context for /metrics
jvm.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

# Configuration of the "jvm" context for file
#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log

# Configuration of the "jvm" context for ganglia
 jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 jvm.period=10
 jvm.servers=Slave1:8650

# Configuration of the "rpc" context for null
#rpc.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "rpc" context for /metrics
rpc.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

# Configuration of the "rpc" context for file
#rpc.class=org.apache.hadoop.metrics.file.FileContext
#rpc.period=10
#rpc.fileName=/tmp/rpcmetrics.log

# Configuration of the "rpc" context for ganglia
 rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 rpc.period=10
 rpc.servers=Slave1:8650


Restart the tasktracker with this command:
$ sudo service hadoop-0.20-tasktracker restart
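Since Slave1 also runs the DataNode and SecondaryNameNode in this layout, those daemons presumably need a restart too so they pick up the new metrics configuration. The service names below assume CDH3 packaging:

$ sudo service hadoop-0.20-datanode restart
$ sudo service hadoop-0.20-secondarynamenode restart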

Restart gmond and verify:
$ sudo /etc/init.d/ganglia-monitor restart
$ telnet Slave1 8649
The output will now include Hadoop's metrics in XML.

The same procedure applies to Slave2: replace Slave1 with Slave2 in hadoop-metrics.properties and repeat the steps of restarting the Hadoop and gmond services.
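If you have many slaves, one possible shortcut, sketched here under the assumption of passwordless SSH to each slave and that you run it from a directory containing Slave1's hadoop-metrics.properties, is to rewrite the gmond target host per slave and copy the result over. You would still move each file into /etc/hadoop-0.20/conf/ with sudo on the slave and restart its daemons:

$ for h in Slave2; do sed "s/Slave1:8650/$h:8650/g" hadoop-metrics.properties > hadoop-metrics.$h; scp hadoop-metrics.$h $h:/tmp/hadoop-metrics.properties; done
(Add further slave hostnames to the loop as the cluster grows.)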

On the Master, restart gmond and gmetad with this:
$ sudo /etc/init.d/ganglia-monitor restart
$ sudo /etc/init.d/gmetad restart

Hit this web URL:
http://Master/ganglia
Check for the metrics, the grid, the cluster, and all the nodes you configured.
You can also see the number of hosts up, hosts down, and total CPUs.

Enjoy monitoring your cluster! :)
