
How to Properly Collect AWS EMR Metrics


Working with AWS EMR has a lot of benefits. But, when it comes to metrics, AWS currently does not supply a proper solution for collecting cluster metrics from EMRs.

There is AWS CloudWatch, of course, which works out of the box and gives you loads of EMR metrics. The problem with CloudWatch is that it doesn't let you follow metrics per business unit or tag, only per specific EMR ID. This means you cannot compare the metrics over time; you can only see them for specific EMRs.
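To illustrate the limitation, querying an EMR metric in CloudWatch requires the specific cluster's ID as the `JobFlowId` dimension, so each new cluster starts a fresh metric series. A sketch (the cluster ID below is a made-up placeholder, and the call needs AWS credentials):

```shell
# Fetch one EMR metric for one specific cluster ID; a re-spawned cluster
# gets a new JobFlowId and therefore a separate, disconnected series.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ElasticMapReduce \
  --metric-name CoreNodesRunning \
  --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
  --start-time 2018-01-01T00:00:00Z \
  --end-time 2018-01-02T00:00:00Z \
  --period 300 \
  --statistics Average
```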

Let me explain the problem again. A common use of EMR is that you write some code that is executed inside an EMR and is triggered at a given interval, let's say every five hours.

This means that every five hours, a new EMR, with a new ID, is spawned. In CloudWatch, you can see each of these EMRs individually, but not in a single graph, which is a huge disadvantage.

Just to note, I am referring only to machine metrics, like memory, CPU, and disk. Other metrics, like JVM metrics or business metrics, are usually collected by the process itself and, obviously, can be collected over time per business unit.

Another problem is that some of these metrics incur extra cost in order to be collected and displayed by CloudWatch.

I found a nice and easy solution to this problem. I wrote a small script that, at a given interval, collects metrics from the machine it runs on and sends them to a Graphite host. This script should be added to the EMR as a bootstrap action, so that all of the cluster machines send their metrics to Graphite. Since the script uses the same namespace every time, the resulting graph shows you not only the metrics for the current execution, but also the history of previous executions.
I used Graphite because that is what we prefer at my job, but the same solution could easily target other APIs instead, such as the AWS CloudWatch API.
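For context, Graphite's plaintext protocol expects one line per data point, in the form `metric.path value timestamp`, which is exactly what the script below builds. A minimal sketch (the namespace `myteam.etl` and the value are made up for illustration):

```shell
# Graphite plaintext protocol: "<metric path> <value> <unix timestamp>"
NAMESPACE="myteam.etl"    # hypothetical namespace prefix
VALUE=2048                # e.g. free memory in MB
LINE="${NAMESPACE}.master_free_memory ${VALUE} $(date +%s)"
echo "$LINE"
# To actually send it, pipe the line to the Graphite host, e.g.:
#   echo "$LINE" | nc metrics.mydomain.com 2003
```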

#!/bin/bash
# Usage: <script> <namespace-prefix> <interval-seconds>
# Collects machine metrics and sends them to Graphite in the plaintext protocol.

export GRAPHITE_HOST="metrics.mydomain.com"
export GRAPHITE_PORT=2003

# Look up the cluster's "Tier" tag (lower-cased) from the EMR cluster metadata.
export TIER=$(aws emr describe-cluster --cluster-id $(sudo cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId") --query Cluster.Tags | jq -r -c '.[] | select(.Key | contains("Tier"))? | .Value' | tr '[:upper:]' '[:lower:]') || exit 1

# Determine whether this machine is the master node.
export IS_MASTER=$(cat /mnt/var/lib/info/instance.json | jq -r ".isMaster") || exit 1

if [[ $IS_MASTER == "true" ]]; then
   export namespace_prefix="master"
else
   export namespace_prefix="nodes"
fi

send_loop()
{
   # $1 = namespace prefix, $2 = interval in seconds
   while :
   do
      # Free memory in MB: sums the free, buffers, and cached columns of `free -m`.
      echo "${1}.${namespace_prefix}_free_memory `free -m | awk -v RS="" '{print $10 "+" $17 "+" $21}' | bc` `date +%s`" | nc ${GRAPHITE_HOST} ${GRAPHITE_PORT}
      # CPU utilization: user + system time from the `top` summary line.
      echo "${1}.${namespace_prefix}_cpu_utilization `top -b -n1 | grep "Cpu(s)" | awk '{print $2 + $4}' | bc` `date +%s`" | nc ${GRAPHITE_HOST} ${GRAPHITE_PORT}
      # Free disk space on the root partition.
      echo "${1}.${namespace_prefix}_free_disk `df --output=avail / | grep -v Avail | bc` `date +%s`" | nc ${GRAPHITE_HOST} ${GRAPHITE_PORT}
      sleep ${2}
   done
}

# Run in the background so the bootstrap action returns immediately.
send_loop $1 $2 &


This script currently sends three types of metrics — available memory, CPU usage, and free disk space.

Those metrics are aggregated separately for the master node and the worker nodes, so we actually have six metrics here.

The script takes two parameters:

  1. A namespace prefix. This prefix is attached to each of the three metrics, for both the master and the worker nodes. So, if the namespace is a.b, the metrics sent will be:
     a.b.master_free_memory
     a.b.master_cpu_utilization
     a.b.master_free_disk
     a.b.nodes_free_memory
     a.b.nodes_cpu_utilization
     a.b.nodes_free_disk

  2. The frequency, in seconds, at which the metrics are collected and sent to Graphite.
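Under that naming scheme, the metric paths can be derived mechanically. A quick sketch, simulating a namespace of a.b on the master node:

```shell
# Simulate the script's two positional parameters: namespace prefix and interval.
set -- "a.b" 60
namespace_prefix="master"    # would be "nodes" on a worker machine
metrics=""
# Build the three metric paths the same way the send loop does.
for metric in free_memory cpu_utilization free_disk; do
  metrics="${metrics}${1}.${namespace_prefix}_${metric} "
done
echo "$metrics"
```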

How to Use the Script

  1. Copy it and set the GRAPHITE_HOST value inside the script.

  2. Upload it to S3.

  3. Add it as a bootstrap action to your EMRs.

  4. Set the two input parameters mentioned above.
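The steps above can be sketched as a single `create-cluster` call. The bucket, script name, namespace, and cluster settings below are all hypothetical, and the call needs AWS credentials:

```shell
# Launch an EMR cluster with the metrics script as a bootstrap action.
# Args passes the two script parameters: namespace prefix and interval in seconds.
aws emr create-cluster \
  --name "my-periodic-job" \
  --release-label emr-5.20.0 \
  --use-default-roles \
  --instance-type m4.large \
  --instance-count 3 \
  --bootstrap-actions Path="s3://my-bucket/emr_metrics.sh",Args=["myteam.etl","60"]
```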

The result is a single Graphite graph per metric that shows the history across all EMR executions.


Published at DZone with permission of
