
Hadoop User Metrics Unleashed


One of the most important jobs of a Product Manager is measuring product usage. That tenet gets blurry when your Hadoop platform is the product. There are plenty of operational metrics available for job stats, errors, compute resources, and storage that can help you determine the health of the platform and, in turn, infer user experience. For product managers at consumer web/mobile companies, the same principles are much better defined: for the consumer products I have worked with, the metrics usually revolve around user activity and engagement, revenue, conversion, and retention.

In this post, I want to address some easy ways of obtaining user metrics on your Hadoop platform to analyze usage patterns and help you build your product roadmap based on that usage.

The specifics could vary across Hadoop distributions, but I am using Cloudera's Hadoop distribution. Cloudera also has an excellent tool called Cloudera Navigator "to configure, collect, and view audit events, to understand who accessed what data and how." This may be adequate for most needs and makes viewing and auditing your platform usage much easier. Cloudera Navigator is part of Cloudera Manager, which has a robust set of APIs for integrating with existing monitoring tools. Cloudera Manager also comes with configurable dashboards for every metric imaginable, with near-real-time tracking.

Using the Cloudera Manager API

For Product Managers who want more than what Navigator offers, such as customized metrics and rich visualizations for executive dashboards, the Cloudera Manager API (CM API) is a wonderful tool toward that goal. The REST API provides a rich suite of metrics that you can aggregate and roll up based on the KPIs you want to measure. As the documentation suggests, all endpoints act on a common set of data, and calls return JSON objects you can consume in a variety of interfaces.

In this basic example, let's query Cloudera Manager for stats on users who have run jobs via YARN on our Hadoop cluster within a given time range. Using Python, we will call the API with a time range, specify the number of results we want back, and write out the JSON it returns.

Note: For this article, I am using Python 2.7.

Let's take one of the endpoints, 'yarnApplications'. It returns attributes of the YARN applications run by users. The key properties are applicationId, application name, startTime, endTime, user, pool (the pool the application was submitted to), and other attributes in a map, as exposed by the API. To begin, create a file named 'yarnmetrics.py', open it, and type in the following.

1)  Import the requisite libraries.

import sys, urllib2, json
from datetime import datetime, timedelta

2)  Set a limiter variable for the specific number of records you want to return. By default the maximum a single call can return is 1000.
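Since a single call caps out at 1,000 records, longer windows can be paged by combining limit with the API's offset parameter. A minimal sketch of building the paged URLs (the host, port, and cluster/service names below are placeholders, and the query-parameter names follow the CM API documentation):

```python
def page_urls(base, from_time, to_time, total, page=1000):
    # Build one URL per page of up to `page` records, stepping the offset.
    urls = []
    for offset in range(0, total, page):
        urls.append('{0}?from={1}&to={2}&limit={3}&offset={4}'.format(
            base, from_time, to_time, page, offset))
    return urls

# Placeholder endpoint; substitute your own Cloudera Manager host and names.
base = 'http://cm-host:7180/api/v7/clusters/<clustername>/services/<servicename>/yarnApplications'
urls = page_urls(base, '2018-01-01T09:00:00', '2018-01-01T17:00:00', 2500)
# 2,500 expected records at 1,000 per page yields three URLs (offsets 0, 1000, 2000).
```

This works on both Python 2 and 3 and keeps each request under the per-call maximum.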

3)  We can also provide the time slice for which we want to extract data from the API. For example, we can specify a time range based on offsets from the current time. In this example, I would like the API to return results between 9 AM and 5 PM of the previous day.

cur_time = datetime.now() - timedelta(days=1)
to_time = cur_time.replace(hour=17, minute=0, second=0, microsecond=0).isoformat()
from_time = cur_time.replace(hour=9, minute=0, second=0, microsecond=0).isoformat()

4)  Pass an argument to the file when invoking it. In this case we will be passing the argument that tells the script which API endpoint to call.

limiter = 20
metric = sys.argv[1]

5)  The endpoint of the API has the format: /api/v7/clusters/{clusterName}/services/{serviceName}/yarnApplications

This call will depend on the configuration of the Cloudera Manager running on your cluster. For more on automating these services and roles and for metrics, go here: http://cloudera.github.io/cm_api/docs/python-client/

Here, create a condition that executes the call below if "applications" is provided as the argument. The from_time and to_time should be provided in ISO format, with the limiter as set in step 2.

if metric == "applications":
    url = ('http://clouderamanagerendpoint:port/api/v7/clusters/<clustername>'
           '/services/<servicename>/yarnApplications'
           '?from={0}&to={1}&limit={2}&filter=executing=false'
           .format(from_time, to_time, limiter))

6) Based on your installation of Cloudera Manager, you will also need to authenticate to the server. For that, we set up a function that encodes the credentials using base64. Note: there are better ways of handling credentials depending on your security environment; for brevity, I am providing a simple example below. For more on this, see: https://docs.python.org/release/3.1.3/library/base64.html

def encode_credentials(user, password):
    return "Basic " + (user + ":" + password).encode("base64").rstrip()
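The "base64" string codec used above is Python 2-only. For reference, an equivalent helper built on the standard base64 module (which also works on Python 3) might look like this:

```python
import base64

def encode_credentials(user, password):
    # Build the value of an HTTP Basic Authorization header.
    token = base64.b64encode((user + ":" + password).encode("utf-8"))
    return "Basic " + token.decode("ascii")

print(encode_credentials("admin", "secret"))  # Basic YWRtaW46c2VjcmV0
```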

7)  Now we are ready to make the API call. We pass our encoded data as part of the request.

8)  Make the request and print the results and write to a JSON outfile.

req = urllib2.Request(url)
req.add_header('Accept', 'application/json')
req.add_header('Authorization', 'Basic fsfadgibberishsdfdfsfF=')  # replace with your encoded credentials from step 6
response = json.load(urllib2.urlopen(req))
with open('yarnmetrics.json', 'w') as outfile:
    json.dump(response, outfile)

9)  At this point, we have a JSON file with information on all the jobs that ran within our stipulated time range.

You can run the file using the command: python yarnmetrics.py applications

This data can now be mined into the visualization tool of your choice.

For example, you could query the impalaQueries API to generate the list of users who are accessing your Impala database.

Or you could use the above YARN API to compute daily usage.
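As a sketch of that last idea, here is how the saved JSON could be rolled up into per-user job counts. The response shape below (a top-level "applications" list whose entries carry a "user" field) follows the CM API docs, but the sample values are made up for illustration:

```python
import json
from collections import Counter

# Illustrative sample of a yarnApplications response saved by yarnmetrics.py;
# field names follow the CM API, values are fabricated.
sample = {"applications": [
    {"applicationId": "app_1", "user": "alice", "pool": "etl"},
    {"applicationId": "app_2", "user": "bob",   "pool": "adhoc"},
    {"applicationId": "app_3", "user": "alice", "pool": "etl"},
]}

jobs_per_user = Counter(app["user"] for app in sample["applications"])
print(jobs_per_user.most_common())  # [('alice', 2), ('bob', 1)]
```

The same roll-up per day (keyed on the startTime date) would give you the daily-usage series to feed a dashboard.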

At this point, it would be easy to automate such functionality using tools like Oozie or cron so that you have this data available at whatever interval you specify. You can also write these JSON structs directly to the Hadoop layer if you need automated dashboards or user-driven queries to analyze this data. A sample workflow could be to write these JSON files to a Hive table, which could then be accessed by a visualization tool to render dashboards.
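For that Hive step, one common approach (assuming a JSON SerDe is configured on the Hive side) is to flatten the API response into newline-delimited JSON before landing it in the table's HDFS directory. A minimal sketch, with a hypothetical output path:

```python
import json

def to_ndjson(response, out_path):
    # Write one JSON object per line so a Hive JSON SerDe can read each record.
    with open(out_path, "w") as out:
        for app in response.get("applications", []):
            out.write(json.dumps(app) + "\n")

# Hypothetical usage with the file produced by yarnmetrics.py:
# to_ndjson(json.load(open("yarnmetrics.json")), "yarn_apps.ndjson")
```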


