MapReduce and Yarn Part 2: Hadoop Processing Unit

In this article, we discuss fundamental concepts behind YARN, including its architectural components (Resource and Node Managers) and its workflow.

In my previous article, we learned about MapReduce. In this one, we will focus on YARN, which enhances the power of Hadoop. YARN is not a competitor of MapReduce but a framework that helps Hadoop perform better. It was introduced with Hadoop 2 and is sometimes referred to as MapReduce 2 (MRv2).

Hadoop 1.0 vs Hadoop 2.0

YARN (Yet Another Resource Negotiator)

There were some major issues in the original MapReduce design, such as the centralized handling of job control flow and the tight coupling of the programming model with the resource management infrastructure. With YARN, Hadoop is open to third-party plugins and can process other data sources. YARN splits the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The main components of YARN are:

Single vs Multi-Purpose Platform

Resource Manager

The Resource Manager is responsible for distributing cluster resources and decides how the available resources are allocated to applications. It aims to keep the cluster's resources in use, which improves system utilization. The Scheduler in the Resource Manager is responsible for partitioning the cluster among the various applications. The Applications Manager is responsible for accepting job submissions and for restarting the Application Master container in case of failure.
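To make the Scheduler's role concrete, here is a minimal sketch of how a cluster might be partitioned among applications by memory. It is a toy model, not YARN's API: the `Node` and `ToyScheduler` names and the first-fit policy are assumptions made for illustration, and real YARN schedulers (such as the Capacity and Fair schedulers) use queues and richer policies.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_mb: int  # memory still available on this node


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for the Scheduler inside the Resource Manager."""
    nodes: list
    allocations: list = field(default_factory=list)

    def allocate(self, app_id: str, mem_mb: int):
        """Grant mem_mb on the first node with enough free memory, else None."""
        for node in self.nodes:
            if node.free_mb >= mem_mb:
                node.free_mb -= mem_mb
                grant = (app_id, node.name, mem_mb)
                self.allocations.append(grant)
                return grant
        return None  # no node can satisfy a request of this size


scheduler = ToyScheduler([Node("node-1", 4096), Node("node-2", 2048)])
first = scheduler.allocate("app-A", 3072)   # fits on node-1
second = scheduler.allocate("app-B", 2048)  # node-1 is too full, goes to node-2
third = scheduler.allocate("app-C", 2048)   # nothing left that is large enough
print(first, second, third)
```

The point of the sketch is only that allocation is a central decision made against a cluster-wide view of free resources; the policy itself is pluggable.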

Node Manager

The Node Manager works on the instructions given by the Resource Manager. It manages user jobs and workflows on its node. It keeps the Resource Manager up to date by sending regular heartbeats with its status. It also monitors the resource usage of containers (such as CPU) and performs log management.
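The heartbeat mechanism can be sketched as a small liveness tracker on the Resource Manager side. This is a hedged illustration only: `HeartbeatTracker`, its method names, and the 10-second expiry are hypothetical and do not correspond to YARN classes or defaults. Timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatTracker:
    """Hypothetical Resource Manager-side view of Node Manager liveness."""

    def __init__(self, expiry_seconds: float):
        self.expiry_seconds = expiry_seconds
        self.last_seen = {}    # node name -> time of the last heartbeat
        self.last_status = {}  # node name -> last reported status

    def heartbeat(self, node: str, now: float, status: dict) -> None:
        """Record a heartbeat and the status the node reported with it."""
        self.last_seen[node] = now
        self.last_status[node] = status

    def live_nodes(self, now: float) -> list:
        """Nodes whose last heartbeat is not older than the expiry window."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen <= self.expiry_seconds
        )


tracker = HeartbeatTracker(expiry_seconds=10)
tracker.heartbeat("node-1", now=0.0, status={"containers": 2})
tracker.heartbeat("node-2", now=0.0, status={"containers": 1})
tracker.heartbeat("node-1", now=8.0, status={"containers": 3})

live = tracker.live_nodes(now=15.0)  # node-2 has been silent for 15 seconds
print(live)
```

A node that stops sending heartbeats simply ages out of the live set, which is how a central manager can notice a failed worker without polling it.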

YARN Architecture

Application Master

The Application Master is the per-application process that coordinates running the job. Its main function is to negotiate resources from the Resource Manager and to work with the Node Manager to execute tasks. It is also responsible for fault management within the application. It periodically sends heartbeats to the Resource Manager to report its status and to negotiate for resources.


Container

A container is a bundle of physical resources, such as RAM and CPU cores, on a single node. Containers are launched through a Container Launch Context (CLC). This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process. A container grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
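The record described above can be pictured as a plain data structure. The sketch below is an assumption-laden illustration, not YARN's Java API: the class names, field names, the HDFS path, and the command are all hypothetical and chosen only to mirror the items listed in the paragraph.

```python
from dataclasses import dataclass, field


@dataclass
class LaunchRecordSketch:
    """Hypothetical mirror of the launch record described in the text."""
    environment: dict = field(default_factory=dict)      # environment variables
    local_resources: dict = field(default_factory=dict)  # name -> remote location
    security_tokens: bytes = b""                         # opaque credentials
    service_data: dict = field(default_factory=dict)     # payload for Node Manager services
    commands: list = field(default_factory=list)         # commands that create the process


@dataclass
class ContainerSketch:
    """A grant of specific resources on a specific host."""
    host: str
    memory_mb: int
    vcores: int
    launch: LaunchRecordSketch

    def command_line(self) -> str:
        """Join the launch commands into a single shell line."""
        return " && ".join(self.launch.commands)


record = LaunchRecordSketch(
    environment={"JAVA_HOME": "/usr/lib/jvm/java"},                 # illustrative value
    local_resources={"job.jar": "hdfs:///apps/wordcount/job.jar"},  # hypothetical path
    commands=["cd work", "java -jar job.jar"],                      # hypothetical command
)
container = ContainerSketch(host="node-1", memory_mb=2048, vcores=1, launch=record)
print(container.command_line())
```

Keeping the resource grant (host, memory, cores) separate from the launch record reflects the distinction in the text between what a container is allowed to use and what it is told to run.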

YARN Workflow

  • Client submits an application.

  • The Resource Manager allocates a container to start the Application Master.

  • The Application Master registers with the Resource Manager.

  • The Application Master asks the Resource Manager for the containers it needs.

  • The Application Master notifies the Node Manager to launch the containers.

  • The application code is executed in the containers.

  • The Application Master reports the application's status and diagnostics to the Resource Manager so that the application can be monitored.

  • The Application Master unregisters from the Resource Manager once the application finishes.
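The workflow above can be traced with a toy simulation. This is a sketch under stated assumptions: `run_workflow` and the event strings are hypothetical, "rm" stands for the Resource Manager and "am" for the Application Master, and no real scheduling or process launching takes place.

```python
def run_workflow(app_name: str, containers_needed: int) -> list:
    """Return the ordered list of events for one toy application run."""
    events = []
    events.append(f"client: submit {app_name}")
    events.append("rm: allocate container for application master")
    events.append("am: register with rm")
    events.append(f"am: request {containers_needed} containers from rm")

    # In this toy model the Resource Manager grants every requested container.
    granted = [f"container-{i}" for i in range(1, containers_needed + 1)]
    for container in granted:
        events.append(f"am: ask node manager to launch {container}")
        events.append(f"{container}: run application code")

    events.append("am: report status to rm")
    events.append("am: unregister from rm")
    return events


events = run_workflow("wordcount", containers_needed=2)
for event in events:
    print(event)
```

Reading the trace top to bottom shows the division of labor: the Resource Manager only hands out containers, while the per-application process drives the launches and cleans up after itself.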
