MapReduce and YARN Part 2: Hadoop Processing Unit
In my previous article, we learned about MapReduce. In this one, we will focus on YARN, which extends what Hadoop can do. YARN is not a competitor to MapReduce but a framework that helps Hadoop perform better. It was introduced in Hadoop 2 and is also referred to as MapReduce 2 (MRv2).
The original MapReduce design had some major issues, such as centralized handling of job control flow and tight coupling of the programming model with the resource management infrastructure. With YARN, Hadoop is open to third-party processing frameworks and other kinds of workloads, not just MapReduce. YARN splits the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The main components of YARN are:
The Resource Manager is responsible for distributing cluster resources and decides how the available resources are allocated to applications. It tries to keep all resources in the cluster in use, which improves system utilization. The Scheduler in the Resource Manager is responsible for partitioning the cluster among the various applications. The Applications Manager in the Resource Manager is responsible for accepting job submissions and provides the service for restarting the Application Master container in case of failure.
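To make the Scheduler's role concrete, here is a minimal sketch of handing out cluster resources to incoming requests. It is a toy first-fit model, not how YARN's real schedulers work (those use queues and policies); the `Node` class, the `allocate` function, and all node names and sizes are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_memory_mb: int
    free_vcores: int


def allocate(nodes, memory_mb, vcores):
    """Return the name of the first node that can fit the request, or None."""
    for node in nodes:
        if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
            # Reserve the resources so later requests see the reduced capacity.
            node.free_memory_mb -= memory_mb
            node.free_vcores -= vcores
            return node.name
    return None


# Hypothetical two-node cluster.
cluster = [Node("node-1", 4096, 4), Node("node-2", 8192, 8)]

first = allocate(cluster, 3072, 2)   # fits on node-1
second = allocate(cluster, 3072, 2)  # node-1 has only 1024 MB left, so node-2
third = allocate(cluster, 16384, 1)  # no node is large enough

print(first, second, third)  # node-1 node-2 None
```

The third request is rejected rather than queued, which keeps the sketch short; a real scheduler would typically hold the request until resources free up.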
The Node Manager works on the instructions given by the Resource Manager. It manages user jobs and workflows on its node and keeps the Resource Manager up to date by sending regular heartbeats with its status. It also monitors container resource usage, such as CPU and memory, and performs log management.
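The heartbeat idea can be sketched in a few lines: the receiving side records when it last heard from each node and treats a node as lost after a period of silence. This is only an illustration under assumed names and values; `HeartbeatTracker`, the 10-second timeout, and the node names are hypothetical and do not reflect YARN's actual classes or defaults.

```python
class HeartbeatTracker:
    """Toy liveness tracker: a node counts as lost after `timeout_s` of silence."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of the last heartbeat, in seconds

    def heartbeat(self, node, now_s):
        # Record the most recent time this node reported in.
        self.last_seen[node] = now_s

    def lost_nodes(self, now_s):
        # A node is lost when its last heartbeat is older than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )


tracker = HeartbeatTracker(timeout_s=10)
tracker.heartbeat("node-1", now_s=0)
tracker.heartbeat("node-2", now_s=0)
tracker.heartbeat("node-1", now_s=8)  # node-1 keeps reporting, node-2 goes silent

lost = tracker.lost_nodes(now_s=12)
print(lost)  # ['node-2']
```

Time is passed in explicitly instead of being read from a clock, so the example stays deterministic.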
The Application Master is the per-application process that coordinates a running job. Its main function is to negotiate resources from the Resource Manager and work with the Node Managers to execute tasks. It is also responsible for fault management within the application. It periodically sends heartbeats to the Resource Manager to report its status and to request resources.
A Container is a collection of physical resources, such as RAM and CPU cores, on a single node. Containers are launched using a Container Launch Context (CLC). This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process. A Container grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
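The fields of that record are easier to picture as a data structure. The sketch below models them with plain Python dataclasses; it is not the YARN API, and the class names, field names, and the `hdfs:///apps/demo/app.jar` path are all hypothetical choices for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LaunchContext:
    """Toy stand-in for the launch record described above."""
    environment: dict = field(default_factory=dict)      # environment variables
    local_resources: dict = field(default_factory=dict)  # file name -> remote URI
    tokens: bytes = b""                                   # opaque security tokens
    service_data: dict = field(default_factory=dict)     # payload for Node Manager services
    commands: list = field(default_factory=list)         # command that creates the process


@dataclass
class Container:
    host: str
    memory_mb: int
    vcores: int
    context: LaunchContext


def describe(container):
    """Summarize what the container grants and what it will run."""
    command = " ".join(container.context.commands)
    return (f"{container.memory_mb} MB / {container.vcores} vcores "
            f"on {container.host}: {command}")


ctx = LaunchContext(
    environment={"APP_MODE": "demo"},
    local_resources={"app.jar": "hdfs:///apps/demo/app.jar"},  # hypothetical path
    commands=["java", "-jar", "app.jar"],
)
container = Container(host="node-1", memory_mb=2048, vcores=2, context=ctx)
summary = describe(container)
print(summary)  # 2048 MB / 2 vcores on node-1: java -jar app.jar
```

Keeping the resource grant (`host`, `memory_mb`, `vcores`) separate from the launch record mirrors the distinction in the text between what a container is allowed to use and what it is told to run.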
An application runs on YARN through the following steps:
1. The client submits an application.
2. The Resource Manager allocates a container in which to start the Application Master.
3. The Application Master registers with the Resource Manager.
4. The Application Master requests the required containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch the containers.
6. The application code is executed in the containers.
7. The Application Master sends application diagnostics to the Resource Manager so that the application's status can be monitored.
8. When the application finishes, the Application Master unregisters from the Resource Manager.
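The sequence of steps above can be traced with a small simulation that just records each event in order. It does not talk to a cluster; the component that registers with the Resource Manager and requests containers is modeled here as the Application Master, and `run_application`, the application id, and the container names are hypothetical.

```python
def run_application(app_id, containers_needed):
    """Return the ordered list of events for one toy application run."""
    log = [
        f"client submits {app_id}",
        "Resource Manager allocates a container for the Application Master",
        "Application Master registers with the Resource Manager",
        f"Application Master requests {containers_needed} containers",
    ]
    # Each granted container is launched by a Node Manager, then runs the code.
    for i in range(1, containers_needed + 1):
        name = f"container-{i}"
        log.append(f"Node Manager launches {name}")
        log.append(f"application code runs in {name}")
    log.append("Application Master reports diagnostics to the Resource Manager")
    log.append("Application Master unregisters from the Resource Manager")
    return log


events = run_application("app-1", containers_needed=2)
for step, event in enumerate(events, start=1):
    print(step, event)
```

With two containers the trace has ten events: four before the containers start, two per container, and two at the end.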