This post describes a prototype implementation of a
simple PAAS built on the Hadoop YARN framework and the key findings from the
experiment. While there are some advantages to using Hadoop YARN, there is at
least one unsolved issue that would be difficult to overcome at this point.
Note that this work was done as part of research at SAP Labs.
YARN is the next generation Hadoop MapReduce architecture.
It is designed to be more flexible architecturally, improve scalability, and
achieve a higher resource utilization rate, among other things. Although YARN
remains similar to the old Hadoop MapReduce architecture (let’s call it Hadoop
MR1), the two are different enough that most components have been rewritten and
the same terminology can no longer be used for both.
In short, the Hadoop YARN architecture (also called MR2)
splits the two major functions of the MR1 JobTracker into two separate
components: a central Resource Manager and a per-application Application
Master. This new architecture not only supports the old MapReduce programming
model, but also opens up the possibility of new data processing models.
|Hadoop Yarn framework|
A large scale PAAS offering needs to solve many of the same
problems that Hadoop already addresses.
- Hadoop is a framework for running applications on large clusters of
commodity hardware. This commodity characteristic drives down costs and avoids
vendor lock-in.
- Many companies have already deployed a Hadoop cluster for big data
processing, so using it as a PAAS can be easier, in terms of development and
operational adaptation, than adopting a separate PAAS solution. It can also
improve resource utilization, since one Hadoop cluster can serve both big data
processing and PAAS workloads.
- Hadoop YARN’s three-layer architecture (client,
Hadoop YARN core, application instances) looks like a good fit for PAAS.
- Hadoop YARN’s monitoring and resource management
capabilities (i.e. Resource Manager and Node Managers) seem promising for PAAS.
- Hadoop is a Java framework, which suits many enterprise companies in the
sense that they already have plenty of Java expertise in-house.
- With HBase, Hadoop can be used as a NoSQL database,
which can be exposed as a PAAS service.
Before we jump to the prototype design and implementation,
let’s briefly take a look at a well-known PAAS, VMware’s Cloud Foundry. The
following gap analysis highlights missing components in Hadoop YARN that would
need to be added in order to create a YARN-based PaaS offering. Note that we
consider the CloudFoundry.org open source offering here, not the
CloudFoundry.com commercial offering.
|VMware's Cloud Foundry|
Cloud Foundry has three distinctive architectural layers: the
client layer, the core layer, and the services and application layer. Interestingly,
this three-layer architecture is very similar to the Hadoop YARN architecture.
- Cloud Foundry: Has a command-line management client called ‘vmc’ to
provision/update applications, manage services, and get information about
applications.
- Hadoop YARN: Develop a generic YARN client to handle commands (e.g. push,
start, stop, …).
- Cloud Foundry: Supports WAR-file-based application provisioning (e.g. vmc
push myapp.war).
- Hadoop YARN: We can mimic the same push behavior in the YARN client
mentioned above, assuming that web containers (e.g. Jetty or Tomcat) are
launched and run as application containers on the nodes.
Cloud Foundry System Layer – Core Architecture
- Cloud Foundry: Has a Router component (actually Nginx) to route request
traffic.
- Hadoop YARN: There is nothing like the Router in Hadoop YARN; we need a
high-performance HTTP proxy to play a similar role.
- Cloud Foundry: The Health Manager wakes up periodically, performs health
checks, and applies a remedy if necessary.
- Hadoop YARN: The Application Master, in coordination with the Resource
Manager, can perform a similar role.
- Cloud Foundry: Uses NATS as a messaging backbone among the various
components to coordinate synchronization and update state.
- Hadoop YARN: We need a synchronization service among Hadoop components and
non-Hadoop components (e.g. the Router). HBase uses ZooKeeper to provide
similar functionality.
- Cloud Foundry: The Cloud Controller manages the lifecycle of applications.
- Hadoop YARN: The Resource Manager and Node Managers play a similar role.
Droplet Execution Engines Layer – Running Services and App Running Environment
- Cloud Foundry: The Droplet Execution Engine is the environment in which
services and applications run.
- Hadoop YARN: A container is YARN’s shell environment for running an
application instance.
- Cloud Foundry: Uses Tomcat as the default Java web container in the Droplet
Execution Engine.
- Hadoop YARN: We can use any embeddable web container (e.g. Jetty).
- Cloud Foundry: Can register/unregister services (MySQL, NoSQL services, and
so on) and provision applications integrated with those services.
- Hadoop YARN: This is one of the biggest gaps between Cloud Foundry and
Hadoop YARN; there is no service support in Hadoop YARN.
- Cloud Foundry: Supports service-level multi-tenancy. For example, a user can
provision a database service (e.g. MySQL) and bind it to his/her application,
which means the database service instance is dedicated to that application’s
instances.
- Hadoop YARN: Support is limited; only container-level multi-tenancy is
available.
As we saw in the previous section, Hadoop YARN is missing
some PAAS components and capabilities when compared to Cloud Foundry. But it
still has many desirable features for a PAAS.
In order to better understand the characteristics of such a
system, we attempted to implement a basic PAAS prototype by leveraging built-in
functionality as much as possible and filling the gaps with additional services
where needed.
The prototype was scoped to very basic PAAS functionality:
- A generic command-line YARN client to provision applications, start
application instances, and stop a number of running application instances
- Only Java web applications are supported, and only WAR files as application
provisioning packages
The following features were not requirements of the prototype:
- Auto scaling (elasticity)
- Multi-tenancy support
- Service provisioning (relational or NoSQL database services)
|PaaS Prototype Architecture|
To work with the Hadoop YARN framework, three Java projects were created:
- PAAS Client (PaasClient)
- PAAS Application Master (PaasAppMaster)
- PAAS Application Container (PaasAppContainer)
Beyond that, since we needed a router to serve as the entry point for incoming
application requests and to route them to the corresponding application
instances, we took LittleProxy, a Java implementation of an HTTP proxy,
slightly modified it, and used it.
Furthermore, to synchronize the states of application instances among the PAAS
components, ZooKeeper was used.
PAAS Client (PaasClient)
PaasClient is a dedicated YARN client that works like a
command shell to process PAAS commands. It’s written in Java. Users (i.e.
operators) can issue commands in the client to provision application files
(i.e. war files), start or stop application instances, get a list of instance
information, and get YARN application status.
Here are some example commands:
push HelloWorld.war
Provisions the HelloWorld application to the Hadoop file system so that the
PAAS Application Container can pick it up later
start HelloWorld 2048 2
Starts 2 instances of the HelloWorld service with a maximum memory limit of 2G
stop HelloWorld 3
Stops 3 instances of HelloWorld
There is also a command that returns information about running HelloWorld
instances.
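Since every command follows the same `<action> <arguments…>` shape, the client can dispatch on the first token. A minimal sketch of that tokenization, assuming hypothetical class and field names (the post does not show the actual PaasClient parsing code):

```java
import java.util.Arrays;

// Hypothetical sketch of how PaasClient might tokenize a command line
// such as "start HelloWorld 2048 2" before dispatching it.
public class PaasCommand {
    final String action;  // push, start, stop, ...
    final String[] args;  // remaining tokens, e.g. app name, memory, count

    PaasCommand(String line) {
        String[] tokens = line.trim().split("\\s+");
        this.action = tokens[0];
        this.args = Arrays.copyOfRange(tokens, 1, tokens.length);
    }
}
```

For "start HelloWorld 2048 2", `action` would be "start" and `args` would hold the application name, the memory limit in MB, and the instance count.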
PAAS Application Master (PaasAppMaster)
PaasAppMaster is a YARN application master that manages the lifecycle of PAAS
application instances. For each ‘start’ command, one instance of PaasAppMaster
gets created, which then tries to create the requested number of
PaasAppContainer instances. It lives until all the PaasAppContainer instances
stop or fail.
PAAS Application Container (PaasAppContainer) – Jetty Web Container
PaasAppContainer is instantiated by PaasAppMaster as a YARN container
according to the requested resource limit (e.g. a memory limit of 2G). It is a
wrapper around the embedded Jetty web container that loads the WAR file for
the requested service (e.g. ‘start HelloWorld 2048 2’), and interacts with
ZooKeeper to register and unregister itself so that the Router can update its
routing table.
Router
The Router’s main function is to route and distribute application requests to
the running application instances. We used LittleProxy, a Java implementation
of an HTTP proxy built on the Netty asynchronous networking framework, and
modified it so that it maintains a routing table in ZooKeeper and routes
requests based on that table.
Whenever a PaasAppContainer instance registers or unregisters with ZooKeeper,
the Router is notified of the event and updates the routing table accordingly.
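The routing state itself reduces to a small data structure: a list of instance URLs per application, updated on register/unregister events, with some distribution policy for picking an instance. A sketch of the idea, assuming round-robin distribution (the class is hypothetical; in the actual Router this state is driven by ZooKeeper notifications):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the Router's in-memory routing table.
// register/unregister would be called from ZooKeeper watch callbacks.
public class RoutingTable {
    private final Map<String, List<String>> routes = new HashMap<>();
    private final Map<String, Integer> cursor = new HashMap<>();

    public synchronized void register(String app, String instanceUrl) {
        routes.computeIfAbsent(app, k -> new ArrayList<>()).add(instanceUrl);
    }

    public synchronized void unregister(String app, String instanceUrl) {
        List<String> urls = routes.get(app);
        if (urls != null) urls.remove(instanceUrl);
    }

    // Pick the next instance for an application, round-robin.
    public synchronized String route(String app) {
        List<String> urls = routes.get(app);
        if (urls == null || urls.isEmpty()) return null;
        int i = cursor.merge(app, 1, Integer::sum) % urls.size();
        return urls.get(i);
    }
}
```

The distribution policy is not specified in the post; round-robin is used here only as a plausible default.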
ZooKeeper
ZooKeeper is a centralized service that provides distributed synchronization,
configuration maintenance, and group services. ZooKeeper was used in the
prototype for two main reasons:
- To keep track of running application instances. A PaasAppContainer instance
registers or unregisters itself with ZooKeeper when it starts or stops. This
information is used by the Router.
- To stop application instances in coordination with PaasAppContainer
instances, as YARN does not provide functionality to stop individual
containers.
For simplicity, application WAR files are provisioned to the Hadoop file
system (HDFS) so that PaasAppContainer instances can pick them up when a
‘start’ command gets executed. This happens in the following order:
An admin user issues a push command in PaasClient like
“push /Users/PAAS/svcA.war”, where svcA.war is the web application file of the
service named ‘svcA’. The PaasClient then uploads the file to the Hadoop
server by using ‘scp’ and ‘ssh’. There is a dedicated directory in the Hadoop
file system for WAR files (e.g. /PAAS/), while non-application libraries and
jar files are stored in a different directory.
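In other words, the push step amounts to deriving the service name from the WAR file name and copying the file to the dedicated directory. A small sketch of that mapping, assuming /PAAS/ as the WAR directory per the convention above (the helper class and method names are hypothetical):

```java
import java.nio.file.Paths;

// Hypothetical helper: maps a local WAR path to its service name and
// its target location in the dedicated HDFS directory (/PAAS/).
public class WarPaths {
    static final String WAR_DIR = "/PAAS/";

    public static String serviceName(String localWarPath) {
        String file = Paths.get(localWarPath).getFileName().toString();
        if (!file.endsWith(".war")) {
            throw new IllegalArgumentException("not a WAR file: " + file);
        }
        return file.substring(0, file.length() - ".war".length());
    }

    public static String hdfsTarget(String localWarPath) {
        return WAR_DIR + serviceName(localWarPath) + ".war";
    }
}
```

So “push /Users/PAAS/svcA.war” would provision the service ‘svcA’ as /PAAS/svcA.war.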
Assuming that all the PAAS system jar files (e.g. PaasAppContainer and
PaasAppMaster) and an application WAR file “svcA.war” are already provisioned,
here are the steps to start up application instances:
- The admin user issues a start command in PaasClient like “start svcA 2048 3”,
which starts 3 “svcA” application instances with a 2G memory limit each.
- PaasClient connects to the Hadoop Resource Manager and requests one
container for PaasAppMaster. Once the container is allocated, it issues a
request to start PaasAppMaster in that container with parameters such as
which application to run and how many instances to start.
- PaasAppMaster connects to the Resource Manager in turn and requests as many
containers as the requested number of instances. After the containers are
allocated, it issues requests to start up PaasAppContainer instances for the
application “svcA”.
- When the PaasAppContainer instances start, they load the corresponding WAR
file from the Hadoop file system and start the embedded Jetty web container.
After the Jetty web container has started successfully, the PaasAppContainer
instance registers itself with ZooKeeper under the service URL of the Jetty
web container.
- The Router is notified of this registration event and updates its routing
table. At this point, the application instance is ready to serve app users’
requests.
|Flow - Start instances|
Because Hadoop YARN does not provide a way to stop an individual container,
the prototype uses ZooKeeper’s synchronization capability to process stop
commands:
- The admin user issues a stop command in PaasClient like “stop svcA 2”, which
means stop 2 instances of the svcA application.
- PaasClient randomly chooses as many instances as requested from among the
running app instances and deletes the corresponding registration information
from ZooKeeper.
- The corresponding PaasAppContainer instances are notified of these deletion
(i.e. unregistration) events and stop themselves. Although a more intelligent
container stopping process is possible, such as waiting a reasonable time for
existing sessions to complete their work before stopping the container, we
chose the simplest approach of stopping right away for the prototype. This is
certainly a limitation of the prototype.
|Flow - Stop instances|
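The selection step in the stop flow is simply a random sample of the registered instances. A sketch of that choice (the class name is hypothetical; the prototype would delete the chosen instances’ registration znodes from ZooKeeper rather than return them):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of how PaasClient might pick which running
// instances to stop when processing e.g. "stop svcA 2".
public class StopSelector {
    // Choose up to 'count' instances at random from the running set.
    public static List<String> choose(List<String> running, int count) {
        List<String> copy = new ArrayList<>(running);
        Collections.shuffle(copy);
        return copy.subList(0, Math.min(count, copy.size()));
    }
}
```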
We used Hadoop version 0.23.1. The Hadoop YARN cluster consisted of 5 VMs. One
VM was dedicated to the Resource Manager, while the others were application
nodes. The Router (modified LittleProxy) and ZooKeeper ran on a separate
machine.
We found several positives while developing and testing the prototype:
- Prototype development was fast. Within a week, one person could implement
the basic flow, including starting application instances, even though he had
very limited knowledge of YARN at the beginning.
- As expected, resource management (container allocation based on a resource
requirement) is done well.
- The Hadoop file system is certainly good enough as an application
provisioning repository.
We also found some negatives:
- The current version of Hadoop YARN supports only memory-based container
allocation, but we’re told this is going to be improved in coming releases.
- There is a very limited set of monitoring APIs. For example, there is no
easy way to monitor the health state of each application container (e.g.
memory consumption, and more). But again, we’re told that the Hadoop team is
working on more monitoring capabilities now.
There is also one known issue with the prototype:
- Each ‘start’ command creates a new instance of PaasAppMaster, even though
there might already be a running PaasAppMaster. This might happen, for
example, when a user with a running application tries to add more
PaasAppContainer instances to handle additional load. After we had implemented
the prototype, we were told that this problem can be fixed by using the
YarnRPC protocol, but we didn’t try to fix it due to development time
constraints.
We successfully implemented a PAAS prototype whose scope is limited to basic
PAAS functionality (i.e. provisioning applications and starting/running/
stopping multiple application instances, in a random fashion, in a Hadoop
cluster). For the most part, we were able to fill the gaps we identified at
the beginning of the process and built a working PAAS relatively easily and
quickly. Hadoop YARN’s memory-based resource management worked very well for
the PAAS prototype.
We stopped the experiment because a few critical limitations would take
significant work to overcome in order to use Hadoop YARN as a PAAS, including
Hadoop YARN’s programming model (i.e. the map-reduce or DAG programming model)
and its limited monitoring capabilities.
We have demonstrated that it is already usable as a PAAS for a limited scope
by adding additional components and services. Over the long term, Hadoop YARN
is a promising foundation for a PAAS, considering that the architecture is
still evolving and the above-mentioned problems may yet be addressed.
Prototype Source Code
The prototype implementation is hosted on GitHub.