Thank you for reading our Data Lake 3.0 series! In Part I of the series, we briefly introduced the power of leveraging prepackaged applications in Data Lake 3.0 and how the focus will shift from platform management to solving business problems. In this post, we elaborate further on this idea to help answer questions on how a multi-colored YARN will play a critical role in building such a successful Data Lake 3.0.
Apache Hadoop YARN is the modern resource-management platform that enables applications to share a common infrastructure of servers and storage. YARN is now morphing into a multi-colored platform of choice! YARN’s vision has always been to enable Hadoop to run many different workloads. The next steps in the journey are about dialing up the workload diversity and making the creation and deployment of modern data apps easy. Without further ado, let’s first recap how YARN has acted as the platform of choice thus far, before elaborating on the evolution of a multi-colored YARN as part of Data Lake 3.0.
Towards a multi-colored YARN: apps, services, and assemblies.
Data Lake 2.0: YARN as the Platform
Apache Hadoop YARN is built as a general-purpose resource-management platform. YARN’s core concepts are applications, containers, and resources. A container is a virtualized execution environment where a set of processes or tasks utilize the physical resources of the underlying machine. Administrators set up a bunch of machines to support multiple such containers. Users then write applications, each a set of tasks or processes executing in a collection of containers.
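To make these relationships concrete, here is a minimal sketch of the concepts as a data model. This is purely illustrative pseudostructure, not the actual YARN API: the class and field names below are invented for the example.

```python
# Illustrative sketch only -- not the actual YARN API. Models the
# relationship between YARN's core concepts: containers hold slices
# of a machine's resources, and an application is a set of tasks
# running in a collection of containers.
from dataclasses import dataclass, field

@dataclass
class Resource:
    memory_mb: int
    vcores: int

@dataclass
class Container:
    container_id: str
    host: str
    resource: Resource  # slice of the host's physical resources

@dataclass
class Application:
    app_id: str
    containers: list = field(default_factory=list)

    def total_resources(self) -> Resource:
        # An application's footprint is the sum of its containers' resources.
        return Resource(
            memory_mb=sum(c.resource.memory_mb for c in self.containers),
            vcores=sum(c.resource.vcores for c in self.containers),
        )

app = Application("application_0001")
app.containers.append(Container("c1", "node1", Resource(2048, 1)))
app.containers.append(Container("c2", "node2", Resource(4096, 2)))
print(app.total_resources())  # Resource(memory_mb=6144, vcores=3)
```

The scheduler's job, in essence, is to grant such containers to applications out of the cluster's shared pool of machine resources.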
"YARN's core concepts: applications, containers, and resources."
Making use of these concepts of applications and containers, YARN has been used successfully to run all sorts of data applications. These applications can all coexist on a shared infrastructure managed through YARN’s centralized scheduling.
YARN’s Key Strengths
YARN is being used in production at a wide variety of organizations to host a wide variety of data-intensive applications, such as batch workloads (Hadoop MapReduce), interactive query processing (Apache Hive, Apache Tez, Apache Spark), and real-time processing (Apache Storm). For those of you familiar with its history, it originated out of a need to evolve Hadoop to support not just MapReduce but any arbitrary processing engine. As more of the engines mentioned above came to the fore over time, YARN’s core architectural design has served their needs well, needing only occasional incremental improvements. Over the years, YARN has easily supported a wide spectrum of frameworks.
The power of YARN is not limited to just enabling all these different programming paradigms on shared datasets (typically over a distributed storage system like HDFS) and physical hardware. YARN brings to the table a variety of platform features that users rely on for an end-to-end big data success story. YARN can apply its key strengths — cost-effective resource management, powerful scheduling primitives, resource isolation, and multi-tenancy — to a myriad of resources, varying from small pools of special-purpose machines to datacenter-scale infrastructure built out of commodity hardware.
Data Lake 3.0: A Multi-Colored YARN
YARN is the data operating system that powers our Data Lake 3.0 vision. While YARN initially focused on large-scale but short-running apps (often also referred to simply as jobs), it is also the perfect platform to run long-running services, as well as apps that mix both. YARN’s scheduler and its key abstractions are general enough to support running a variety of applications, including batch jobs, long-running streaming apps, and classical services. However, what separates YARN from others is its special support for data-intensive applications.
- The scheduler supports rich placement strategies like data locality, so that applications can be placed close to their data. It has sophisticated algorithms to allow the efficient and incremental exchange of data-locality information from data-intensive applications as they progress through their multiple parallel phases, while dealing with the fact that data is often also replicated.
- YARN offers a distributed cache for caching both binaries and data on the local machines where the real work is done.
- YARN provides local temporary storage that data processing apps can leverage. This is also useful for longer-running service-like applications.
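The first bullet, locality-aware placement, can be sketched in a few lines. This is a deliberately simplified toy, not YARN's actual scheduling algorithm (which also handles rack-locality, delay scheduling, queues, and fairness); the function and parameter names are invented for the illustration.

```python
# Illustrative sketch of locality-aware placement (not YARN's actual
# scheduler). Given the hosts holding a block's replicas, prefer a
# node holding a replica; otherwise fall back to any node with
# free capacity.
def place_task(block_replicas, free_slots):
    """block_replicas: hosts holding a replica of the task's data.
    free_slots: dict mapping host -> free container slots."""
    # First preference: node-local placement on a host with a replica.
    for host in block_replicas:
        if free_slots.get(host, 0) > 0:
            return host, "node-local"
    # Fallback: any host with capacity (rack-local/off-rack in real YARN).
    for host, slots in free_slots.items():
        if slots > 0:
            return host, "remote"
    return None, "wait"  # real schedulers may delay briefly to win locality

print(place_task(["node1", "node2"], {"node1": 0, "node2": 3, "node3": 5}))
# -> ('node2', 'node-local')
```

Because data is replicated, the scheduler usually has several node-local candidates to choose from, which is what makes locality-aware placement effective in practice.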
Extending YARN’s inherent capabilities to handle data-intensive applications, we are seeing significant signals of a perfect storm enabled by two major drivers. On the business front, our advanced users are looking to solve end-to-end business problems as the next phase in the big-data maturity curve. On the technology front, we are seeing wide adoption of containerized workloads, which provide ease of distribution, packaging, and isolation.
Market Drivers: End-to-End Business Use Cases
Let’s revisit the historical way the Hadoop ecosystem has been built over time. Since the beginning, the Apache ecosystem has focused on singular storage and compute engines, each addressing a specific problem in the larger big-data space. This is akin to the Unix mantra of “doing one thing and doing it well.”
So far, this approach has served the developer community and the user base well. Developers could zoom into a single (set of) problem(s) with undivided attention and solve them all the way through. Users could then bring together these different but well-integrated tools to address their business use-cases.
Use Cases Evolution
During the past few years, though, end-to-end business use-cases have evolved to another level.
- The end-to-end business problems are now mostly solved by multiple applications working together.
- As the platform matured, users have increasingly wanted to focus solely on the business application layers, and are impatient to get on with developing their main business logic.
- However, YARN (and, for that matter, any other related platform) hasn’t catered to this evolving need, leaving users to unwillingly get involved in the painstaking details of wiring applications together, keeping them up, manually scaling them as the need arises, etc.
Manual plumbing of all these different-colored services is tiresome! Further, there is a clear need for seamless aggregate deployment, lifecycle management, and application wire-up. This is the gap that needs to be bridged between what these end-to-end business use-cases need from the platform and what the platform offers today. If these features are provided, then business use-case authors can focus singularly on the business logic.
"Modern data applications — assemblies — span across multiple tools and must be 100X easier to build, wire up, deploy, manage, monitor, scale, secure, and govern!"
We thus want to enable businesses to care less about the infrastructure and more about driving the end-to-end use-cases. We call this end-to-end business application an Assembly. Further, as a starting assumption, applications need to be composable and reusable. Once a service (like Kafka on YARN) or an end-to-end application (like an IoT app) is made to work well, other members of the community should be able to simply build more complex structures using these existing components.
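The composition idea can be sketched as follows. The spec format, function names, and service names ("kafka", "storm", "hbase") below are all invented for illustration; this is not a real YARN assembly specification, just a toy showing how known-good services could be reused as components of a larger assembly.

```python
# Hypothetical illustration of an assembly as a composable, reusable
# unit. The spec format is invented for this sketch, not a YARN API.
def make_service(name, instances, depends_on=()):
    return {"name": name, "instances": instances,
            "depends_on": list(depends_on)}

def make_assembly(name, services):
    return {"name": name, "services": services}

# Reuse existing, known-good service definitions...
kafka = make_service("kafka", instances=3)
storm = make_service("storm", instances=5, depends_on=["kafka"])
hbase = make_service("hbase", instances=4)

# ...and compose them into an end-to-end IoT assembly.
iot_app = make_assembly("iot-telemetry", [kafka, storm, hbase])

def startup_order(assembly):
    # Topological sort so dependencies start before their dependents --
    # one of the wire-up chores the platform could automate.
    started, order = set(), []
    pending = list(assembly["services"])
    while pending:
        for svc in pending:
            if all(d in started for d in svc["depends_on"]):
                started.add(svc["name"])
                order.append(svc["name"])
                pending.remove(svc)
                break
        else:
            raise ValueError("dependency cycle")
    return order

print(startup_order(iot_app))  # ['kafka', 'storm', 'hbase']
```

The point of the sketch: once services are declared as reusable components with explicit dependencies, chores like startup ordering, wire-up, and aggregate scaling become platform concerns rather than per-application plumbing.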
Why Not Static Management?
You may be wondering: Why not statically manage all these applications, services, and assemblies?
Absent a unified platform supporting these complex applications in a first-class manner, administrators have to resort to manual, upfront, and error-prone capacity planning and static allocation of resources to their use-cases.
This approach quickly leads to a place where operators manage one cluster of jobs and interactive queries running on YARN today, a separate set of machines running, say, HBase, and ten more machines running Kafka, Storm, and the like. Part of this is also done today for firmer isolation between batch, ad hoc, and streaming applications.
Now, imagine managing thousands of such machines (at scale) in the presence of various types of faults (hardware, network, data center, etc.). The YARN cluster is already equipped to deal with failures, so nodes and racks going down and coming back up are not a problem there. However, if you lose a few machines in the separate HBase or Kafka clusters, the operator will have to issue a site alert for a likely SLA-miss scenario, manually move machines around, and rewire the network if need be before the service comes back up.
(Admins do typically leave some buffer resources, leading to waste, for each of these clusters, but when significant faults hit, the need for manual intervention remains all the same.)
Similarly, if the business application sees unprecedented success, scaling the entire service translates to manually calculating and scaling individual services, a non-trivial task to perform while keeping the lights on. One cannot react to hardware or utilization changes without manual intervention in this model.
This type of ad hoc management works at a small scale but is not desirable at larger scales, given the ubiquity of hardware failures and the need for upfront capacity planning, manual scaling, and elasticity. This is fundamentally the same resource-management problem that YARN was built to address!
YARN Next: Assembly of Use Cases!
Simplified deployment and scaling, enhanced discovery, management, monitoring of assemblies as a unit are some of the needs from the platform. An assembly can further be a fundamental unit of version control (of business logic), component-reuse, and security.
App, assemblies, and the YARN platform.
Why not build assemblies manually? Beyond simple applications and services, manually managed assemblies are a much tougher problem for both operators and application developers:
- By definition, they are much more complicated to manage manually than individual services.
- They have more complex discovery needs — Service A needs to find Service B when B itself may be dynamically running on an arbitrary set of machines.
- Starting and stopping all of an assembly’s components together is not simple.
- Scaling the entire assembly together as a unit is hard, say when a CXO comes in and says, “We’ve got more input coming in; I don’t care how you scale individual pieces, but do scale the entire machinery within the next X minutes.”
- Further, beyond the individual components, end users are typically interested in managing the assembly as a unit: monitoring it, getting its metrics, etc. For example, at a high level, before worrying about whether a single HBase RegionServer is down, the interest is in whether the assembly as a whole is healthy.
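That last point, the assembly-level view, can be sketched as a simple aggregation. The component names and the two-state health model below are invented for the illustration; a real platform would track richer per-component state.

```python
# Sketch of the "assembly as a unit" view: the assembly's health is
# an aggregate over its components. Names and states are illustrative.
def assembly_health(component_status):
    """component_status: dict component -> 'healthy' | 'unhealthy'."""
    unhealthy = [c for c, s in component_status.items() if s != "healthy"]
    # End users first want the aggregate answer; the per-component
    # detail (e.g. which HBase RegionServer is down) comes second.
    return ("healthy" if not unhealthy else "unhealthy", unhealthy)

status = {"kafka": "healthy", "storm": "healthy",
          "hbase-regionserver-3": "unhealthy"}
print(assembly_health(status))  # ('unhealthy', ['hbase-regionserver-3'])
```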
Having the platform enable automated management of assemblies frees up significant productivity towards building and managing higher order apps.
Technology Drivers: The Containerization Revolution
On the technology front, there is another revolution underway in the industry: containers. Simply put, containers are a lightweight virtualization mechanism for executing programs in isolation, popularized by the open-source technology Docker. While restricted to processes, they offer the same isolation and resource-management benefits as virtual machines, but with very little overhead. Further, they have packaging mechanisms offering the same management simplicity as VM images.
YARN always had a notion of a logical container. It can be a single application process, a group of processes forming a process-tree, or a process-tree set under a memory or CPU cgroup.
With Docker, we can now also enable users to leverage industry-standard packaging of bits.
The packaging story is one of the cornerstones of enabling varied types of applications. To this end, the YARN community has been working on native integration of Docker containers in YARN. The primary effort in this area is support for container runtimes in YARN, so that in addition to process-tree containers, one can run Docker containers.
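To give a flavor of this, YARN's Docker integration lets an application request the Docker runtime per container through environment variables in the container's launch context. The sketch below shows the idea; check your Hadoop version's documentation for the exact variable names and the required cluster-side configuration, and note that the image name here is just a placeholder.

```python
# Sketch of how an application might request the Docker runtime for
# a container via its launch environment. The image name is a
# placeholder; consult the Hadoop docs for your version before use.
launch_env = {
    # Select the Docker runtime instead of the default process runtime.
    "YARN_CONTAINER_RUNTIME_TYPE": "docker",
    # The image to run the container's command inside.
    "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": "library/ubuntu:16.04",
}

def runtime_for(env):
    # Per-container choice: Docker if requested, else the classic
    # process-tree container -- both managed by the same scheduler.
    return env.get("YARN_CONTAINER_RUNTIME_TYPE", "default")

print(runtime_for(launch_env))  # docker
print(runtime_for({}))          # default
```

The key design point is that the runtime is a per-container choice, so Docker containers and process-tree containers can coexist in the same application and on the same cluster.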
And to top it all off, irrespective of the container type, users can make use of the same platform features: isolation, queuing models, scheduling strategies, etc.
Conclusion: Containers, First-Class Services, and Assemblies on YARN
The aforementioned new use-cases still deserve the same set of powerful platform features that short-running, disparate jobs have long enjoyed: multi-tenancy, massive scale, security, elastic sharing, etc. Not reinventing the wheel and simply reusing these platform features will also be a massive productivity boost.
We are close to delivering a kaleidoscopic YARN to encompass all these different use-cases, with much more agility.
To this end, the YARN community is working towards enabling containers, long-running services, and complex assemblies in a first-class manner. YARN as a technology has always had the right foundations to support a wide variety of applications and services. So, the next leg of our journey will focus on simplified application authoring and packaging, simplified and first-class services, and the notion of reusable, composable assemblies.
Please stay tuned for more upcoming blogs in our Data Lake 3.0 series, where we attempt to shed more light on some of the concrete sub-efforts happening in the Apache Hadoop YARN community. We will follow up this post with another exciting blog that puts it all together by showcasing a deep-learning TensorFlow assembly deployed on YARN’s cluster-wide resources (including GPUs).