Apache Hadoop 3.1: A Giant Leap for Big Data
Apache Hadoop 3.1, building on Apache Hadoop 3.0, is the core enabling big data technology as we march into the fourth industrial revolution.
(Image source: Nvidia Blog, "Into the Woods: This Drone Goes Where No GPS Can")
When we head outdoors, many of us wish for a camera that is intelligent enough to follow us, adjust to terrain heights, and visually navigate through obstacles, all while capturing panoramic video. I am talking about autonomous, self-flying drones, very similar to cars on autopilot. The difference is that we are starting to see artificial intelligence proliferate into affordable, everyday use cases rather than only into relatively expensive cars. These new use cases mean:
- They will need parallel compute processing to crunch through an insane amount of data (visual or otherwise) in real time, both for inference and for training deep learning neural networks. This helps them distinguish between objects and get better with more data. Think of it as a roughly 100x leap in compute processing, driven by the real-time nature of these use cases.
- They will need deep learning software frameworks that data scientists and data engineers can deploy as containerized microservices, closer to the data and more quickly: a leap from weeks to minutes to deploy.
- They will generate tons of data to analyze: a leap from petabyte scale to exabyte scale.
A Giant Leap
Recently, Roni Fontaine at Hortonworks published an article titled How Apache Hadoop 3 Adds Value Over Apache Hadoop 2, capturing the high-level themes. Now, we are glad to announce the official general availability of Apache Hadoop 3.1. This might seem like a small step, but it is a giant leap for the big data ecosystem. Apache Hadoop 3.1, building on Apache Hadoop 3.0, is the core enabling big data technology as we march into the fourth industrial revolution. This blog series picks up where last year's Data Lake 3.0 series left off; here we will capture our technology deep dives, performance results, and joint blogs with our valued partners in the ecosystem.
The diagram below captures the building blocks at a high level. To tie this back to a fictitious self-flying drone company: the company collects tons of raw images from its test drones' built-in cameras for computer vision. Those images can be stored in the Apache Hadoop data lake in a cost-effective (with erasure coding) yet highly available manner (multiple standby NameNodes). Instead of giving each data scientist a dedicated GPU machine, GPU cards are pooled across the cluster for access by multiple data scientists, and the GPU cards in each server can be isolated for sharing between multiple users.
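As a sketch of what GPU pooling looks like in practice, the snippet below requests GPUs for a container through YARN's extended resource types. It assumes a Hadoop 3.1 cluster where the `yarn.io/gpu` resource type has been declared in `resource-types.xml` and the GPU resource plugin enabled in `yarn-site.xml`; the jar path is illustrative.

```shell
# Ask YARN for a container with 2 pooled GPUs alongside memory and vcores
# (requires GPU scheduling to be enabled on the cluster; path is illustrative).
yarn jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -shell_command nvidia-smi \
  -container_resources memory-mb=3072,vcores=1,yarn.io/gpu=2 \
  -num_containers 1
```

Because `yarn.io/gpu` is just another countable resource to the scheduler, the same pool is shared fairly across all data scientists submitting jobs.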
Support for Docker containerized workloads means that data scientists and data engineers can bring the deep learning frameworks to the Apache Hadoop data lake, with no need for a separate compute/GPU cluster. GPU pooling allows the deep learning neural network algorithms to be applied, and the data-intensive models to be trained on the data collected in the data lake, almost 100x faster than on regular CPUs.
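A minimal sketch of what "bringing the framework to the data" looks like: the environment variables below tell YARN's container runtime to launch the task inside a Docker image. It assumes a cluster with the Docker runtime enabled in `yarn-site.xml` and the registry trusted in `container-executor.cfg`; the image name and jar path are illustrative.

```shell
# Run a TensorFlow command inside a Docker container on YARN
# (cluster must have the Docker container runtime enabled; names illustrative).
yarn jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
  -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tensorflow/tensorflow:latest \
  -shell_command "python -c 'import tensorflow as tf; print(tf.__version__)'" \
  -num_containers 1
```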
If the customer wants to pool FPGA (field-programmable gate array) resources instead of GPUs, this is also possible in Apache Hadoop 3.1. Additionally, the use of affinity and anti-affinity placement constraints lets us control how microservices are deployed across the cluster; some components can be marked anti-affine so that they are always deployed on separate physical servers.
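To illustrate anti-affinity, Hadoop 3.1's distributed shell exposes placement constraints through a `-placement_spec` option. The sketch below asks for three "zk" containers with no two on the same node; the exact spec grammar shown here is illustrative, so check the distributed shell help on your cluster before relying on it.

```shell
# Sketch: 3 containers tagged "zk", anti-affine with each other at node scope
# (placement-spec syntax is illustrative; verify against your Hadoop 3.1 docs).
yarn jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -shell_command sleep -shell_args 900 \
  -placement_spec "zk=3,NOTIN,NODE,zk"
```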
Now, the trained deep learning models can be deployed in the drones in the outdoors, which will then bring the data back to the data lake. Additionally, the YARN native services API exposes the powerful YARN framework programmatically and in a templatized manner. This is key to building an ecosystem of microservices on the Apache Hadoop data lake powered by YARN.
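As a sketch of the templatized style this API enables, a service can be described declaratively in a JSON "Yarnfile" and launched with a single command. The service and component names below are illustrative, and a Hadoop 3.1 cluster with the YARN service framework enabled is assumed.

```shell
# A minimal YARN service spec: 2 long-running containers, declared as data
# (names and resource sizes are illustrative).
cat > sleeper.json <<'EOF'
{
  "name": "sleeper-service",
  "version": "1.0.0",
  "components": [
    {
      "name": "sleeper",
      "number_of_containers": 2,
      "launch_command": "sleep 900000",
      "resource": { "cpus": 1, "memory": "256" }
    }
  ]
}
EOF
yarn app -launch sleeper-service sleeper.json
```

The same spec can be POSTed to the services REST endpoint instead of using the CLI, which is what makes the framework programmable for an ecosystem of microservices.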
To summarize, Apache Hadoop 3.x architecture enables various use cases:
- Agility: Containerization support provides isolation and packaging of workloads and lets us lift and shift existing as well as new workloads, such as deep learning frameworks (TensorFlow, Caffe, etc.). This enables a data-intensive microservices architecture through the native YARN services API and brings the microservices closer to where the data is, avoiding yet another cluster hosting the microservices away from the data.
- New use cases such as graphical processing unit (GPU) pooling/isolation: GPUs are expensive resources, and we want to enable our data scientists to share them quickly for proofs of concept. GPU pooling and isolation help democratize access across the company or department.
- Low total cost of ownership: Erasure coding helps reduce the storage overhead from 200% (with 3x replication) to 50% as the volume of data grows.
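The arithmetic behind that last claim is straightforward: with 3x replication, every block is stored three times, so two of the three copies are overhead; with a Reed-Solomon RS(6,3) policy, six data blocks get three parity blocks. A quick sketch (the function names are mine, not Hadoop APIs):

```python
# Storage overhead = extra bytes stored / useful bytes stored.

def replication_overhead(replicas: int) -> float:
    """N-way replication: every copy beyond the first is pure overhead."""
    return (replicas - 1) / 1.0

def ec_overhead(data_units: int, parity_units: int) -> float:
    """RS(data, parity) erasure coding: only the parity units are overhead."""
    return parity_units / data_units

print(replication_overhead(3))  # 2.0 -> 200% overhead with 3x replication
print(ec_overhead(6, 3))        # 0.5 -> 50% overhead with RS(6,3)
```

So the same durability budget stores roughly half as many extra bytes, which is where the "200% to 50%" figure comes from.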
Please stay tuned to this blog series — we will bring a lot of exciting content! You can also read more about Hadoop 3.1 here.
Published at DZone with permission of Wangda Tan, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.