Kubernetes, Alluxio, and the Disaggregated Analytics Stack

Alluxio announces support for Kubernetes Helm charts and illustrates how to bring the Kubernetes analytics stack together.

A Kubernetes stack isn't always as neat as you'd like.

First, the news: Alluxio support for Kubernetes Helm charts is now available, and Kubernetes is a certified environment for Alluxio. Now the takeaway: Alluxio brings back data locality for the disaggregated analytics stack in Kubernetes. How? Read on.

There’s no arguing about the rise of containers in real-world deployments over the past few years. Containers simplify running applications in any environment, and Kubernetes further transforms the way software and applications are deployed and scaled, regardless of environment. In fact, Kubernetes is increasingly seen as a key technology that enables easy resource orchestration not only in the data center but also in hybrid and multi-cloud environments.

While containers and Kubernetes work exceptionally well for stateless applications like web servers and even completely self-contained databases like MongoDB, Couchbase, and others, the stack looks a bit different in the world of advanced analytics and AI.

The modern analytical stack is highly disaggregated. Unlike traditional databases or data warehouses, the new stack is split into layers that are chosen separately:

  1. Pick a data lake or two or three to store data (S3, GCS, HDFS, etc.)
  2. Pick a computational framework to analyze data (Apache Spark, Presto, Hive, TensorFlow, etc.)
  3. Make sure all the other dependencies like the catalog service are available (Hive Metastore, AWS Glue, KMS, etc.)

Challenges Running the Disaggregated Analytics Stack in Kubernetes

Kubernetes greatly simplifies deploying so many distributed systems together, and over time, advanced analytics running on Kubernetes clusters will become the norm. But a few critical gaps still need to be closed to make this modern analytical stack effective.

Challenge #1 – No Shared Data Access/Caching Layer in the Kubernetes Cluster

Kubernetes is a fantastic container orchestration technology, and tools like Helm charts, operators, and more greatly simplify deployment. However, data-intensive workloads like advanced analytics typically need to share data between jobs to be effective, so that the output of one job can easily be accessed by the next. Without a data access/caching layer, that data has to be written back to the data lake and then read back into the Kubernetes cluster, which significantly slows down data pipelines.
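To make the cost of that round trip concrete, here is a minimal Python sketch of a two-job pipeline. It is a toy model, not Alluxio code: `DataLake`, `InClusterCache`, and `run_pipeline` are hypothetical names invented for illustration, and a "remote operation" is just a counter standing in for a network round trip to something like S3.

```python
class DataLake:
    """Stand-in for remote storage such as S3; counts every remote round trip."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.remote_ops = 0

    def put(self, key, value):
        self.remote_ops += 1
        self.objects[key] = value

    def get(self, key):
        self.remote_ops += 1
        return self.objects[key]


class InClusterCache:
    """Hypothetical shared layer living inside the Kubernetes cluster."""

    def __init__(self):
        self.entries = {}

    def put(self, key, value):
        self.entries[key] = value

    def get(self, key):
        return self.entries[key]


def run_pipeline(lake, handoff):
    """Job 1 cleans the raw data; job 2 sums whatever job 1 handed off."""
    raw = lake.get("raw")  # job 1 always reads its input from the lake
    handoff.put("cleaned", [x for x in raw if x is not None])
    return sum(handoff.get("cleaned"))  # job 2 reads job 1's output


raw = {"raw": [3, None, 4, 5]}

# Without a shared layer, the lake itself is the only place to hand data off.
lake_only = DataLake(raw)
total_without_cache = run_pipeline(lake_only, handoff=lake_only)

# With a shared layer, only the initial read leaves the cluster.
lake_with_cache = DataLake(raw)
total_with_cache = run_pipeline(lake_with_cache, handoff=InClusterCache())

print(total_without_cache, lake_only.remote_ops)       # 12 3
print(total_with_cache, lake_with_cache.remote_ops)    # 12 1
```

Both runs produce the same result, but the lake-only run pays for three remote operations where the shared-layer run pays for one. In a real pipeline with many stages and large intermediate datasets, that gap is what slows things down.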

Challenge #2 – Lost Data Locality

With data stored in S3, other cloud object stores, or on-prem Hadoop, users have two options for running analytics within the Kubernetes cluster. Either the data is accessed remotely (meaning poor performance), or it is manually copied into the Kubernetes cluster (meaning more DevOps and management work for each workload). Copying often brings the added burden of reconciling differences between the copies, which can be hard. The ideal solution is to recreate data locality in this disaggregated stack.
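One way to picture a middle path between "always remote" and "copy by hand" is a read-through cache: the first read of a path goes to remote storage, and later reads are served from a local copy that nobody had to manage manually. The Python sketch below illustrates that general pattern only. `ReadThroughCache` and `fetch_from_object_store` are hypothetical names, the `s3://` path is made up, and nothing here reflects how Alluxio is actually implemented.

```python
class ReadThroughCache:
    """Serves reads locally when possible, fetching from remote storage on a miss."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable: path -> data
        self._local = {}
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self._local:
            self.hits += 1
        else:
            self.misses += 1
            # First access: pull the data in once and keep it near the compute.
            self._local[path] = self._fetch_remote(path)
        return self._local[path]


remote_calls = []


def fetch_from_object_store(path):
    """Hypothetical slow remote read; records each call so we can count them."""
    remote_calls.append(path)
    return f"contents of {path}"


cache = ReadThroughCache(fetch_from_object_store)
for _ in range(3):
    data = cache.read("s3://example-bucket/events/part-0000")

print(len(remote_calls), cache.misses, cache.hits)  # 1 1 2
```

Three reads cost one remote fetch. The application keeps addressing the data by its original path, so there is no second copy with its own name to keep in sync by hand.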

Challenge #3 – No Data Elasticity for Elastic Compute

The beauty of Kubernetes is the flexibility it gives to even the most complex compute workloads: scale up, scale down, upgrade, restart, and more, based on need and demand. But data-intensive workloads still depend on data being available to compute. For compute to scale in, out, up, or down, the data within Kubernetes needs to scale the same way; otherwise the workload cannot take advantage of the flexibility Kubernetes brings.
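What "elastic data" could look like in practice: when a cache worker is added, only a fraction of the cached blocks should have to move. Consistent hashing is one well-known technique with that property. The sketch below is an illustration under that assumption. The article does not say how Alluxio places data, and `HashRing`, the worker names, and the block names are all hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes to spread load across workers."""

    def __init__(self, workers, vnodes=50):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, worker)
        for worker in workers:
            self.add_worker(worker)

    def add_worker(self, worker):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{worker}#{i}"), worker))

    def owner(self, key):
        # The owner is the first ring position at or after the key, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"block-{i}" for i in range(1000)]
ring = HashRing(["worker-0", "worker-1", "worker-2"])
before = {k: ring.owner(k) for k in keys}

ring.add_worker("worker-3")  # scale the cache tier out by one worker
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} blocks changed owner")
```

Every block that changes owner lands on the new worker, and the rest stay where they were. That is the behavior an elastic data layer needs if it is going to follow compute as it scales.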

Data orchestration can solve these challenges by syncing data into the Kubernetes cluster and allowing for seamless in-memory data access and flexibility to share data across jobs and scale in or out as needed.

The News

Alluxio has had a Docker container for a while, but with Alluxio version 2.1, Kubernetes becomes a first-class environment for Alluxio, backed by advanced testing and certification on Kubernetes. We are now seeing more production deployments that run Alluxio alongside compute frameworks like Presto and Spark in Kubernetes.

Also new in version 2.1, Alluxio can be deployed via Helm charts.

Published at DZone with permission of Dipti Borkar. See the original article here.

Opinions expressed by DZone contributors are their own.
