Enabling Hyper-scale Data Workloads in the Cloud
Alluxio discusses hyper-scale data workloads and the features in its latest software release.
This article describes how the open source Alluxio community approaches the challenge of supporting next-generation, hyper-scale, data-driven workloads in a cloud environment. The Alluxio team just announced the availability of the Alluxio 2.0 Preview Release, the largest open source release in the project's history, with the most new features and improvements to address these challenges and enable more innovation in the stack.
The Ideation and Design Phase
When the core project team of the Alluxio open source community started to think about the next big release many months ago, there were a few overarching goals that they wanted to achieve. While Alluxio 1.x already enabled data locality and data accessibility for many big data workloads in the cloud, there was still innovation needed in key areas.
- Design a step-function change at scale: Alluxio is the data orchestration layer between compute and storage, making data mobile and more accessible across many different storage systems (HDFS, object stores, network-attached storage). Over time, the scale of metadata that Alluxio needs to support could easily surpass that of the largest Hadoop deployments. Metadata management has been a known weak spot for Hadoop; Alluxio should turn it into a strength.
- Power more data-driven workloads: Alluxio was created with a focus on Hadoop-based compute workloads. Over the years, however, the number and types of data-intensive compute workloads have exploded, and the data orchestration and engineering needed to enable these workloads on existing or new data storage systems has been non-trivial. In particular, a lot of data engineering, including manual data movement, is needed before machine learning and deep learning training. Alluxio should greatly simplify this by providing a familiar, native API to data scientists while reducing the data engineering required.
- Make separation of storage and compute easier: Data silos across the enterprise are only increasing, with data spread across multiple Hadoop clusters, increasingly across many different object stores, and in several cases stored on premises or in the public cloud. This has made it harder to disaggregate compute from data, because data locality and access are severely affected when processing moves to a different place than where the data is stored. Alluxio should continue to enable separation of compute and storage by abstracting storage while making data more accessible.
With these lofty goals in mind, the engineering and product teams designed, implemented, tested, and stress-tested some more, turning Alluxio 2.0 into reality. The Alluxio 2.0 Preview Release is the largest open source release from the Alluxio community, with the most new features and improvements since the creation of the project. It is now available for download.
The Advancements and Features
Alluxio 2.0 includes many enhancements to support the design goals of the project, all of which are open source and will be included in the Community Edition!
How to Support Hyper-Scale Data Workloads
- Support for more than 1 billion files: Alluxio 2.0 introduces a new option for tiered metadata management to support single-cluster deployments with more than a billion files. Off-heap storage, backed by RocksDB, is now the default. Metadata for hot data continues to be stored on heap in process memory, while the rest is managed by Alluxio outside process memory. The property "alluxio.master.metastore" can be configured to keep all metadata on heap instead.
- Highly distributed data services: Alluxio 2.0 introduces the Alluxio Job Service, a distributed, clustered service that now runs data operations such as replication, persistence, cross-storage move, and distributed load, enabling high performance at massive scale. Take a look at all the file system APIs Alluxio supports.
- Adaptive replication for increased data locality: A new feature lets you configure a range for the number of copies of data stored in Alluxio, which Alluxio then manages automatically. The properties "alluxio.user.file.replication.max" and "alluxio.user.file.replication.min" specify the range. A full list of all the user configurations can be found here.
- High availability with the embedded journal: The embedded journal is a new fault-tolerance and high-availability mode for file and object metadata. It uses the Raft consensus algorithm and is independent of any external storage system. This is particularly helpful when abstracting object storage. Learn about configuring embedded journal here.
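The tiered metadata idea from the first bullet can be illustrated with a small sketch. This is not Alluxio's implementation: the class name `TieredMetadataStore`, the LRU eviction policy, and the plain dictionary standing in for the off-heap RocksDB store are all assumptions chosen to show how a bounded "hot" tier can sit in front of a larger "cold" tier.

```python
from collections import OrderedDict


class TieredMetadataStore:
    """Toy two-tier store (hypothetical, for illustration only).

    `hot` stands in for on-heap metadata and is bounded in size.
    `cold` stands in for the off-heap store (RocksDB in Alluxio 2.0).
    """

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # least recently used entry comes first
        self.cold = {}

    def put(self, path, meta):
        # A fresh write always lands in the hot tier; drop any stale cold copy.
        self.cold.pop(path, None)
        self.hot[path] = meta
        self.hot.move_to_end(path)
        self._evict_if_needed()

    def get(self, path):
        if path in self.hot:
            self.hot.move_to_end(path)  # mark as recently used
            return self.hot[path]
        if path in self.cold:
            # Promote cold metadata back into the hot tier on access.
            meta = self.cold.pop(path)
            self.put(path, meta)
            return meta
        return None

    def _evict_if_needed(self):
        # Spill the least recently used entries to the cold tier.
        while len(self.hot) > self.hot_capacity:
            old_path, old_meta = self.hot.popitem(last=False)
            self.cold[old_path] = old_meta


if __name__ == "__main__":
    store = TieredMetadataStore(hot_capacity=2)
    for name in ("/a", "/b", "/c"):
        store.put(name, {"length": 0})
    print("hot:", list(store.hot), "cold:", list(store.cold))
```

With a hot capacity of 2, writing three paths leaves the oldest one in the cold tier, and reading it again promotes it while spilling another entry. A real system would persist the cold tier to disk, which is what lets the total file count grow past what fits in process memory.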
How to Enable Machine Learning and Deep Learning Workloads on Any Storage
Machine learning and deep learning frameworks need to extract data from Hadoop and object stores, typically a very manual and time-consuming process.
- Alluxio POSIX API: Alluxio’s FUSE feature provides a POSIX-compatible API so that frameworks like TensorFlow and Caffe, as well as other Python-based tools, can directly access data from any storage system via Alluxio using traditional file system access. Learn more about the POSIX API.
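Because a FUSE mount looks like an ordinary directory, a training script needs nothing beyond standard file I/O to read from it. The sketch below assumes Alluxio has already been mounted at a hypothetical path, `/mnt/alluxio-fuse`; the helper names and the `.csv` suffix are illustrative and not part of any Alluxio API.

```python
import os

# Hypothetical mount point; the real path depends on how the FUSE mount was set up.
MOUNT_ROOT = "/mnt/alluxio-fuse"


def list_training_files(mount_root, suffix=".csv"):
    """Walk a directory tree and return the matching file paths, sorted."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(mount_root):
        for name in filenames:
            if name.endswith(suffix):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def read_lines(path):
    """Read a text file with plain POSIX-style I/O and strip newlines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


if __name__ == "__main__":
    # os.walk yields nothing if the mount point does not exist,
    # so this simply prints an empty list on a machine without the mount.
    print(list_training_files(MOUNT_ROOT))
```

Nothing in the script refers to Alluxio, HDFS, or an object store. The same code runs against a local directory, which is the point of exposing a POSIX-compatible interface.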
How to Enable Completely Independent and Elastic Compute
- Support for HDFS clusters across different versions: The explosive growth of data has left enterprises with many data silos, including multiple Hadoop clusters running many different versions. Unified access across these clusters is currently very difficult. With Alluxio 2.0, users can connect multiple HDFS clusters of any supported version to Alluxio and unify data access across them. Find the list of supported HDFS versions here.
- Active sync with Hadoop: A new capability integrates with HDFS inotify to pick up data and metadata changes to files stored in Hadoop, so that applications accessing data via Alluxio proactively receive the latest updates.
This is where you come in. With the preview now available, I sincerely hope you give Alluxio 2.0 a try and share your experiences with us. We want to hear what you are excited about, what you think could work better, and what you feel we should focus on next. I personally look forward to hearing your stories. Reach out to the Alluxio community on Slack or by email at email@example.com.
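One way to picture unified access across several HDFS clusters is a mount table that maps paths in a single namespace onto different underlying clusters. The sketch below is a conceptual illustration, not Alluxio's code: the `MountTable` class, the longest-prefix matching rule, and the cluster hostnames are all assumptions.

```python
class MountTable:
    """Toy mount table (hypothetical): maps one namespace onto several stores."""

    def __init__(self):
        self._mounts = {}  # mount point in the unified namespace -> base URI

    def mount(self, mount_point, ufs_uri):
        # Normalise trailing slashes so "/warehouse/" and "/warehouse" match.
        key = mount_point.rstrip("/") or "/"
        self._mounts[key] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Return the underlying URI for `path` using the longest matching mount."""
        best = None
        for mount_point in self._mounts:
            prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        remainder = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{remainder}" if remainder else base


if __name__ == "__main__":
    table = MountTable()
    # Hostnames and ports below are made up for the example.
    table.mount("/warehouse", "hdfs://cluster-a:8020/data")
    table.mount("/warehouse/archive", "hdfs://cluster-b:9000/old")
    print(table.resolve("/warehouse/sales/part-0"))
    print(table.resolve("/warehouse/archive/2018/part-0"))
```

Applications only ever see paths such as `/warehouse/archive/...`. Which cluster, and which HDFS version, actually serves the data is decided by the table, so a cluster can be added or swapped without changing application code.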