Data Orchestration Platform Powers Multi-Cloud Analytics and AI
Additional capabilities simplify and accelerate multi-cloud, data-intensive workload adoption and deployment.
I had the opportunity to speak with Dipti Borkar, V.P. Product Management and Marketing at Alluxio, to discuss the introduction of Alluxio 2.0 with innovations for data engineers managing and deploying analytical and AI workloads in the cloud, particularly for hybrid and multi-cloud environments.
The rise of compute-intensive workloads and the adoption of the cloud have driven organizations to adopt a decoupled architecture for modern workloads, one in which compute scales independently from storage. While this enables elastic scaling, it introduces new data engineering problems, and this is where an abstraction layer is needed. Just as compute and containers need Kubernetes, data also needs orchestration as data silos multiply: a tier that brings data locality, data accessibility, and data elasticity to compute across data silos, zones, regions, and even clouds.
According to Dipti, developers will like the ability to: 1) pull from different data sources using REST APIs; 2) manage a lot more files from different locations; and, 3) easily have all policies defined at a folder or bucket level and synchronize data so it's consistent and fresh.
Alluxio 2.0 Community Edition and Enterprise Edition include new capabilities across critical areas that are gaps in today’s cloud data engineering market.
Breakthrough Data Orchestration Innovation for Multi-Cloud
Policy-Driven Data Management
Alluxio 2.0 includes a new capability that lets data engineers automate data movement across storage systems on an ongoing basis using predefined policies. As data is created and as hot, warm, and cold data is managed, Alluxio can automatically tier data across any number of storage systems, both on-premises and in any cloud.
Data platform teams can now reduce storage costs by automatically keeping only the most important data in expensive storage systems and moving other data to cheaper storage alternatives.
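To make the idea of policy-driven tiering concrete, here is a minimal sketch of how a hot/warm/cold policy might decide where a file belongs. It is not Alluxio's API or configuration: the tier names, the age thresholds, and the helpers choose_tier and plan_moves are all hypothetical, chosen only to illustrate the concept.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds and tier names; not Alluxio configuration.
HOT_MAX_AGE = timedelta(days=7)
WARM_MAX_AGE = timedelta(days=90)


@dataclass
class FileInfo:
    path: str
    last_access: datetime


def choose_tier(info: FileInfo, now: datetime) -> str:
    """Return the tier a file belongs in, based on time since last access."""
    age = now - info.last_access
    if age <= HOT_MAX_AGE:
        return "fast"      # hot data: expensive, low-latency storage
    if age <= WARM_MAX_AGE:
        return "standard"  # warm data: regular object storage
    return "archive"       # cold data: cheapest storage


def plan_moves(files, current_tier, now):
    """List (path, from_tier, to_tier) for every file whose tier should change."""
    moves = []
    for info in files:
        target = choose_tier(info, now)
        source = current_tier.get(info.path)
        if source != target:
            moves.append((info.path, source, target))
    return moves
```

A scheduler could call plan_moves periodically and hand the resulting list to whatever service actually copies the data, which is the part the platform automates.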
Improved Administration of Data Access Policies
In addition to fine-grained policies at the file level, users can now configure policies at any directory or folder level to streamline data access as well as workload performance. These include defining behaviors for individual datasets on core functions such as writing data or syncing data with the storage systems under Alluxio.
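One way to picture directory-level policies is as settings that are inherited down the tree, with deeper directories overriding their parents. The sketch below assumes that inheritance model, which the announcement does not spell out. The POLICIES table, its keys and values, and the effective_policy helper are illustrative and do not reflect Alluxio's actual property names.

```python
from pathlib import PurePosixPath

# Hypothetical policy table: directory -> settings (illustrative names only).
POLICIES = {
    "/": {"write": "cache_only", "sync_interval_s": 3600},
    "/warehouse": {"write": "write_through"},
    "/warehouse/events": {"sync_interval_s": 60},
}


def effective_policy(path: str, policies=POLICIES) -> dict:
    """Merge policies from the root down to the path itself; deeper entries win."""
    p = PurePosixPath(path)
    # p.parents is ordered nearest-first, so reverse it to start at the root.
    chain = list(p.parents)[::-1] + [p]
    merged = {}
    for node in chain:
        merged.update(policies.get(str(node), {}))
    return merged
```

With this table, a file under /warehouse/events picks up the write mode set on /warehouse and the shorter sync interval set on /warehouse/events, while a file elsewhere falls back to the root defaults.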
Efficient Cross-Cloud Data Movement via Data Service
The new data service allows for highly efficient data movement, including across cloud stores like AWS S3 and Google Cloud Storage (GCS), making expensive operations on object storage seamless to the compute framework.
Compute Optimized Data Access for Cloud Analytics
Compute-Focused Cluster Partitioning
Users can now partition a single Alluxio cluster based on any dimension so that data sets for each framework or workload aren't contaminated by the others. The most common usage is partitioning the cluster by framework, such as Spark or Presto. In addition, partitioning reduces data transfer costs by constraining data to stay within a specific zone or region.
Integration With External Data Sources Over REST
Users can now bring in data from web-based data sources and aggregate it in Alluxio for analytics. Alluxio can simply be pointed at any web location with files, which are then pulled in as needed based on the query or model run.
AWS Support
AWS Elastic MapReduce (EMR) Service Integration
As users move to cloud services to deploy analytical and AI workloads, services like AWS EMR are increasingly used. Alluxio can now be seamlessly bootstrapped into an AWS EMR cluster making it available as a data layer within EMR for Spark, Presto, and Hive frameworks. Users now have a high-performance alternative to cache data from S3 or remote data while also reducing data copies maintained in EMR.
Architectural Foundations Using Open Source
Many core foundational elements have been re-architected using the best open source technologies with a vision of hyper-scale.
RocksDB is now used for tiering metadata of files and objects for data that Alluxio manages to enable hyper-scale.
gRPC, Google’s highly efficient RPC framework, is now the core transport protocol used for communication within the cluster as well as between the Alluxio client and master, making communications more efficient.
“Whether by design or by departmental necessity, companies are facing an explosion of data that is spread across hybrid and multi-cloud environments. To maintain a competitive advantage, speed and depth of insight have become the requirement,” said Steven Mih, CEO, Alluxio. “Data-driven analytics that once ran over many hours now needs to be done in seconds. AI/ML models need to be trained against larger and larger data sets. This all points to the necessity of a data tier which orchestrates the movement and policy-driven access of a company's data, wherever it may be stored. Alluxio abstracts the storage and enables a self-service culture within today's data-driven company.”
Other Features Include
Highly Distributed Data Services - 2.0 introduces the Alluxio Data Service, a distributed clustered service that handles data operations such as replication and persistence, enabling high performance and massive scale.
Adaptive Replication for Increased Data Locality - New feature to configure a range for the number of copies of data stored in Alluxio that are automatically managed.
High Availability with Embedded Journal - A new fault tolerance and high availability mode for file and object metadata, called the embedded journal, that uses the Raft consensus algorithm and is independent of any external storage system. This is particularly helpful for abstracting object storage.
Alluxio POSIX API - Alluxio’s FUSE feature enables a POSIX-compatible API so that frameworks like TensorFlow, Caffe, and other Python-based models can directly access data from any storage system via Alluxio using traditional file system access.
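As an illustration of what POSIX-style access means in practice, the sketch below reads data with nothing but Python's standard file calls. It assumes Alluxio has already been mounted through FUSE at some local directory. The mount point /mnt/alluxio and the helper names list_dataset and read_chunks are hypothetical; nothing in the code is specific to Alluxio, which is the point of a POSIX-compatible interface.

```python
import os

# Hypothetical mount point; the real path depends on how the FUSE mount is set up.
ALLUXIO_MOUNT = "/mnt/alluxio"


def list_dataset(root: str, suffix: str = ".csv"):
    """Walk a directory tree with ordinary POSIX calls and return matching files."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def read_chunks(path: str, chunk_size: int = 1 << 20):
    """Yield a file's contents in chunks using the standard open() call."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

A training script could call list_dataset(os.path.join(ALLUXIO_MOUNT, "training")) and feed each file to its usual loader, without knowing which storage system sits underneath.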