How To Build a Self-Serve Data Architecture for Presto Across Clouds

This article highlights the synergy between two open-source projects and demonstrates how together they deliver a self-serve data architecture across clouds.

By Jasmine Wang · Adit Madan · Bin Fan · Mar. 11, 22 · Analysis

This article highlights the synergy between two widely adopted open-source projects, Alluxio and Presto, and demonstrates how together they deliver a self-serve data architecture across clouds.

What Makes an Architecture Self-Serve?

Condition 1: Evolution of the Data Platform Does Not Require Changes

All data platforms evolve over time: a new data store or compute engine is added, or a new team needs access to shared data. In each of these cases, a data platform is self-serve if it accommodates the evolution without requiring changes.

Condition 2: Isolation Across Teams

With a self-serve platform, business units don’t step on each other. When a new team is introduced, its data access should have no impact on the existing usage of the shared data infrastructure.

Together, these two conditions offer agility, which is often more important than the cost of physical infrastructure.

Data Platform Considerations

Below, we introduce some considerations when designing a self-serve platform, and architectural patterns for simple solutions.

Consideration 1: Data Is Shared

  • Between Compute Frameworks: There are a large number of specialized compute engines. Each engine is best suited to a specific task, which means data needs to be shared between engines; for example, batch ETL followed by interactive queries in Presto.
  • Between Different Teams: For example, one team is responsible for collecting operational data, which is then consumed by multiple other business units.
  • Between Data Centers Across Regions and Cloud Providers: Sharing across environments offers the flexibility to choose the best service in each.

The solution for shared data is to have an abstraction layer across heterogeneous compute. Alluxio provides such an abstraction across clouds for seamless sharing of data between Presto and other compute engines regardless of the data store.

Alluxio Shared Data Platform
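
To make the idea of an abstraction layer concrete, here is a minimal sketch of a mount table that maps logical paths to underlying stores. It is a toy illustration, not Alluxio's API: the `UnifiedNamespace` class, the bucket names, and the HDFS address are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UnifiedNamespace:
    """Toy mount table: logical path prefixes -> underlying store URIs.

    Hypothetical illustration only; this is not Alluxio's API.
    """
    mounts: Dict[str, str] = field(default_factory=dict)

    def mount(self, logical_prefix: str, store_uri: str) -> None:
        # Normalize so "/ops/" and "/ops" name the same mount point.
        self.mounts[logical_prefix.rstrip("/")] = store_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        # Choose the longest matching prefix so nested mounts take priority.
        candidates = [
            prefix for prefix in self.mounts
            if logical_path == prefix or logical_path.startswith(prefix + "/")
        ]
        if not candidates:
            raise KeyError(f"no mount covers {logical_path}")
        prefix = max(candidates, key=len)
        return self.mounts[prefix] + logical_path[len(prefix):]


ns = UnifiedNamespace()
# Assumed locations: one cloud bucket and one on-premises cluster.
ns.mount("/ops", "s3://example-ops-bucket/events")
ns.mount("/finance", "hdfs://example-namenode:8020/warehouse/fin")

# Compute engines only ever see the logical path.
ops_file = ns.resolve("/ops/2022/03/part-0.parquet")
fin_file = ns.resolve("/finance/ledger.parquet")

# Moving the data to another store changes one mount, not the consumers.
ns.mount("/ops", "gs://example-new-bucket/events")
moved_file = ns.resolve("/ops/2022/03/part-0.parquet")
```

Because consumers address data only by logical path, swapping the store behind `/ops` requires no change on their side, which is the property Condition 1 asks for.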

Consideration 2: Data Has Ownership Domains and Processing in Place Is Simple

  • Although replication provides isolation, it makes governance complex, because the owner of the data enforces strict policies about how the data is consumed.
  • Copies introduce redundancy, which is error-prone and has high resource requirements.

It may seem obvious that a solution is to not make copies of data, but what about performance when we don’t move data? This calls for a single abstraction layer that takes care of governance, performance, and movement of data across ownership domains. 

The architecture below shows Presto using the Alluxio layer for access to data regardless of the location.

Presto across a hybrid cloud

The above design can be broken down into a couple of simple cases:

  1. All within a single cloud or a datacenter
  2. Shared across multiple datacenters or a hybrid cloud

In both cases, the separation of the CONSUMER from the PRODUCER of data is enabled by an abstraction layer that provides more than a simple cache. Advanced preloading and write capabilities guarantee SLAs even with the separation of data from compute.

Business units or ownership domains may span regions
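
The difference between a simple cache and a layer with preloading and write support can be sketched as follows. This is a toy model under stated assumptions, not how Alluxio is implemented: `RemoteStore` and `DataLayer` are hypothetical names, and the remote store is an in-memory dictionary that merely counts how often it is read.

```python
from typing import Dict, Iterable


class RemoteStore:
    """Stand-in for a store in another region or cloud; counts remote reads."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self.objects = dict(objects)
        self.remote_reads = 0

    def get(self, key: str) -> bytes:
        self.remote_reads += 1
        return self.objects[key]

    def put(self, key: str, value: bytes) -> None:
        self.objects[key] = value


class DataLayer:
    """Toy layer near compute: read-through cache, preload, write-through."""

    def __init__(self, store: RemoteStore) -> None:
        self.store = store
        self.local: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        # Read-through: fetch from the remote store only on a local miss.
        if key not in self.local:
            self.local[key] = self.store.get(key)
        return self.local[key]

    def preload(self, keys: Iterable[str]) -> None:
        # Warm the local copy ahead of time so consumer reads stay local.
        for key in keys:
            self.read(key)

    def write(self, key: str, value: bytes) -> None:
        # Write-through: keep a local copy and persist to the owning store.
        self.local[key] = value
        self.store.put(key, value)


store = RemoteStore({"/ops/day1": b"a", "/ops/day2": b"b"})
layer = DataLayer(store)

layer.preload(["/ops/day1", "/ops/day2"])
reads_after_preload = store.remote_reads

# Consumer queries after the preload never touch the remote store.
first = layer.read("/ops/day1")
second = layer.read("/ops/day2")
reads_after_queries = store.remote_reads

# A producer writes once; the data is readable locally and persisted remotely.
layer.write("/ops/day3", b"c")
third = layer.read("/ops/day3")
```

In this sketch the preload pays the cross-region cost once, before consumers arrive, and the write path keeps the producer's store authoritative without a separate copy job.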

Conclusion

With a self-serve data architecture across clouds, we construct a solution that stands the test of time as a data platform evolves. 

Published at DZone with permission of Jasmine Wang.
