Presto and Open Analytics
In this article, see why you should be deploying an open analytics stack.
Join the DZone community and get the full member experience.Join For Free
So you have multiple data sources and most of your data has ended up in the cloud, or will soon. It’s a mix of structured and unstructured, static and streaming, in many different formats, and often fractured across data warehouses, open source databases, proprietary databases, data lakes, document stores and object storage like Amazon S3. How best to unify access to that data and share it with internal and external applications and users to support the ever-widening variety of analytical and operational use cases?
It sounds challenging, but there are several solutions. One option is to try and consolidate all the data by moving it into a monolithic (or yet another) database, cloud warehouse or data lake. But this is often impractical, time-consuming, and likely to increase cost, effort, and vendor lock-in. Let’s face it, adding another vendor to the mix is not high on the CIO’s to-do list.
Enter Presto and the Open Analytics Stack
A more successful approach that many data-driven companies like Facebook, Uber, Airbnb, Twitter, LinkedIn, and Netflix have deployed is a modern architecture built on open source Presto - the federated, extensible, distributed SQL query engine which is becoming the dominant standard for analytics. This architecture, built around open source PrestoDB, with open formats, open interfaces, and open cloud features, is being termed the Open Analytics stack.
Why has Open Analytics and Presto’s time come? Here are three of the most significant reasons:
- With Presto, all your data stays exactly where it is, often stored in open, standard formats. You do not re-ingest your data. There’s less (or no) data movement, no copies, no additional ETL. This is crucial as data is often so unwieldy you simply cannot move it around on a whim.
- Presto is perfect for cloud deployment, containerization, and elasticity. This fits with many organizations’ desire to retire their on-premises infrastructure and redeploy systems in the cloud as efficiently as possible, and benefit from variable cloud-based service levels based on their changing needs.
- Presto enables data scientists to run interactive SQL across multiple data sources to obtain the data they need faster, while accelerating the iterative nature of data science. Presto means they spend less time wrangling data.
Where did PrestoDB come from? It was designed at Facebook in 2012 to enable their data analysts to run interactive SQL on multi-petabyte data volumes, initially in Apache Hadoop. Prior to that, they tried using Apache Hive but suffered performance problems - its MapReduce execution engine was okay for batch processing but wasn’t cut out for interactive queries. The following year Facebook open sourced the code, and by 2014 multiple organisations had started using it, such as Netflix who deployed Presto for querying 10 petabytes of their data stored in AWS S3. Features, functions and connectors were steadily added by an active community of contributors including Facebook themselves and in 2019 Facebook donated the PrestoDB project to the Linux Foundation - already the parent of the Cloud Native Computing Foundation (CNCF) amongst others - meaning PrestoDB is completely open source under the Apache 2.0 license.
Today Presto is enjoying rapid adoption as the modern alternative to traditional tightly coupled data architecture, as companies large and small see the benefit of a scalable, distributed, loosely coupled, SQL-based, pluggable, disaggregated stack that enables querying across data repositories. Being disaggregated is a significant industry trend and a hallmark of Presto, but what does it mean? Put simply, it is the separation of compute and storage. In other words, PrestoDB fully abstracts the data sources it can connect to, and data stays where it is. PrestoDB works directly on files in S3 for example, requiring no ETL transformation. Presto is also a proven technology; to illustrate, Amazon Athena, a very commonly used big data service, is built on PrestoDB.
This federated, disaggregated stack is open: Open source (Apache 2.0), open formats (PrestoDB doesn’t use any proprietary formats), open interfaces (PrestoDB is ANSI SQL compatible, uses standard JDBC / ODBC drivers, etc.), and “open cloud” (PrestoDB is cloud agnostic, and because it runs as a query engine without native storage, it aligns perfectly with containers and can be run on any cloud).
Presto’s architecture is distributed and parallel in design which means it scales very efficiently to handle data of all sizes, from GBs to PBs. A Presto cluster uses a coordinator/worker server architecture:
PrestoDB together with the following 3 elements make up the Open Analytics stack:
- A Metadata Catalog - Maps the data in storage engines like HDFS and Amazon S3 to schemas and tables. The metadata catalog lives in a database such as MySQL or Postgres and is accessed via the Hive metastore service. PrestoDB also works with the AWS Glue catalog. Presto connectors are provided for whichever metadata catalog you choose.
- A Transaction Manager - Ingest of complex and large volumes of data can lead to inconsistency without some careful design choices. Native object storage (like S3 buckets) and some databases (like Amazon Redshift) don’t support upserts, aka single merge statements. Support for transactions, rollbacks, snapshots, schema enforcement, and ACID consistency is spotty. Data engineers may opt to load the deltas (changes) when large volumes are in play, but it can be hard to build and maintain this kind of functionality. To better ensure consistent and up-to-date data, a Transaction Manager can be deployed. Examples of transaction managers include Apache Hudi, Databricks Delta Lake, and Apache Iceberg.
- The Data Sources - data lakes in S3 and HDFS, and SQL / NoSQL data sources like RDS / MySQL, RDS / PostgreSQL, Elasticsearch, etc.
The Open Analytics stack is a non-proprietary architecture - the managers, catalogs, query engine and underlying storage engines all remain independent. This ecosystem is entirely vendor-neutral, depending on your choices, making it easy to avoid lock-in with any particular vendor’s technology. For many, this is a hugely important quality.
To get started with PrestoDB, you can download the Presto tarball or you can choose to use a Docker sandbox, or an AMI - the documentation will help you get started. There will be some configuration effort, particularly around the configuration of data sources and the required metastore. Or plump for zero-deployment effort by choosing a fully-managed Presto cloud service.
PrestoDB really shines for these five use cases:
- Interactive, ad hoc analytics
- Federated querying
- SQL-based ETL
- Reporting and dashboarding
- Data lake analytics
If you are interested in learning more about PrestoDB and its benefits as the defacto standard query engine, do check out its documentation.
Opinions expressed by DZone contributors are their own.