Running Presto on Kubernetes With Ahana Cloud and AWS EKS
Data platform teams are increasingly using open source SQL engine PrestoDB for analytics. Here's how you can get started with PrestoDB on K8s and AWS.
The need for data engineers and analysts to run interactive, ad hoc analytics on large amounts of data continues to grow rapidly. Data platform teams are increasingly using PrestoDB, a federated SQL query engine, to run such analytics across a wide range of data lakes and databases in place, without the need to move data.
In this post, we will explore the following:
The requirements that companies have for self-service ad hoc analytics on data stored in AWS
How Presto, an open source distributed SQL engine, answers many of these requirements
How Ahana Cloud, a Presto-as-a-service built for AWS using Amazon EKS, Amazon ECS, and other AWS services, enables platform teams to provide self-service analytics for their organizations effortlessly.
1. Self-Service SQL Analytics Requirements
As enterprises rely on more AWS services as well as purpose-built databases installed in the cloud, data and metadata are spread very widely. Platform teams have resorted to heavy data pipelining, moving data around across multiple platforms, and in some cases, even creating multiple copies of the data to make the data more accessible. In addition, self-service analytics requires platform engineers to integrate many business intelligence, reporting tools, data science notebooks, and data visualization tools across every data source used.
The obvious downsides are added latency (consumers must wait longer for those data pipelines to complete and for their tools to be connected) and added cost, since duplicate data consumes additional storage and data movement burns compute cycles. All of those platform engineering, management, and monitoring tasks add up. Given this complexity, platform teams are looking to simplify their approach, and we often see the following requirements from users:
Query data wherever it lives. While some level of data transformation, cleansing, and wrangling will always be required, users want to eliminate pipelines that simply move data around and remove unnecessary duplicates, with the added bonus of reducing data store proliferation.
An ANSI SQL engine that works with the widest possible range of tools and data sources, with pre-integrated connectors that are ready to use.
Ability to query any data in any form, including relational and non-relational sources and object stores, in any file format (such as JSON, Parquet, ORC, RCFile, or CSV flat files), as well as Kafka streams.
Low-latency querying for ad hoc analysis. With the rise of data-driven decision-making in the enterprise, users expect query results in seconds or minutes, not hours.
Ability to deal with data of any size, with practically unlimited scalability.
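To make these requirements concrete, here is a minimal sketch of the kind of federated ANSI SQL query such an engine should support. The catalog, schema, and table names (`hive.weblogs.events`, `mysql.crm.customers`) are hypothetical placeholders:

```sql
-- Hypothetical federated query: join clickstream data in an S3 data lake
-- (exposed through a "hive" catalog) with customer records in MySQL
-- (exposed through a "mysql" catalog), in place, with no data movement.
SELECT c.customer_name,
       count(*) AS page_views
FROM hive.weblogs.events e
JOIN mysql.crm.customers c
  ON e.customer_id = c.customer_id
WHERE e.event_date >= DATE '2021-01-01'
GROUP BY c.customer_name
ORDER BY page_views DESC
LIMIT 10;
```

The two-part catalog prefix (`hive.`, `mysql.`) is how Presto addresses different data sources from a single query.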
2. Presto, an Open Source Distributed SQL Engine
Presto is one of the fastest-growing open source projects in data analytics. It has a distributed engine with a highly flexible, pluggable architecture. This makes it a great technology to become the core engine for self-service analytics since it meets many of the requirements that enterprises have around self-service analytics.
Presto allows for in-place querying, meaning you don’t re-ingest your data into Presto; data stays where it is. In addition, its federated architecture means you can access data from multiple sources, allowing for analytics across your entire organization. Presto is ANSI SQL compatible and provides standard connectivity via JDBC drivers used by popular BI tools as well as notebooks. It can query data in multiple formats, including data stored in Amazon S3 and data lakes in a variety of formats like JSON, Apache Parquet, and Apache ORC. Finally, Presto is an in-memory engine and, unlike Apache Hive and other batch engines, provides low-latency querying.
Figure 1 Presto, an open source distributed SQL engine
A Presto deployment has one coordinator and multiple workers. The Presto coordinator is the server that is responsible for parsing statements, planning queries, taking care of security and workload management, and distributing processing to Presto worker nodes. It is the “brain” of a Presto deployment and is also the node to which a client connects to submit queries for execution. Every Presto installation must have a Presto coordinator alongside one or more Presto workers.
A Presto worker is responsible for executing tasks and processing data. Worker nodes fetch data from the source via Presto’s connectors and exchange intermediate data with each other. The coordinator is responsible for fetching results from the workers and returning the final results to the client. Learn more in the docs.
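At a configuration level, the coordinator/worker split described above maps to a few properties in each node's `etc/config.properties`. Below is a minimal sketch for a self-managed cluster, with a placeholder coordinator hostname; memory and JVM settings are omitted for brevity. (Ahana Cloud, described later, handles this configuration for you.)

```properties
# Coordinator's etc/config.properties (hostname and port are placeholders)
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://coordinator.example.com:8080

# Worker's etc/config.properties
coordinator=false
http-server.http.port=8080
discovery.uri=http://coordinator.example.com:8080
```

Workers register with the coordinator through the discovery URI, which is how the coordinator knows where to distribute tasks.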
Figure 2 Presto Architecture
That said, Presto is still a complicated distributed data system that requires extensive configuration and integration to get started, followed by rigorous management once deployed. This tends to make Presto deployments expensive, and while some companies, particularly large Internet ones, can absorb that cost, for most organizations enabling self-service SQL analytics across many data sources, both data lakes and databases, has been largely out of reach.
The good news is that Presto’s fundamental design, with storage and compute separated, makes it a natural fit for containerized deployment on Amazon’s managed Kubernetes service (Amazon EKS), and it benefits from the general flexibility and elasticity that AWS provides.
3. Introducing Ahana Cloud for Presto, Natively Built on Amazon EKS
Ahana Cloud for Presto is a cloud-native managed service built specifically for AWS that allows cloud and data platform teams to provide self-service SQL analytics for their organization’s analysts and scientists. Ahana Cloud for Presto runs containerized on both Amazon ECS and Amazon EKS and is easily procured with Pay-As-You-Go (PAYG) pricing in AWS Marketplace.
This industry-first Presto-as-a-Service offering is designed to simplify the deployment, management, and integration of Presto with data catalogs, databases, and data lakes on AWS. Ahana Cloud for Presto is deployed within your AWS account, giving you complete control by taking the compute to where the data lives. In addition, it gives you visibility into the environment and the clusters.
Ahana is a member of the Presto Foundation. Ahana’s mission is to simplify interactive analytics and to foster and evangelize the growth of the Presto community.
Ahana Cloud Overview
Ahana Cloud for Presto simplifies and unifies data analytics so you can query data directly in place across a range of data sources on AWS, including Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS) for MySQL, Amazon RDS for PostgreSQL, as well as data catalogs like Hive Metastore and AWS Glue, without the need to move or copy the data.
Ahana Cloud for Presto provides the following to get you going quickly with Presto:
- Easy-to-use Ahana console in the control plane for creating, deploying, and managing a multi-cluster compute plane, following the emerging best practice of in-virtual private cloud (VPC) deployment on AWS
- Support for Amazon S3, Amazon RDS for MySQL, Amazon RDS for PostgreSQL, and others
- Built-in Ahana-managed Hive Metastore that manages metadata for data stored in Amazon S3 data lakes
- Support for user-managed Hive Metastores and AWS Glue
- Range of security capabilities, including a fully protected compute plane
- Cloud-native highly scalable and available containerized environment deployed on Amazon EKS
- Integration with any business intelligence and dashboarding tool or data science notebook
Figure 3 shows an overview of the Ahana Cloud Control Plane and Compute Plane
We have worked closely with data engineers on AWS, like Cal Mitchell at TimeTrade. He has seen how Ahana Cloud for Presto on AWS saves hours and days across the configuration, deployment, and operations lifecycle. An easy-to-use solution, it lets him build features instead of managing infrastructure, giving him the best of both worlds: the benefits of Presto with minimal operational complexity.
Reference Architecture
The Ahana architecture includes two core components: the Ahana Control Plane and the Ahana Compute Plane.
Figure 4 shows the Ahana Cloud reference architecture
Ahana’s Control Plane runs in Ahana’s VPC and includes the Ahana SaaS console, from which you can create, deploy, resize, stop, start, and terminate Presto clusters.
Ahana’s Compute Plane runs in the user’s VPC. Presto clusters are created there by the Ahana SaaS console using cross-account roles, Amazon VPC, and Amazon EKS. Each compute plane includes an Apache Superset instance and, optionally, a Hive Metastore catalog with an S3 bucket. All of these resources are containerized on Amazon EKS in the user’s AWS account.
Getting Started
Prerequisites:
- AWS Account
- Sign up for Ahana Cloud for Presto
The Ahana SaaS console is a multi-tenant application that allows users to log in to the Ahana Management Console using Amazon Cognito for sign-on.
Once registered, users can deploy Ahana’s Compute Plane in their own account using Amazon’s recommended approach of cross-account access via external ID.
Once configured, Ahana’s Compute Plane is deployed once per tenant, in-VPC, on Amazon EKS. Amazon EKS is a fully managed Kubernetes service that provides high availability and security and is certified Kubernetes-conformant. Kubernetes is open source software that lets you deploy and manage containerized applications at scale.
Users create and manage Presto clusters via the Ahana console running in the Control Plane, which then gets deployed in EKS. Each cluster is created within a node group to benefit from the powerful features EKS provides, in particular high availability. EKS runs the Kubernetes management infrastructure across multiple AWS Availability Zones, automatically detects and replaces unhealthy nodes, and provides on-demand, zero downtime upgrades and patching.
Each Presto cluster comes pre-integrated with a Hive Metastore to store metadata for schemas and tables generated via Presto as well as an Amazon S3 data lake, where data inserted into tables gets stored.
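As a hedged sketch of what working with this pre-integrated catalog looks like, the SQL below creates a schema and a Parquet table through the Hive Metastore and inserts a row that lands in the S3 data lake. The catalog name (`ahana_hive`) and object names are illustrative placeholders; use the names shown in your own Ahana console:

```sql
-- Hypothetical catalog/schema/table names; substitute your own.
CREATE SCHEMA ahana_hive.demo;

CREATE TABLE ahana_hive.demo.events (
    event_id   bigint,
    event_date date,
    payload    varchar
)
WITH (format = 'PARQUET');

-- The inserted data is written to the pre-integrated S3 bucket.
INSERT INTO ahana_hive.demo.events
VALUES (1, DATE '2021-01-01', 'hello');
```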
In addition to the pre-integrated catalog and Amazon S3 bucket, users can also attach external Hive Metastores or AWS Glue catalogs pre-populated with metadata for structured data stored in Amazon S3, as well as databases running on Amazon RDS for MySQL or PostgreSQL.
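For reference, in a self-managed Presto deployment an external database source such as Amazon RDS for MySQL is attached through a catalog properties file; Ahana Cloud collects the equivalent settings through its console instead. A minimal sketch, with a placeholder endpoint and credentials:

```properties
# Hypothetical etc/catalog/mysql.properties (all values are placeholders)
connector.name=mysql
connection-url=jdbc:mysql://mydb.xxxxxxxx.us-east-1.rds.amazonaws.com:3306
connection-user=presto_user
connection-password=<secret>
```

The file name (minus `.properties`) becomes the catalog name used in queries, e.g. `mysql.crm.customers`.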
Common Use Cases
Presto is used for a wide range of analytical use cases and is particularly good for interactive and ad hoc querying. The most common use cases include:
Ad hoc querying: Use SQL to run ad hoc queries whenever you want, wherever your data resides.
Reporting & dashboarding: Query data across multiple sources to build reports and dashboards for self-service business intelligence.
Transformation using SQL (ETL): Aggregate terabytes of data across multiple data sources and run efficient ETL queries against that data with Presto.
Data lake analytics: Query data directly on a data lake without the need for transformation.
Federated querying across multiple data sources: Query data across many different data sources including databases, data lakes, lake houses, on-premises or in the cloud.
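For instance, the transformation (ETL) use case often takes the form of a CREATE TABLE AS SELECT (CTAS) statement that aggregates raw data and writes the result back to the data lake. A hedged sketch with hypothetical catalog and table names:

```sql
-- Hypothetical CTAS: aggregate raw events from a data lake into a
-- daily summary table, written back to S3 as Parquet.
CREATE TABLE hive.analytics.daily_summary
WITH (format = 'PARQUET') AS
SELECT event_date,
       count(*)                AS events,
       count(DISTINCT user_id) AS unique_users
FROM hive.raw.events
GROUP BY event_date;
```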
A real-world example is Uber, which relies on Presto to provide real-time analytics for various offers, such as offering a $10 ride for $5 in a certain geographic region on a certain day, or predicting the fastest routes for riders. Presto gives them the scalability and performance to query the massive amounts of data stored in HDFS and drive real-time business decisions, offers, and more.
Conclusion
Unlike a traditional data warehouse, which vertically integrates storage and compute, Presto is only the compute portion: a SQL query execution engine. The advantage of this decoupled storage model is that Presto can provide a single federated view of all of your data, no matter where it resides.
Presto provides its scalability and performance by splitting a query into many smaller operations that can be performed in parallel, redistributing data across the cluster as needed to perform joins, group-bys, and orderings of data sets.
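You can see this splitting for yourself with Presto's `EXPLAIN (TYPE DISTRIBUTED)`, which prints the plan fragments (stages) that run in parallel across workers; the table name below is hypothetical:

```sql
-- Shows the distributed plan: the fragments Presto will run in parallel
-- and where data is repartitioned for the GROUP BY.
EXPLAIN (TYPE DISTRIBUTED)
SELECT region, count(*)
FROM hive.sales.orders
GROUP BY region;
```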
As your data set grows, you can grow your Presto cluster in order to maintain the same expected runtimes. Such performance, combined with the flexibility to query virtually any data source, can help empower your business to get more value from your data than ever before—all while keeping the data where it is and avoiding expensive transfers and engineering time to consolidate your data into one place for analysis. Ahana Cloud for Presto takes care of deploying and managing your Presto clusters on AWS, including the tuning and integration required, and lets you get those insights faster.
You can watch the hands-on tutorial in the AWS & Ahana webinar.
Opinions expressed by DZone contributors are their own.