An Introduction to Presto

Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

by Pallavi Singh · Apr. 25, 17 · Tutorial

In today’s blog, I will introduce you to Presto, a new open-source distributed SQL query engine. It was designed by the people at Facebook for running SQL queries over Big Data (petabytes of data).

Quoting its formal definition:

“Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.”

Presto was created to enable interactive analytics that approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto is a distributed query engine that runs on a cluster of machines. A full setup includes a coordinator and multiple workers. Queries are submitted from a client such as the Presto CLI to the coordinator. The coordinator parses, analyzes, and plans the query execution, then distributes the processing to the workers.
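To make the coordinator/worker split more concrete, here is a toy Python sketch. It is not Presto's implementation: the function names (make_splits, worker_partial_sum, coordinator_average) are hypothetical, threads stand in for worker machines, and an in-memory list stands in for a table. It only illustrates the idea of planning work into splits, processing them in parallel, and merging partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def make_splits(rows, n_splits):
    """Coordinator step (toy): divide the input rows into roughly equal splits."""
    return [rows[i::n_splits] for i in range(n_splits)]


def worker_partial_sum(split):
    """Worker step (toy): compute a partial (sum, count) over one split."""
    return sum(split), len(split)


def coordinator_average(rows, n_workers=3):
    """Fan the splits out to workers, then merge the partial results."""
    splits = make_splits(rows, n_workers)
    # Threads are an assumption standing in for separate worker nodes.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_partial_sum, splits))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


if __name__ == "__main__":
    # Roughly what "SELECT avg(x) FROM t" would compute for x = 1..100.
    print(coordinator_average(list(range(1, 101))))
```

The point of the sketch is that each worker only returns a small partial result, so the coordinator's merge step stays cheap even when the splits themselves are large.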

[Figure: presto-overview.png]

Anyone working with terabytes or petabytes of data is likely to use tools that interact with Hadoop and HDFS. Presto was designed as an alternative to tools such as Hive or Pig, which query HDFS using pipelines of MapReduce jobs, but Presto is not limited to accessing HDFS. Presto can be, and has been, extended to operate over different kinds of data sources, including traditional relational databases and other data sources such as Cassandra.

Capabilities of Presto

  • Querying data where it resides, whether in Hive, Cassandra, relational databases, or even proprietary data stores.
  • Combining data from multiple sources in a single Presto query.
  • Faster response times, dispelling the idea that one must choose between “having fast analytics using an expensive commercial solution or using a slow free solution that requires excessive hardware.”
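As a rough illustration of combining sources in one query, the sketch below builds a SQL string that joins a Hive table with a Cassandra table. Presto addresses tables as catalog.schema.table; everything else here is an assumption made for the example: the catalog names (hive, cassandra), the schemas, tables, and columns, and the helper functions themselves are hypothetical.

```python
def qualified(catalog, schema, table):
    """Return a fully qualified table name in catalog.schema.table form."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query():
    """Build one SQL statement that joins tables from two different catalogs."""
    # Hypothetical tables: orders live in Hive, user profiles in Cassandra.
    orders = qualified("hive", "sales", "orders")
    users = qualified("cassandra", "app", "users")
    return (
        "SELECT u.country, count(*) AS order_count\n"
        f"FROM {orders} AS o\n"
        f"JOIN {users} AS u ON o.user_id = u.user_id\n"
        "GROUP BY u.country"
    )


if __name__ == "__main__":
    print(build_federated_query())
```

The resulting text could be pasted into a client such as the Presto CLI, provided catalogs with those names are actually configured on the cluster.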

Credibility

Facebook uses Presto daily to run more than 30,000 queries that, in total, scan over a petabyte per day across several internal data stores, including its 300PB data warehouse.

Connectors in Presto

Presto supports pluggable connectors that provide data for queries. Several connectors already exist, and Presto can also be extended with custom connectors. It supports the following connectors:

  • Hadoop/Hive (Apache Hadoop 1.x, Apache Hadoop 2.x, Cloudera CDH 4, Cloudera CDH 5).
  • Cassandra (Cassandra 2.x is required; this connector is completely independent of the Hive connector and only requires an existing Cassandra installation).
  • TPC-H (the connector dynamically generates data that can be used for experimenting with Presto).
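Connectors are typically enabled by placing one properties file per catalog in Presto's etc/catalog directory, each naming its connector via connector.name. The sketch below generates three such files. Treat the details as assumptions to check against the documentation for your Presto version: the connector name hive-hadoop2, the property keys hive.metastore.uri and cassandra.contact-points, and the example.com hosts are illustrative, and the helper functions are hypothetical.

```python
from pathlib import Path

# Catalog name -> properties. Host names are placeholders, not real endpoints.
CATALOGS = {
    "tpch": {"connector.name": "tpch"},
    "hive": {
        "connector.name": "hive-hadoop2",
        "hive.metastore.uri": "thrift://metastore.example.com:9083",
    },
    "cassandra": {
        "connector.name": "cassandra",
        "cassandra.contact-points": "cassandra1.example.com",
    },
}


def render_properties(props):
    """Render a dict as key=value lines, one property per line."""
    return "".join(f"{key}={value}\n" for key, value in props.items())


def write_catalogs(catalog_dir, catalogs=CATALOGS):
    """Write <name>.properties for each catalog and return the written paths."""
    out = Path(catalog_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, props in catalogs.items():
        path = out / f"{name}.properties"
        path.write_text(render_properties(props))
        written.append(path)
    return written


if __name__ == "__main__":
    for path in write_catalogs("etc/catalog"):
        print(path)
```

The file name becomes the catalog name used in queries, so tpch.properties would expose the generated TPC-H data under the tpch catalog.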

Before we go further in analyzing the tool's features, it is equally important to know what it is not capable of. This helps in determining its use cases and usability.

What Presto Is Not

Presto is not a general-purpose relational database. It is not a replacement for databases like MySQL, PostgreSQL, or Oracle. Presto is not designed to handle Online Transaction Processing (OLTP).

Competitors vs. Presto

Presto continues to lead in BI-type queries, and Spark leads performance-wise in large analytics queries. Presto scales better than Hive and Spark for concurrent dashboard queries. Production enterprise BI user bases may be on the order of hundreds or thousands of users, so support for concurrent query workloads is critical. Benchmarks show that Presto performed the best (that is, showed the least query degradation) as concurrent query workload increased, and showed the best results in user concurrency testing.

Another advantage of Presto over Spark and Impala is that it can be set up and ready to use in just a few minutes. Additionally, Presto works directly on files in S3, requiring no ETL transformations.

References

  • Presto documentation
  • Big Data Faceoff: Spark vs. Impala vs. Hive vs. Presto
  • New BI Performance Benchmark Reveals Strong Innovation Among Open-Source Projects
  • Impala vs. Spark vs. Presto

Published at DZone with permission of Pallavi Singh. See the original article here.
