To follow along with a PrestoDB instance with a local S3 storage running on docker, feel free to use https://github.com/njanakiev/presto-minio-docker, which covers a standalone docker setup.
Connectors
PrestoDB uses a variety of connectors to access data from different data sources. In this Refcard, you will see how to access data on HDFS, S3, and PostgreSQL. More on other connectors: https://prestodb.io/docs/current/connector.html.
TPC-H and TPC-DS Connectors
TPCH (http://www.tpc.org/tpch/) and TPCDS (http://www.tpc.org/tpcds/) connectors provide a set of schemas for the TPC Benchmark™ to measure the performance of complex decision support databases. They are generally popular when running benchmarks to evaluate big data systems.
To add the tpcds
connector, create the file, etc/catalog/tpcds.properties
:
For the tpch
connector, create the file, etc/catalog/tpch.properties
:
Those connectors generate the data automatically and offer various schemas with a different number of rows for each table — and are generally great for testing queries. TPCH will be used in some of the example queries later in the Refcard.
More on these connectors:
Hive Connector
For the Hive Connector, you need a Hive metastore running to connect either to Hadoop HDFS or an S3 storage, which is beyond the scope of this Refcard (see Additional Resources to learn more). When using HDFS, you need to configure tables with Hive. With S3, you can also specify tables for certain file types like Parquet with PrestoDB alone without creating an additional table in Hive.
The following file types are supported for the Hive Connector:
- ORC
- Parquet
- Avro
- RCFile
- SequenceFile
- JSON
- Text
Create the file, etc/cataloc/hive.properties
:
Replace S3_ENDPOINT
, S3_ACCESS_KEY
, and S3_SECRET_KEY
with the values you specified in the metastore-site.xml
or hive-site.xml
configuration, as they must be identical. Then, change the HIVE_METASTORE_ENDPOINT
to the endpoint of the hive metastore.
Enable S3 Select Pushdown to allow predicate pushdown by adding the property, hive.s3select-pushdown.enabled=true
. If you are using multiple Hive clusters, you can add those by creating new configuration files in etc/catalog
for each endpoint. This enables you to mix HDFS, AWS Glue, other S3 external object stores, and even local S3 object stores.
Create a schema to a specific bucket:
Note that this bucket must exist beforehand. You can create it in the S3 UI or with a tool like s3cmd
(https://s3tools.org/s3cmd) by running:
Now, create a new Hive table:
Query the table:
Drop the metadata of the table (the data itself is not deleted):
Drop a schema:
More on configuration: https://prestodb.io/docs/current/connector/hive.html
PostgreSQL Connector
PostgreSQL is a popular and highly advanced, open-source RDBMS that can often be suitable for various use cases. This PrestoDB connector allows you to query and create tables in an external PostgreSQL database.
The configuration for this connector must be added to the file, etc/catalog/postgresql.properties
:
To see all schemas in PostgreSQL:
To see all tables in a schema in PostgreSQL:
Further, to describe the columns in a table:
To query this table:
An example of how to create a table from another table in PostgreSQL:
Note that CREATE TABLE
by itself without AS
is not supported. The same goes for these SQL statements:
DELETE
ALTER TABLE
GRANT
/REVOKE
SHOW GRANTS
/SHOW ROLES
/SHOW ROLE GRANTS
More information: https://prestodb.io/docs/current/connector/postgresql.html.
Functions and Operators
Explore catalogs, schemas, and tables with:
SHOW CATALOGS;
|
List all catalogs |
SHOW SCHEMAS IN catalog_name;
|
List all schemas in a catalog |
SHOW TABLES IN catalog_name.schema_name;
|
List all tables in a schema |
DESCRIBE catalog_name.schema_name.table_name;
|
List the columns in a table along with their data type and other attributes (alias for SHOW COLUMNS FROM table ) |
PrestoDB offers a wide variety of functions and operators. You can apply logical operators with AND
, OR
, and NOT
, as well as comparisons with <
, >
, and =
, among others. There is a large set of commonly used mathematical functions and operators like abs(x)
, ceiling(x)
, floor(x)
, sqrt(x)
, sin(x)
, cos(x)
, tan(x)
, and random()
, among many others. For more functions: https://prestodb.io/docs/current/functions/math.html.
The same goes for common string and date functions and operators that you might want to use. Find the full list at https://prestodb.io/docs/current/functions/string.html and https://prestodb.io/docs/current/functions/datetime.html, respectively.
When working with analytical queries, a common use case is to run aggregation over groups or whole columns. PrestoDB has many such functions that cover almost any use case. The following simple example returns the average account balance for each marketsegment
in the TPCH data set:
Common aggregation functions are count()
, avg()
, sum(x)
, min(x)
, max(x)
, and stddev(x)
, among many more advanced aggregations and statistical functions. For more: https://prestodb.io/docs/current/functions/aggregate.html
PrestoDB also offers the well-known functionality for window functions using the OVER
clause. A window function uses values from one or multiple rows in a table to return a value for each row. A common example query for a rolling average over seven days:
Another example is to calculate a sum over order priority for each row:
Read more about window functions: https://prestodb.io/docs/current/functions/window.html
Also, the PostGIS extension includes a large set of geospatial functions and aggregations that should be familiar to geospatial professionals coming from PostgreSQL. An example of how to aggregate data for a specific region:
A full list for the geospatial functionality: https://prestodb.io/docs/current/functions/geospatial.html
PrestoDB is quite extensive in its functionalities, and this section should serve as a quick overview of what is available. Other functions and operators include:
- Bitwise Functions
- Binary Functions and Operators
- Regular Expression Functions
- JSON Functions and Operators
- URL, IP, and HyperLogLog Functions
- Lambda Expressions
- Array and Map Functions and Operators
The complete list of functions and operators: https://prestodb.io/docs/current/functions.html
Query Optimization
For your query to be executed on the various data sources, it requires a few steps from the initial SQL statement to the resulting query plan — where the query optimizer comes into play. After parsing the SQL statement into a syntax tree and later into a logical plan, the query optimizer takes care in creating an efficient execution strategy chosen among many possible strategies.
PrestoDB uses two optimizers. The Rule-Based Optimizer (RBO) applies filters to prune irrelevant data and uses hash joins to avoid full cartesian joins. This includes strategies such as predicate pushdown, limit pushdown, column pruning, and decorrelation. Next, it uses a Cost-Based Optimizer (CBO) continuing from the previous optimization. Here it uses statistics of the table (e.g., number of distinct values, number of null values, distributions of column data) to optimize queries and reduce I/O and network overhead.
You can see available statistics in your tables using these commands:
SHOW STATS FOR table_name;
|
Approximated statistics for the named table |
SHOW STATS FOR ( SELECT ... );
|
Approximated statistics for the query result |
To see the cost-based analysis of a query, you can use the EXPLAIN and the EXPLAIN ANALYZE keywords:
EXPLAIN [VERBOSE] SELECT ...
|
Execute statement and show the distributed execution plan with the cost of each operation. |
EXPLAIN ANALYZE [VERBOSE] SELECT ...
|
Execute statement and show the distributed execution plan with the cost and duration of each operation. |
An example of using EXPLAIN
on a simple SELECT
statement with the LIMIT
clause, showing the calculated cost and expected number of rows:
The same query with EXPLAIN ANALYZE
, showing the distributed execution plan, including the duration and cost for each stage:
You can also add the VERBOSE
option to get more detailed information and low-level statistics. For more on cost in EXPLAIN
and cost-based optimizations, visit https://prestodb.io/docs/current/optimizer/cost-in-explain.html and https://prestodb.io/docs/current/optimizer/cost-based-optimizations.html, respectively.