DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
The Latest "Software Integration: The Intersection of APIs, Microservices, and Cloud-Based Systems" Trend Report
Get the report
  1. DZone
  2. Data Engineering
  3. Databases
  4. Tutorial: How to Run SQL Queries With Presto on Amazon Redshift

Tutorial: How to Run SQL Queries With Presto on Amazon Redshift

Learn how to run SQL queries with Presto on Amazon Redshift to get analytics on the Redshift data warehouse.

Rohan Pednekar user avatar by
Rohan Pednekar
·
Aug. 23, 21 · Tutorial
Like (7)
Save
Tweet
Share
5.80K Views

Join the DZone community and get the full member experience.

Join For Free

Presto has evolved into a unified SQL engine on top of cloud data lakes for both interactive queries as well as batch workloads with multiple data sources. This tutorial will show how to run SQL queries with Presto (running on Kubernetes) on AWS Redshift.

Presto's Redshift connector allows querying the data stored in an external Amazon Redshift cluster. This can be used to join data between different systems like Redshift and Hive, or between two different Redshift clusters. 

Step 1: Setup a Presto Cluster With Kubernetes 

Set up your own Presto cluster on Kubernetes using these instructions or you can use Ahana’s managed service for Presto. 

Step 2: Setup an Amazon Redshift Cluster

Create an Amazon Redshift cluster from AWS Console and make sure it’s up and running with the dataset and tables.

Below screen shows Amazon Redshift cluster – “redshift-presto-demo” 

Amazon Redshift cluster

Further, JDBC URL from Cluster is required to set up a redshift connector with Presto.

You can skip this section if you want to use your existing Redshift cluster, just make sure your redshift cluster is accessible from Presto because AWS services are secure by default. So even if you have created your Amazon Redshift cluster in a public VPC, the security group assigned to the target Redshift cluster can prevent inbound connections to the database cluster. In simple words, the Security Group settings of the Redshift database play a role of a firewall and prevent inbound database connections over port 5439. Find the assigned Security Group and check its Inbound rules. 

If your Presto Compute Plane VPC and data sources are in a different VPC then you need to configure a VPC peering connection.

Step 3: Configure Presto Catalog for Amazon Redshift Connector

Essentially, to configure the Redshift connector, create a catalog properties file in, etc/catalog named, for example, redshift.properties, to mount the Redshift connector as the redshift catalog. Create the file with the following contents, replacing the connection properties as appropriate for your setup:

connection-password=secret
connector.name=redshift
connection-url=jdbc:postgresql://example.net:5439/database
connection-user=root


This is how my catalog properties look:

 my_redshift.properties: |
      connector.name=redshift   
      connection-user=awsuser
      connection-password=admin1234 
connection-url=jdbc:postgresql://redshift-presto-demo.us.redshift.amazonaws.com:5439/dev


Step 4: Check for available datasets, schemas, and tables, etc and run SQL queries with Presto Client to access the Redshift database

After successfully database connection with Amazon Redshift, You can connect to Presto CLI and run the following queries and make sure that the Redshift catalog gets picked up and perform show schemas and show tables to understand available data. 

$./presto-cli.jar --server https://<presto.cluster.url> --catalog bigquery --schema <schema_name> --user <presto_username> --password


IN the below example you can see a new catalog for Redshift Database got initiated called “my_redshift. ”

presto> show catalogs;
   Catalog   
-------------
 ahana_hive  
 jmx         
 my_redshift 
 system      
 tpcds       
 tpch        
(6 rows)
 
Query 20210810_173543_00209_krtkp, FINISHED, 2 nodes
Splits: 36 total, 36 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]


Further,  you can check all available schemas for your Amazon Redshift from Presto to work with.

presto> show schemas from my_redshift;
       Schema       
--------------------
 catalog_history    
 information_schema 
 pg_catalog         
 pg_internal        
 public             
(5 rows)
 
Query 20210810_174048_00210_krtkp, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0:01 [5 rows, 85B] [4 rows/s, 72B/s]


Here, I have used sample data that comes with the Redshift Cluster setup. I have chosen the schema “public” which is a part of the “dev” Redshift Database. 

presto> show tables from my_redshift.public;
  Table   
----------
 category 
 date     
 event    
 listing  
 sales    
 users    
 venue    
(7 rows)
 
Query 20210810_185448_00211_krtkp, FINISHED, 3 nodes
Splits: 36 total, 36 done (100.00%)
0:03 [7 rows, 151B] [2 rows/s, 56B/s]


Further, you can explore tables as “sales” in the below example.

presto> select * from my_redshift.public.sales LIMIT 2;
 salesid | listid | sellerid | buyerid | eventid | dateid | qtysold | pricepaid | commission |        saletime         
---------+--------+----------+---------+---------+--------+---------+-----------+------------+-------------------------
   33095 |  36572 |    30047 |     660 |    2903 |   1827 |       2 | 234.00    | 35.10      | 2008-01-01 01:41:06.000 
   88268 | 100813 |    45818 |     698 |    8649 |   1827 |       4 | 836.00    | 125.40     | 2007-12-31 23:26:20.000 
(2 rows)
 
Query 20210810_185527_00212_krtkp, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:03 [18.1K rows, 0B] [6.58K rows/s, 0B/s]


Following are some more complex queries you can run against sample data:

presto:public> -- Find top 10 buyers by quantity
            ->SELECT firstname, lastname, total_quantity 
            -> FROM   (SELECT buyerid, sum(qtysold) total_quantity
            ->         FROM  sales
            ->         GROUP BY buyerid
            ->         ORDER BY total_quantity desc limit 10) Q, users
            -> WHERE Q.buyerid = userid
            -> ORDER BY Q.total_quantity desc;
 firstname | lastname | total_quantity 
-----------+----------+----------------
 Jerry     | Nichols  |             67 
 Armando   | Lopez    |             64 
 Kameko    | Bowman   |             64 
 Kellie    | Savage   |             63 
 Belle     | Foreman  |             60 
 Penelope  | Merritt  |             60 
 Kadeem    | Blair    |             60 
 Rhona     | Sweet    |             60 
 Deborah   | Barber   |             60 
 Herrod    | Sparks   |             60 
(10 rows)
 
Query 20210810_185909_00217_krtkp, FINISHED, 2 nodes
Splits: 214 total, 214 done (100.00%)
0:10 [222K rows, 0B] [22.4K rows/s, 0B/s]
 
presto:public> -- Find events in the 99.9 percentile in terms of all time gross sales.
            -> SELECT eventname, total_price 
            -> FROM  (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) as percentile 
            ->        FROM (SELECT eventid, sum(pricepaid) total_price
            ->              FROM   sales
            ->              GROUP BY eventid)) Q, event E
            ->        WHERE Q.eventid = E.eventid
            ->        AND percentile = 1
            -> ORDER BY total_price desc;
      eventname       | total_price 
----------------------+-------------
 Adriana Lecouvreur   | 51846.00    
 Janet Jackson        | 51049.00    
 Phantom of the Opera | 50301.00    
 The Little Mermaid   | 49956.00    
 Citizen Cope         | 49823.00    
 Sevendust            | 48020.00    
 Electra              | 47883.00    
 Mary Poppins         | 46780.00    
 Live                 | 46661.00    
(9 rows)
 
Query 20210810_185945_00218_krtkp, FINISHED, 2 nodes
Splits: 230 total, 230 done (100.00%)
0:12 [181K rows, 0B] [15.6K rows/s, 0B/s]


Step 5: Run SQL query to join data between different systems like Redshift and Hive

Another great use case of Presto is Data Federation. In this example, I will join the Apache Hive table with the Amazon Redshift table and run the JOIN query to access both tables from Presto. 

Here, I have two catalogs “ahana_hive” for Hive Database and “my_redshift” for Amazon Redshift and each database has my_redshift.public.users

 and ahana_hive.default.customer table respectively within their schema.

Following the very simple query to join these tables, the same way you join two tables from the same database.

presto> show catalogs;
presto> select * from ahana_hive.default.customer;
presto> select * from my_redshift.public.users;
presto> Select * from ahana_hive.default.customer x  join my_redshift.public.users y on x.nationkey = y.userid;


And now you should be able to query Amazon Redshift using Presto!

Database connection Redshift (theory) Amazon Redshift Presto (SQL query engine) cluster sql

Published at DZone with permission of Rohan Pednekar. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Browser Engines: The Crux of Cross-Browser Compatibility
  • Spring Boot vs Eclipse MicroProfile: Resident Set Size (RSS) and Time to First Request (TFR) Comparative
  • Key Elements of Site Reliability Engineering (SRE)
  • ClickHouse: A Blazingly Fast DBMS With Full SQL Join Support

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: