
Configuring RaptorX Multi-Level Caching With Presto

In this tutorial, we'll show you how to get multi-level caching for Presto with RaptorX, Meta's multi-level cache built for performance.

By Rohan Pednekar · Sep. 29, 23 · Tutorial

RaptorX and Presto: Background and Context

Meta introduced a multi-level cache at PrestoCon 2021, the open-source community conference for Presto. Code-named the “RaptorX Project,” it aims to make Presto 10x faster on Meta-scale petabyte workloads. This is a unique and very powerful feature available only in PrestoDB, not in other versions or forks of the Presto project.

Presto is the open-source SQL query engine for data analytics and the data lakehouse. It lets you scale storage and compute independently and reduce costs. However, storage-compute disaggregation also brings new challenges for query latency, because scanning huge amounts of data between the storage tier and the compute tier is I/O-bound over the network. As with any database, optimized I/O is a critical concern for Presto. When possible, the priority is to perform no I/O at all, which makes memory utilization and caching structures of utmost importance.

Let’s walk through the normal workflow of the Presto-Hive connector:

  1. During a read operation, the planner sends a request to the metastore for metadata (partition information).
  2. The scheduler sends requests to remote storage to get a list of files and schedules them.
  3. The worker node first receives the list of files from the scheduler, then sends requests to remote storage to open each file and read the file footer.
  4. Based on the footer, Presto determines which data blocks or chunks need to be read from remote storage.
  5. Once the workers have read them, Presto performs the join or aggregation computation on the leaf worker nodes and shuffles the results back to return them to the client.

That adds up to a lot of RPC calls: not just to the Hive Metastore for partition information, but also to remote storage to list files, schedule them, open them, and finally retrieve and read the data files. Each of these I/O paths in the Hive connector is a bottleneck for query performance, which is why RaptorX builds a multi-layer cache intelligently, so you can maximize the cache hit rate and boost query performance.

RaptorX introduces a total of five types of caches and a scheduler. This cache system is only applicable to Hive connectors.

| Multi-Layer Cache       | Type       | Affinity Scheduling | Benefits                                                           |
|-------------------------|------------|---------------------|--------------------------------------------------------------------|
| Data IO                 | Local disk | Required            | Reduced query latency                                              |
| Intermediate Result Set | Local disk | Required            | Reduced query latency and CPU utilization for aggregation queries  |
| File Metadata           | In-memory  | Required            | Reduced CPU utilization and query latency                          |
| Metastore               | In-memory  | NA                  | Reduced query latency                                              |
| File List               | In-memory  | NA                  | Reduced query latency                                              |

Table: Summary of Presto Multi-Layer Cache Implementation

We'll explain how you can configure and test various layers of RaptorX cache in your Presto cluster.

#1 Data (IO) Cache

This cache uses a library built on the Alluxio LocalCacheFileSystem, an implementation of the HDFS interface. The Alluxio data cache is a local disk cache on the worker node that stores data read from files (ORC, Parquet, etc.) on remote storage. The default page size on disk is 1MB, evictions follow an LRU policy, and local disks are required to enable this cache.

To enable this cache, update the worker configuration in etc/catalog/<catalog-name>.properties with the properties below:

Properties
 
cache.enabled=true 
cache.type=ALLUXIO 
# Size this based on your requirements
cache.alluxio.max-cache-size=150GB 
cache.base-directory=file:///mnt/disk1/cache


Also, add the following Alluxio property to the coordinator and worker etc/jvm.config so that all metrics related to the Alluxio cache are emitted: -Dalluxio.user.app.id=presto
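
For reference, etc/jvm.config is just a list of JVM options, one per line, so the flag is simply appended alongside whatever options your deployment already defines:

-Dalluxio.user.app.id=presto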

#2 Fragment Result Set Cache

This is an intermediate result set cache that stores partially computed result sets on the worker’s local SSD drive. It prevents duplicated computation across multiple queries, which improves query performance and decreases CPU usage.

Add the following properties to config.properties:

Properties
 
fragment-result-cache.enabled=true 
fragment-result-cache.max-cached-entries=1000000 
fragment-result-cache.base-directory=file:///data/presto-cache/2/fragmentcache 
fragment-result-cache.cache-ttl=24h


#3 Metastore Cache

The Presto coordinator caches table metadata (schema, partition list, and partition info) to avoid long getPartitions calls to the metastore. The cache is versioned so that the validity of cached metadata can be confirmed.

To enable the metastore cache, set the properties below in the catalog properties file (etc/catalog/<catalog-name>.properties):

Properties
 
hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000


#4 File List Cache

The Presto coordinator caches file lists from remote storage partition directories to avoid long listFile calls to remote storage. This is an in-memory cache on the coordinator.

Enable the file list cache by setting the properties below in etc/catalog/<catalog-name>.properties:

Properties
 
# List file cache
hive.file-status-cache-expire-time=24h 
hive.file-status-cache-size=100000000 
hive.file-status-cache-tables=*
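
If you prefer not to cache file listings for every table, hive.file-status-cache-tables also accepts a comma-separated list of tables instead of the * wildcard. A sketch using the example tables referenced by the verification queries later in this article (adjust the schema and table names to your own):

# Cache file listings only for selected tables
hive.file-status-cache-tables=default.customer_orc,default.customer_parquet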


#5 File Metadata Cache

This cache keeps open file descriptors and stripe/file footer information in worker memory. These pieces of data are the most frequently accessed when reading files. The cache is useful not just for decreasing query latency but also for reducing CPU utilization.

This is an in-memory cache and is suitable for the ORC and Parquet file formats.

For ORC, it caches the file tail (postscript, file footer, file metadata), stripe footers, and stripe streams (row indexes/bloom filters).

For Parquet, it caches the file- and block-level metadata.

To enable the file metadata cache, set the properties below in the catalog properties file (etc/catalog/<catalog-name>.properties):

Properties
 
# For ORC metadata cache: 
<catalog-name>.orc.file-tail-cache-enabled=true 
<catalog-name>.orc.file-tail-cache-size=100MB 
<catalog-name>.orc.file-tail-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-metadata-cache-enabled=true 
<catalog-name>.orc.stripe-footer-cache-size=100MB 
<catalog-name>.orc.stripe-footer-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-stream-cache-size=300MB 
<catalog-name>.orc.stripe-stream-cache-ttl-since-last-access=6h 

# For Parquet metadata cache: 
<catalog-name>.parquet.metadata-cache-enabled=true 
<catalog-name>.parquet.metadata-cache-size=100MB 
<catalog-name>.parquet.metadata-cache-ttl-since-last-access=6h


The <catalog-name> prefix in the above configuration should be replaced with the name of the catalog you are configuring. For example, if the catalog properties file is named ahana_hive.properties, then the prefix becomes “ahana_hive”.
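
As a concrete illustration, here is what the ORC portion of that configuration would look like in ahana_hive.properties after the substitution (the same pattern applies to the Parquet properties):

ahana_hive.orc.file-tail-cache-enabled=true
ahana_hive.orc.file-tail-cache-size=100MB
ahana_hive.orc.file-tail-cache-ttl-since-last-access=6h
ahana_hive.orc.stripe-metadata-cache-enabled=true
ahana_hive.orc.stripe-footer-cache-size=100MB
ahana_hive.orc.stripe-footer-cache-ttl-since-last-access=6h
ahana_hive.orc.stripe-stream-cache-size=300MB
ahana_hive.orc.stripe-stream-cache-ttl-since-last-access=6h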

#6 Affinity Scheduler

With affinity scheduling, the Presto coordinator schedules requests that process a given data file to the same Presto worker node to maximize cache hits. Consistently sending requests for the same data to the same worker node means fewer remote calls to retrieve data.

Data caching is not supported with random node scheduling, so this property must be enabled for the RaptorX data IO, fragment result, and file metadata caches to work.

To enable the affinity scheduler, set the property below in the catalog properties file (etc/catalog/<catalog-name>.properties):

hive.node-selection-strategy=SOFT_AFFINITY
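
Putting the catalog-level pieces together, a hypothetical ahana_hive.properties that enables the data IO cache, affinity scheduling, file list cache, metastore cache, and file metadata cache might combine the settings from the sections above roughly as sketched below. The connector.name, metastore URI, and other required connector settings are omitted; sizes and TTLs are the example values from earlier and should be tuned to your cluster, and the same file is typically deployed to both the coordinator and the workers:

# Data IO cache (worker local disk)
cache.enabled=true
cache.type=ALLUXIO
cache.alluxio.max-cache-size=150GB
cache.base-directory=file:///mnt/disk1/cache

# Affinity scheduling (required for the data IO, fragment result, and file metadata caches)
hive.node-selection-strategy=SOFT_AFFINITY

# File list cache (coordinator in-memory)
hive.file-status-cache-expire-time=24h
hive.file-status-cache-size=100000000
hive.file-status-cache-tables=*

# Metastore cache (coordinator in-memory)
hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000

# File metadata cache (worker in-memory); the prefix matches the catalog name
ahana_hive.orc.file-tail-cache-enabled=true
ahana_hive.parquet.metadata-cache-enabled=true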

How Can You Test or Debug Your RaptorX Cache Setup With JMX Metrics?

Each section below lists the queries to run and the JMX queries that verify cache usage.

Note: If your catalog is not named ‘ahana_hive,’ you will need to change the table names to verify the cache usage. Substitute ahana_hive with your catalog name.
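
The verification queries below also assume example tables named customer_orc and customer_parquet exist in the ahana_hive.default schema. If you need test tables, a sketch using the built-in TPC-H connector (assuming a tpch catalog is configured in your cluster) could look like this:

-- Hypothetical test tables built from the TPC-H connector's customer table
CREATE TABLE ahana_hive.default.customer_orc
WITH (format = 'ORC')
AS SELECT * FROM tpch.tiny.customer;

CREATE TABLE ahana_hive.default.customer_parquet
WITH (format = 'PARQUET')
AS SELECT * FROM tpch.tiny.customer;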

Data IO Cache

Queries to trigger Data IO cache usage:

SQL
 
USE ahana_hive.default; 
SELECT count(*) from customer_orc group by nationkey; 
SELECT count(*) from customer_orc group by nationkey;


Queries To Verify Data IO Cache Usage

-- Cache hit rate.
SELECT * from 
jmx.current."com.facebook.alluxio:name=client.cachehitrate.presto,type=gauges";

-- Bytes read from the cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesreadcache.presto,type=meters";

-- Bytes requested from cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesrequestedexternal.presto,type=meters";

-- Bytes written to cache on each node.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheBytesWrittenCache.presto,type=meters";

-- The number of cache pages (of size 1MB) currently on disk
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CachePages.presto,type=counters";

-- The amount of cache space available.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheSpaceAvailable.presto,type=gauges";

-- There are many other metrics tables that you can view using the below command.
SHOW TABLES FROM 
jmx.current like '%alluxio%';
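
If you want to check whether any individual worker’s cache is underperforming, the hit rate gauge above can be broken out per node. This sketch assumes the JMX connector surfaces the gauge’s value as a lowercase "value" column alongside the standard node column, which may vary by version:

-- Per-worker cache hit rate (one row per worker node)
SELECT node, "value" AS cache_hit_rate
FROM jmx.current."com.facebook.alluxio:name=client.cachehitrate.presto,type=gauges"
ORDER BY cache_hit_rate;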


Fragment Result Cache

An example of the query plan fragment that is eligible for having its results cached is shown below.

Fragment 1 [SOURCE] 
Output layout: [count_3] Output partitioning: SINGLE [] Stage Execution 
Strategy: UNGROUPED_EXECUTION 
- Aggregate(PARTIAL) => [count_3:bigint] count_3 := "presto.default.count"(*) 
- TableScan[TableHandle {connectorId='hive', 
connectorHandle='HiveTableHandle{schemaName=default, tableName=customer_orc, 
analyzePartitionValues=Optional.empty}', 
layout='Optional[default.customer_orc{}]'}, gr Estimates: {rows: 150000 (0B), 
cpu: 0.00, memory: 0.00, network: 0.00} LAYOUT: default.customer_orc{}


Queries To Trigger Fragment Result Cache Usage

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;


Query Fragment Result Set Cache JMX Metrics

-- All Fragment result set cache metrics like cachehit, cache entries, size, etc 
SELECT * FROM 
jmx.current."com.facebook.presto.operator:name=fragmentcachestats";


ORC Metadata Cache

Queries to trigger ORC cache usage:

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;


Query ORC Metadata cache JMX metrics:

-- File tail cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_orcfiletail,type=cachestatsmbean";

 -- Stripe footer cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripefooter,type=cachestatsmbean"; 

-- Stripe stream(Row index) cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripestream,type=cachestatsmbean";


Parquet Metadata Cache

Queries to trigger Parquet metadata cache:

SELECT count(*) from customer_parquet; 
SELECT count(*) from customer_parquet;


Query Parquet Metadata cache JMX metrics:

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_parquetmetadata,type=cachestatsmbean";


File List Cache

Query File List cache JMX metrics:

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive,type=cachingdirectorylister";


This should help you get started with RaptorX multi-level caching with Presto. If you have specific questions about your Presto deployment, head over to the Presto open-source community Slack, where you can get help.


Published at DZone with permission of Rohan Pednekar.

Opinions expressed by DZone contributors are their own.
