
Configuring RaptorX Multi-Level Caching With Presto

In this tutorial, we'll show you how to get multi-level caching for Presto with RaptorX, Meta's multi-level cache built for performance.

By Rohan Pednekar · Sep. 29, 23 · Tutorial

RaptorX and Presto: Background and Context

Meta introduced a multi-level cache at PrestoCon 2021, the open-source community conference for Presto. Code-named the “RaptorX Project,” it aims to make Presto 10x faster on Meta-scale petabyte workloads. This is a unique and very powerful feature available only in PrestoDB, not in other versions or forks of the Presto project.

Presto is the open-source SQL query engine for data analytics and the data lakehouse. It lets you scale storage and compute independently and reduce costs. However, storage-compute disaggregation also brings new challenges for query latency, because scanning huge amounts of data between the storage tier and the compute tier is I/O-bound over the network. As with any database, optimized I/O is a critical concern for Presto. When possible, the priority is to perform no I/O at all, which makes memory utilization and caching structures of utmost importance.

Let’s walk through the normal workflow of the Presto-Hive connector:

  1. During a read operation, the planner sends a request to the metastore for metadata (partition information).
  2. The scheduler sends requests to remote storage to get a list of files and schedules them.
  3. The worker node first receives the list of files from the scheduler, then sends requests to remote storage to open each file and read the file footer.
  4. Based on the footer, Presto determines which data blocks or chunks need to be read from remote storage.
  5. Once the workers have read them, Presto performs the join or aggregation computation on the leaf worker nodes and shuffles the results back to return them to the client.

That adds up to a lot of RPC calls: not just to the Hive Metastore for partition information, but also to remote storage to list files, schedule them, open them, and finally retrieve and read the data files. Each of these I/O paths in the Hive connector is a bottleneck for query performance, which is why RaptorX builds a multi-layer cache intelligently, so you can maximize the cache hit rate and boost query performance.

RaptorX introduces a total of five types of caches and a scheduler. This cache system is only applicable to Hive connectors.

| Multi-Layer Cache       | Type       | Affinity Scheduling | Benefits                                                           |
|-------------------------|------------|---------------------|--------------------------------------------------------------------|
| Data IO                 | Local disk | Required            | Reduced query latency                                              |
| Intermediate Result Set | Local disk | Required            | Reduced query latency and CPU utilization for aggregation queries  |
| File Metadata           | In-memory  | Required            | Reduced CPU utilization and query latency                          |
| Metastore               | In-memory  | NA                  | Reduced query latency                                              |
| File List               | In-memory  | NA                  | Reduced query latency                                              |

Table: Summary of Presto Multi-Layer Cache Implementation

We'll explain how you can configure and test various layers of RaptorX cache in your Presto cluster.

#1 Data (IO) Cache

This cache uses a library built on the Alluxio LocalCacheFileSystem, an implementation of the HDFS interface. The Alluxio data cache is a local disk cache on the worker node that stores data read from files (ORC, Parquet, etc.) on remote storage. The default page size on disk is 1MB, evictions follow an LRU policy, and local disks are required to enable this cache.

To enable this cache, update the worker configuration in etc/catalog/<catalog-name>.properties with the properties below:

Properties
 
cache.enabled=true 
cache.type=ALLUXIO 
# Size this based on your requirements
cache.alluxio.max-cache-size=150GB 
cache.base-directory=file:///mnt/disk1/cache


Also, add the following Alluxio property to the coordinator and worker etc/jvm.config so that all metrics related to the Alluxio cache are emitted: -Dalluxio.user.app.id=presto
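
For reference, etc/jvm.config is just a list of JVM options, one per line, so the flag is simply appended alongside whatever options your deployment already defines:

-Dalluxio.user.app.id=presto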

#2 Fragment Result Set Cache

This is an intermediate result set cache that stores partially computed result sets on the worker’s local SSD drive. It prevents duplicated computation across multiple queries, which improves query performance and decreases CPU usage.

Add the following properties to config.properties:

Properties
 
fragment-result-cache.enabled=true 
fragment-result-cache.max-cached-entries=1000000 
fragment-result-cache.base-directory=file:///data/presto-cache/2/fragmentcache 
fragment-result-cache.cache-ttl=24h


#3 Metastore Cache

The Presto coordinator caches table metadata (schema, partition list, and partition info) to avoid long getPartitions calls to the metastore. The cache is versioned so that the validity of cached metadata can be confirmed.

To enable the metastore cache, set the properties below in the catalog properties file (etc/catalog/<catalog-name>.properties):

Properties
 
hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000


#4 File List Cache

The Presto coordinator caches file lists from remote storage partition directories to avoid long listFile calls to remote storage. This is an in-memory cache on the coordinator.

Enable the file list cache by setting the properties below in etc/catalog/<catalog-name>.properties:

Properties
 
# List file cache
hive.file-status-cache-expire-time=24h 
hive.file-status-cache-size=100000000 
hive.file-status-cache-tables=*
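
If you prefer not to cache file listings for every table, hive.file-status-cache-tables also accepts a comma-separated list of tables instead of the * wildcard. A sketch using the example tables referenced by the verification queries later in this article (adjust the schema and table names to your own):

# Cache file listings only for selected tables
hive.file-status-cache-tables=default.customer_orc,default.customer_parquet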


#5 File Metadata Cache

This cache keeps open file descriptors and stripe/file footer information in worker memory. These pieces of data are the most frequently accessed when reading files. The cache is useful not just for decreasing query latency but also for reducing CPU utilization.

This is an in-memory cache and is suitable for the ORC and Parquet file formats.

For ORC, it caches the file tail (postscript, file footer, file metadata), stripe footers, and stripe streams (row indexes/bloom filters).

For Parquet, it caches the file- and block-level metadata.

To enable the file metadata cache, set the properties below in the catalog properties file (etc/catalog/<catalog-name>.properties):

Properties
 
# For ORC metadata cache: 
<catalog-name>.orc.file-tail-cache-enabled=true 
<catalog-name>.orc.file-tail-cache-size=100MB 
<catalog-name>.orc.file-tail-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-metadata-cache-enabled=true 
<catalog-name>.orc.stripe-footer-cache-size=100MB 
<catalog-name>.orc.stripe-footer-cache-ttl-since-last-access=6h 
<catalog-name>.orc.stripe-stream-cache-size=300MB 
<catalog-name>.orc.stripe-stream-cache-ttl-since-last-access=6h 

# For Parquet metadata cache: 
<catalog-name>.parquet.metadata-cache-enabled=true 
<catalog-name>.parquet.metadata-cache-size=100MB 
<catalog-name>.parquet.metadata-cache-ttl-since-last-access=6h


The <catalog-name> prefix in the above configuration should be replaced with the name of the catalog you are configuring. For example, if the catalog properties file is named ahana_hive.properties, then the prefix becomes “ahana_hive”.
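
As a concrete illustration, here is what the ORC portion of that configuration would look like in ahana_hive.properties after the substitution (the same pattern applies to the Parquet properties):

ahana_hive.orc.file-tail-cache-enabled=true
ahana_hive.orc.file-tail-cache-size=100MB
ahana_hive.orc.file-tail-cache-ttl-since-last-access=6h
ahana_hive.orc.stripe-metadata-cache-enabled=true
ahana_hive.orc.stripe-footer-cache-size=100MB
ahana_hive.orc.stripe-footer-cache-ttl-since-last-access=6h
ahana_hive.orc.stripe-stream-cache-size=300MB
ahana_hive.orc.stripe-stream-cache-ttl-since-last-access=6h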

#6 Affinity Scheduler

With affinity scheduling, the Presto coordinator schedules requests that process a given data file to the same Presto worker node to maximize cache hits. Consistently sending requests for the same data to the same worker node means fewer remote calls to retrieve data.

Data caching is not supported with random node scheduling, so this property must be enabled for the RaptorX data IO, fragment result, and file metadata caches to work.

To enable the affinity scheduler, set the property below in the catalog properties file (etc/catalog/<catalog-name>.properties):

hive.node-selection-strategy=SOFT_AFFINITY
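
Putting the catalog-level pieces together, a hypothetical ahana_hive.properties that enables the data IO cache, affinity scheduling, file list cache, metastore cache, and file metadata cache might combine the settings from the sections above roughly as sketched below. The connector.name, metastore URI, and other required connector settings are omitted; sizes and TTLs are the example values from earlier and should be tuned to your cluster, and the same file is typically deployed to both the coordinator and the workers:

# Data IO cache (worker local disk)
cache.enabled=true
cache.type=ALLUXIO
cache.alluxio.max-cache-size=150GB
cache.base-directory=file:///mnt/disk1/cache

# Affinity scheduling (required for the data IO, fragment result, and file metadata caches)
hive.node-selection-strategy=SOFT_AFFINITY

# File list cache (coordinator in-memory)
hive.file-status-cache-expire-time=24h
hive.file-status-cache-size=100000000
hive.file-status-cache-tables=*

# Metastore cache (coordinator in-memory)
hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000

# File metadata cache (worker in-memory); the prefix matches the catalog name
ahana_hive.orc.file-tail-cache-enabled=true
ahana_hive.parquet.metadata-cache-enabled=true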

How Can You Test or Debug Your RaptorX Cache Setup With JMX Metrics?

Each section below lists the queries to run and the JMX queries that verify cache usage.

Note: If your catalog is not named ‘ahana_hive,’ you will need to change the table names to verify the cache usage. Substitute ahana_hive with your catalog name.
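
The verification queries below also assume example tables named customer_orc and customer_parquet exist in the ahana_hive.default schema. If you need test tables, a sketch using the built-in TPC-H connector (assuming a tpch catalog is configured in your cluster) could look like this:

-- Hypothetical test tables built from the TPC-H connector's customer table
CREATE TABLE ahana_hive.default.customer_orc
WITH (format = 'ORC')
AS SELECT * FROM tpch.tiny.customer;

CREATE TABLE ahana_hive.default.customer_parquet
WITH (format = 'PARQUET')
AS SELECT * FROM tpch.tiny.customer;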

Data IO Cache

Queries to trigger Data IO cache usage:

SQL
 
USE ahana_hive.default; 
SELECT count(*) from customer_orc group by nationkey; 
SELECT count(*) from customer_orc group by nationkey;


Queries To Verify Data IO Cache Usage

-- Cache hit rate.
SELECT * from 
jmx.current."com.facebook.alluxio:name=client.cachehitrate.presto,type=gauges";

-- Bytes read from the cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesreadcache.presto,type=meters";

-- Bytes requested from cache
SELECT * FROM 
jmx.current."com.facebook.alluxio:name=client.cachebytesrequestedexternal.presto,type=meters";

-- Bytes written to cache on each node.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheBytesWrittenCache.presto,type=meters";

-- The number of cache pages (of size 1MB) currently on disk
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CachePages.presto,type=counters";

-- The amount of cache space available.
SELECT * from 
jmx.current."com.facebook.alluxio:name=Client.CacheSpaceAvailable.presto,type=gauges";

-- There are many other metrics tables that you can view using the below command.
SHOW TABLES FROM 
jmx.current like '%alluxio%';
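
If you want to check whether any individual worker’s cache is underperforming, the hit rate gauge above can be broken out per node. This sketch assumes the JMX connector surfaces the gauge’s value as a lowercase "value" column alongside the standard node column, which may vary by version:

-- Per-worker cache hit rate (one row per worker node)
SELECT node, "value" AS cache_hit_rate
FROM jmx.current."com.facebook.alluxio:name=client.cachehitrate.presto,type=gauges"
ORDER BY cache_hit_rate;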


Fragment Result Cache

An example of the query plan fragment that is eligible for having its results cached is shown below.

Fragment 1 [SOURCE] 
Output layout: [count_3] Output partitioning: SINGLE [] Stage Execution 
Strategy: UNGROUPED_EXECUTION 
- Aggregate(PARTIAL) => [count_3:bigint] count_3 := "presto.default.count"(*) 
- TableScan[TableHandle {connectorId='hive', 
connectorHandle='HiveTableHandle{schemaName=default, tableName=customer_orc, 
analyzePartitionValues=Optional.empty}', 
layout='Optional[default.customer_orc{}]'}, gr Estimates: {rows: 150000 (0B), 
cpu: 0.00, memory: 0.00, network: 0.00} LAYOUT: default.customer_orc{}


Queries To Trigger Fragment Result Cache Usage

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;


Query Fragment Result Set Cache JMX Metrics

-- All Fragment result set cache metrics like cachehit, cache entries, size, etc 
SELECT * FROM 
jmx.current."com.facebook.presto.operator:name=fragmentcachestats";


ORC Metadata Cache

Queries to trigger ORC cache usage:

SELECT count(*) from customer_orc; 
SELECT count(*) from customer_orc;


Query ORC Metadata cache JMX metrics:

-- File tail cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_orcfiletail,type=cachestatsmbean";

 -- Stripe footer cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripefooter,type=cachestatsmbean"; 

-- Stripe stream(Row index) cache metrics 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_stripestream,type=cachestatsmbean";


Parquet Metadata Cache

Queries to trigger Parquet metadata cache:

SELECT count(*) from customer_parquet; 
SELECT count(*) from customer_parquet;


Query Parquet Metadata cache JMX metrics:

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive_parquetmetadata,type=cachestatsmbean";


File List Cache

Query File List cache JMX metrics:

-- Verify cache usage 
SELECT * FROM 
jmx.current."com.facebook.presto.hive:name=ahana_hive,type=cachingdirectorylister";


This should help you get started with RaptorX multi-level caching with Presto. If you have specific questions about your Presto deployment, head over to the Presto open-source community Slack, where you can get help.


Published at DZone with permission of Rohan Pednekar.

Opinions expressed by DZone contributors are their own.
