
Caching on EMR Using RubiX: Performance Benchmark and Benefits


Walk through how to configure your EMR Presto cluster to enable RubiX, and see the performance gains you can expect.



Last year, Qubole introduced a fast caching framework for big data engines. This framework enables big data engines such as Presto, Hive, and Spark to localize the data from a cloud object store to the local disk for faster access.

Currently, 90% of Qubole's customers using Presto have enabled RubiX by default in their clusters and they are experiencing significant performance gains for their read-intensive workloads. Today, we are announcing support for RubiX on open-source Presto. In this post, we will walk you through how to configure your EMR Presto cluster to enable RubiX and show you the performance gain we have experienced.

RubiX Architecture

RubiX works on the following basic principles.

Split Ownership Assignment

The entire file is divided into chunks, and with the help of consistent hashing, a node is assigned to be the owner of a particular chunk. RubiX uses this ownership information to provide hints to the respective task schedulers to assign tasks on those nodes.
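The ownership scheme can be sketched in a few lines of Python. This is an illustrative model only, not RubiX's actual implementation; the class, node names, and chunk-key format below are hypothetical:

```python
import hashlib
from bisect import bisect_right

def _hash(key: str) -> int:
    # Hash a string key onto a large integer ring
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Illustrative consistent-hashing ring (not RubiX's actual code)."""

    def __init__(self, nodes, replicas=100):
        # "replicas" are virtual points per node that smooth the distribution;
        # each (hash, node) pair is a point on the ring
        self.ring = sorted((_hash(f"{n}#{i}"), n)
                           for n in nodes for i in range(replicas))
        self.keys = [h for h, _ in self.ring]

    def owner(self, file_path: str, chunk_index: int) -> str:
        # A chunk is owned by the first node clockwise from its hash
        h = _hash(f"{file_path}:{chunk_index}")
        i = bisect_right(self.keys, h) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["node1", "node2"])
# Every chunk of a file deterministically maps to one owning node,
# so the scheduler can place the task for that chunk on its owner
owners = [ring.owner("s3://bucket/data.orc", c) for c in range(8)]
```

Because the mapping depends only on the file path, chunk index, and node list, every node computes the same assignment independently, which is what lets RubiX hand locality hints to the engine's task scheduler without central coordination.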

Block-Level Caching

Each file chunk described above is divided into logical blocks of a configured size (1 MB by default). The metadata of these blocks (cached or not) is kept in the BookKeeper daemon and is also checkpointed to local disk. When a read request comes in, it is converted into logical blocks, and based on the metadata of those blocks, the system decides whether to read the data from the remote location (the object store in this case) or from the cache on local or non-local nodes.
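The block translation can be sketched as follows, assuming the 1 MB default block size. The class and method names here are hypothetical, not RubiX's API:

```python
# Illustrative sketch of block-level cache metadata (not RubiX's actual API):
# a read of [offset, offset + length) is translated into 1 MB logical blocks,
# and each block's cached/uncached status decides local vs. remote read.

BLOCK_SIZE = 1024 * 1024  # default logical block size: 1 MB

def blocks_for_read(offset: int, length: int):
    # Indices of all logical blocks the byte range touches
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return list(range(first, last + 1))

class BlockCacheMetadata:
    def __init__(self):
        self.cached = set()  # block indices present on local disk

    def mark_cached(self, block: int):
        self.cached.add(block)

    def plan_read(self, offset: int, length: int):
        # Split the request into (block, source) pairs
        return [(b, "local" if b in self.cached else "remote")
                for b in blocks_for_read(offset, length)]

meta = BlockCacheMetadata()
meta.mark_cached(0)
plan = meta.plan_read(0, 3 * BLOCK_SIZE)  # touches blocks 0, 1, 2
# → [(0, 'local'), (1, 'remote'), (2, 'remote')]
```

Only the missing blocks are fetched remotely, which is why RubiX pairs well with columnar formats: a query that reads a few columns touches only a few blocks, and only those are downloaded and cached.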

This architecture gives RubiX the following advantages:

  • Works well with columnar file format: RubiX reads and caches only the data that is required.

  • Engine-independent scheduling logic: The task assignment works on the locality of the data, which is determined by using consistent hashing of the file chunk.

  • Shared cache across JVMs: Because the metadata of the cache is stored in a single JVM outside of any engine, any big data framework can use this information to get the benefits of the cache.

Performance Analysis

To analyze the performance impact of RubiX, we compared how long a job takes to run when data is on S3 with how long the same job takes when the data is in local disk through RubiX cache.

We selected TPC-DS data at scale 500 in ORC format as our base dataset. We selected 20 queries with a good mix of interactive and reporting queries to eliminate any kind of bias. The queries can be found here.

For both S3 read and local RubiX cache read, the queries were run three times; the best run is represented here. For the local RubiX cache read, we have pre-populated the cache via earlier runs of the same queries.

The following graph shows the performance improvement provided by RubiX.

As you can see from the above chart, RubiX provides better performance for every query; the improvement ranges from 6 percent to 32 percent. To summarize:

  • Max improvement: 32% (Query 42)

  • Min improvement: 6% (Query 59)

  • Average improvement: 20%

The percentage improvement is calculated as:

(Time without RubiX - Time with RubiX) / Time without RubiX
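As a quick sanity check of the formula, here it is applied to hypothetical timings. The article reports only percentages, so the 100 s and 68 s figures below are made up for illustration:

```python
def improvement(time_without_rubix: float, time_with_rubix: float) -> float:
    # Fractional speedup relative to the uncached (S3) run
    return (time_without_rubix - time_with_rubix) / time_without_rubix

# A query that drops from a hypothetical 100 s to 68 s shows a 32%
# improvement, matching the best case (Query 42) reported above
print(round(improvement(100.0, 68.0) * 100))  # → 32
```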

Because the RubiX cache is optimized for reading the data from local disk, there are greater performance gains for read-intensive jobs. For CPU-intensive jobs, the gains made while reading the data are offset by the time spent on CPU cycles.

Setting Up RubiX

Cluster configuration:

  • Presto version: 0.184

  • Number of slave nodes: 2

  • Node type: r3.2xlarge

  1. Log in to the master node as the Hadoop user once the cluster is up and set up passwordless SSH among the cluster nodes:
ssh -i <key> hadoop@master-public-ip
  2. Install Rubix Admin on the master node. This is the admin tool for installing RubiX on the cluster and starting the necessary daemons:
sudo yum install gcc libffi-devel python-devel openssl-devel
sudo pip install rubix_admin
rubix_admin -h

This creates a file ~/.radminrc with the following content:

hosts:
  - localhost
remote_packages_path: /tmp/rubix_rpms
  3. Add the public DNS addresses of the slave nodes in the hosts section of ~/.radminrc.
  4. Use rubix_admin to install the latest version of RubiX:
rubix_admin installer install

You can also specify the RPM path if you need to install any other version of RubiX:

rubix_admin installer install --rpm <path to the rpm>
  5. Once RubiX is installed, start the RubiX daemons:
rubix_admin daemon start
  6. Verify that the daemons are up:
sudo jps -m
pid1 RunJar /usr/lib/rubix/lib/rubix-bookkeeper-0.2.12.jar com.qubole.rubix.bookkeeper.BookKeeperServer
pid2 RunJar /usr/lib/rubix/lib/rubix-bookkeeper-0.2.12.jar com.qubole.rubix.bookkeeper.LocalDataTransferServer
  7. If the daemons have not started, check the logs at the following locations:
For BookKeeper -- /var/log/rubix/bks.log
For LocalDataTransferServer -- /var/log/rubix/lds.log

Running a Simple Example

We will run a very simple Presto query that reads data from a remote location in S3 and does some basic aggregation. The first run of the query warms up the cache by downloading the required data, and subsequent runs use the local cache data to do the necessary aggregation.

  1. Run Hive as follows to create the external tables:
hive --hiveconf hive.metastore.uris="" --hiveconf fs.rubix.impl=com.qubole.rubix.hadoop2.CachingNativeS3FileSystem

We are starting Hive pointing to its embedded Metastore server. Because the RubiX-related JARs are pushed into the Hadoop lib directory during RubiX installation, the Thrift Metastore server needs to be restarted to become aware of the custom rubix scheme. We also need to set the AWS credentials for the rubix scheme via fs.rubix.awsAccessKeyId and fs.rubix.awsSecretAccessKey.

  2. Run the following command in Hive to create an external table:
CREATE EXTERNAL TABLE wikistats_orc_rubix 
(language STRING, page_title STRING,
hits BIGINT, retrieved_size BIGINT)
LOCATION 'rubix://emr.presto.airpal/wikistats/orc';
  3. Start the Presto client:
presto-cli --catalog hive --schema default
  4. Run the following query:
SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats_orc_rubix
WHERE language = 'en'
AND page_title NOT IN ('Main_Page',  '404_error/')
AND page_title NOT LIKE '%index%'
AND page_title NOT LIKE '%Search%'
GROUP BY language, page_title
ORDER BY avg_hits DESC;

The cache statistics are published to an MBean named rubix:name=stats. We can run the following query to see the stats:

SELECT Node, CachedReads, 
ROUND(ExtraReadFromRemote,2) AS ExtraReadFromRemote, 
ROUND(HitRate,2) AS HitRate, 
ROUND(MissRate,2) AS MissRate, 
ROUND(NonLocalDataRead,2) AS NonLocalDataRead, 
ROUND(ReadFromCache,2) AS ReadFromCache, 
ROUND(ReadFromRemote,2) AS ReadFromRemote
FROM jmx.current."rubix:name=stats";

After the first run (cache warmup) of the Presto job, the numbers look like the following:

Node   CachedReads  ExtraReadFromRemote  HitRate  MissRate  NonLocalDataRead  NonLocalReads  ReadFromCache  ReadFromRemote  RemoteReads
node1  17           16.59                0.13     0.87      0.17              496            6.53           500.14          109
node2  23           35.07                0.1      0.9       0.14              135            9.5            306.47          207


RubiX was started to eliminate I/O latency from object stores in public clouds for fast ad-hoc data analysis. The architecture of RubiX is a result of running shared storage, auto-scaling Presto, Hive, and Spark clusters that process petabytes of data stored in columnar formats. We have made RubiX available in open source to work with the big data community to solve similar issues across big data engines, Hadoop services, and distributions on-premise or in public clouds. If you are interested in using RubiX or developing on it, please join the community through the links below.


