Last year, Qubole introduced RubiX, a fast caching framework for big data engines. RubiX enables engines such as Presto, Hive, and Spark to cache data from a cloud object store on local disk for faster access.
Currently, 90% of Qubole's customers using Presto have enabled RubiX by default in their clusters and they are experiencing significant performance gains for their read-intensive workloads. Today, we are announcing support for RubiX on open-source Presto. In this post, we will walk you through how to configure your EMR Presto cluster to enable RubiX and show you the performance gain we have experienced.
RubiX works on the following basic principles.
Split Ownership Assignment
The entire file is divided into chunks, and consistent hashing assigns each chunk to an owner node. RubiX uses this ownership information to give hints to each engine's task scheduler so that tasks are assigned to the owning nodes.
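The ownership assignment can be illustrated with a small consistent-hash ring. This is a minimal sketch, not RubiX's implementation: the MD5 hash, the number of virtual nodes, the chunk size passed by the caller, and the names `HashRing` and `owner_of` are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """A toy consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=64):
        # Place every node on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner_of(self, path: str, offset: int, chunk_size: int) -> str:
        """Return the node that owns the chunk containing `offset` of `path`."""
        chunk_index = offset // chunk_size
        point = _hash(f"{path}:{chunk_index}")
        # Pick the first ring point clockwise from the key, wrapping at the end.
        idx = bisect.bisect_right(self._points, point) % len(self._points)
        return self._ring[idx][1]
```

Because a chunk's position on the ring depends only on the file path and chunk index, every node computes the same owner for the same chunk, and removing a node only reassigns the chunks that node owned.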
Each file chunk described above is divided into logical blocks of a configured size (1 MB by default). The metadata of these blocks (whether each one is cached or not) is kept in the BookKeeper daemon and is also checkpointed to local disk. When a read request comes in, it is converted into logical blocks, and based on the metadata of those blocks, the system decides whether to read the data from the remote location (the object store in this case) or from the cache on the local node or on another node.
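The block-level lookup described above can be sketched as follows. This is an illustrative model only: the `BlockMetadata` class and its methods are hypothetical stand-ins for the bookkeeping that the BookKeeper daemon performs, and only the 1 MB default block size comes from the text.

```python
BLOCK_SIZE = 1024 * 1024  # 1 MB, the default logical block size mentioned above


def blocks_for_read(offset: int, length: int, block_size: int = BLOCK_SIZE) -> range:
    """Return the logical block indices touched by reading `length` bytes at `offset`."""
    if length <= 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


class BlockMetadata:
    """Hypothetical in-memory record of which blocks of each file are cached."""

    def __init__(self):
        self._cached = {}  # file path -> set of cached block indices

    def mark_cached(self, path, blocks):
        """Record that the given blocks of `path` are now in the cache."""
        self._cached.setdefault(path, set()).update(blocks)

    def plan_read(self, path, offset, length):
        """Split a read into (blocks served from cache, blocks fetched remotely)."""
        cached = self._cached.get(path, set())
        wanted = blocks_for_read(offset, length)
        hits = [b for b in wanted if b in cached]
        misses = [b for b in wanted if b not in cached]
        return hits, misses
```

In this model, a first read of a file returns every block as a miss; once those blocks are marked as cached, a repeated read of the same range is served entirely from the cache.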
This architecture gives RubiX the following advantages:
- Works well with columnar file formats: RubiX reads and caches only the data that is required.
- Engine-independent scheduling logic: Task assignment is based on the locality of the data, which is determined by consistent hashing of the file chunk.
- Shared cache across JVMs: Because the cache metadata is stored in a single JVM outside of any engine, any big data framework can use this information to benefit from the cache.
To analyze the performance impact of RubiX, we compared how long a job takes to run when data is on S3 with how long the same job takes when the data is in local disk through RubiX cache.
We selected TPC-DS data of scale 500 in ORC format as our base dataset. We selected 20 queries with a good mix of interactive and reporting queries to avoid bias. The queries can be found here.
For both the S3 read and the local RubiX cache read, the queries were run three times; the best run is represented here. For the local RubiX cache read, we pre-populated the cache via earlier runs of the same queries.
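The "best of three runs" methodology can be sketched as a small timing helper. The post does not say how the timings were actually collected, so this is only an illustration: `best_of` and its `run` argument (a placeholder for whatever submits one query and waits for it to finish) are hypothetical names.

```python
import time


def best_of(run, repeats=3, clock=time.perf_counter):
    """Call `run` `repeats` times and return the shortest elapsed time in seconds."""
    timings = []
    for _ in range(repeats):
        start = clock()
        run()
        timings.append(clock() - start)
    return min(timings)
```

Taking the minimum rather than the mean reduces the influence of one-off slowdowns such as a noisy neighbor or a garbage-collection pause.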
The following graph shows the performance improvement provided by RubiX.
As you can see from the above chart, RubiX provides better performance for every query; the improvement ranges from 6 percent to 32 percent. To summarize:
Max improvement: 32% (Query 42)
Min improvement: 6% (Query 59)
Average improvement: 20%
The percentage improvement is calculated as:
(Time without RubiX - Time with RubiX) / Time without RubiX * 100
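As a worked example of this formula, the following sketch computes the improvement from two timings. The function name is an assumption, and any timings passed to it here are hypothetical, not measurements from the benchmark.

```python
def percent_improvement(time_without_rubix: float, time_with_rubix: float) -> float:
    """Return the relative run-time reduction as a percentage of the uncached time."""
    if time_without_rubix <= 0:
        raise ValueError("time_without_rubix must be positive")
    return (time_without_rubix - time_with_rubix) / time_without_rubix * 100.0
```

For instance, with hypothetical timings of 100 seconds on S3 and 68 seconds with a warm cache, `percent_improvement(100.0, 68.0)` comes out at roughly 32 percent.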
Because the RubiX cache is optimized for reading the data from local disk, there are greater performance gains for read-intensive jobs. For CPU-intensive jobs, the gains made while reading the data are offset by the time spent on CPU cycles.
Setting Up RubiX
Presto version: 0.184
Number of slave nodes: 2
Node type: r3.2xlarge
- Log in to the master node as the hadoop user once the cluster is up, and set up passwordless SSH among the cluster nodes:
ssh -i <key> hadoop@master-public-ip
- Install RubiX Admin on the master node. This is the admin tool for installing RubiX on the cluster and starting the necessary daemons.
sudo yum install gcc libffi-devel python-devel openssl-devel
sudo pip install rubix_admin
rubix_admin -h
This creates a file, ~/.radminrc, with the following content:
hosts:
  - localhost
remote_packages_path: /tmp/rubix_rpms
- Add the public DNS addresses of the slave nodes to the hosts list in ~/.radminrc, then use rubix_admin to install the latest version of RubiX:
rubix_admin installer install
You can also specify the rpm path if you need to install any other version of RubiX:
rubix_admin installer install --rpm <path to the rpm>
- Once RubiX is installed, start the RubiX daemons:
rubix_admin daemon start
- Verify that the daemons are up:
sudo jps -m
pid1 RunJar /usr/lib/rubix/lib/rubix-bookkeeper-0.2.12.jar com.qubole.rubix.bookkeeper.BookKeeperServer
pid2 RunJar /usr/lib/rubix/lib/rubix-bookkeeper-0.2.12.jar com.qubole.rubix.bookkeeper.LocalDataTransferServer
- If the daemons have not started, check the logs at the following locations:
For BookKeeper: /var/log/rubix/bks.log
For LocalDataTransferServer: /var/log/rubix/lds.log
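The daemon check in the steps above can also be scripted. The following is a minimal sketch that scans the text printed by `jps -m` for the two daemon class names shown earlier; the function name `missing_daemons` is hypothetical, and the parsing assumes each line starts with a PID followed by space-separated tokens.

```python
EXPECTED_DAEMONS = (
    "com.qubole.rubix.bookkeeper.BookKeeperServer",
    "com.qubole.rubix.bookkeeper.LocalDataTransferServer",
)


def missing_daemons(jps_output: str, expected=EXPECTED_DAEMONS):
    """Return the expected daemon class names that do not appear in `jps -m` output."""
    running = set()
    for line in jps_output.splitlines():
        # Skip the leading PID and keep every remaining token on the line.
        running.update(line.split()[1:])
    return [name for name in expected if name not in running]
```

On a cluster node, the input could come from `subprocess.run(["sudo", "jps", "-m"], capture_output=True, text=True).stdout`. An empty result means both daemons were found; otherwise the returned names indicate which log file to inspect.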
Running a Simple Example
We will run a very simple Presto query that reads data from a remote location in S3 and does some basic aggregation. The first run of the query warms up the cache by downloading the required data, and subsequent runs use the local cache data to do the necessary aggregation.
- Run Hive as follows to create the external tables:
hive --hiveconf hive.metastore.uris="" --hiveconf fs.rubix.impl=com.qubole.rubix.hadoop2.CachingNativeS3FileSystem
We are starting Hive pointing to its embedded Metastore server. Because the RubiX-related JARs are pushed into the Hadoop lib directory during RubiX installation, the thrift Metastore server needs to be restarted to be aware of the custom scheme rubix. We also need to set the awsSecretAccessKey for the rubix scheme.
- Run the following command in Hive to create an external table:
CREATE EXTERNAL TABLE wikistats_orc_rubix (
  language STRING,
  page_title STRING,
  hits BIGINT,
  retrived_size BIGINT
)
STORED AS ORC
LOCATION 'rubix://emr.presto.airpal/wikistats/orc';
- Start the Presto client:
presto-cli --catalog hive --schema default
- Run the following query:
SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats_orc_rubix
WHERE language = 'en'
  AND page_title NOT IN ('Main_Page', '404_error/')
  AND page_title NOT LIKE '%index%'
  AND page_title NOT LIKE '%Search%'
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;
The cache statistics are pushed to an MBean named rubix:name=stats. We can run the following query to see the stats:
SELECT Node,
  CachedReads,
  ROUND(ExtraReadFromRemote, 2) AS ExtraReadFromRemote,
  ROUND(HitRate, 2) AS HitRate,
  ROUND(MissRate, 2) AS MissRate,
  ROUND(NonLocalDataRead, 2) AS NonLocalDataRead,
  NonLocalReads,
  ROUND(ReadFromCache, 2) AS ReadFromCache,
  ROUND(ReadFromRemote, 2) AS ReadFromRemote,
  RemoteReads
FROM jmx.current."rubix:name=stats";
After the first run (cache warmup) of the Presto job, the numbers look like the following:
| Node | CachedReads | ExtraReadFromRemote | HitRate | MissRate | NonLocalDataRead | NonLocalReads | ReadFromCache | ReadFromRemote | RemoteReads |
RubiX was started to eliminate I/O latency from object stores in public clouds for fast ad-hoc data analysis. The architecture of RubiX is a result of running shared-storage, auto-scaling Presto, Hive, and Spark clusters that process petabytes of data stored in columnar formats. We have made RubiX available as open source to work with the big data community on solving similar issues across big data engines, Hadoop services, and distributions, on premises or in public clouds. If you are interested in using RubiX or developing on it, please join the community through the links below.