Accelerate Spark and Hive Jobs on AWS S3 by 10x With Alluxio Tiered Storage
Learn how you can maximize performance while minimizing operating costs on AWS S3.
Join the DZone community and get the full member experience.Join For Free
In this article, we describe how to leverage Alluxio to build a tiered storage architecture with AWS S3 that can maximize performance and minimize operating costs while running big data analytics on AWS EC2. This blog aims to provide the following takeaways:
- Common challenges in performance and cost to build an efficient big data analytics platform on AWS.
- How to set up Hive metastore to leverage Alluxio as the storage tier for “hot tables” backed by all tables on AWS S3 as the source of truth.
- Benchmark results of micro and real-world workloads.
Our Big Data Workload and Platform on AWS
To keep up with the web traffic, we host our services via Amazon Web Service. Our big data platform relies on a fully open-source Hadoop ecosystem, utilizing tools such as Apache Hive, Spark for ETLs, Kafka, Storm for unbounded dataset and processing, Cassandra, ElasticSearch, and HBase for durable datastore and long-term aggregations. The data generated by these various services will eventually be stored in S3 in optimized formats for analytics such as Parquet and ORC.
With AWS S3, we are able to scale out our storage capacity effortlessly, except that it exposes certain limitations for our needs and prevents us from further scaling “up.” For instance, you cannot access an S3 file faster than what S3 allows at a connection level. You cannot tell S3 to cache data in memory or be co-located with an application for faster analytics and repeated query. Or worse, if you don’t provision enough hardware to process S3 dataset in parallel, then scanning 1TB of data is just too slow. S3 is also not open-source, limiting our ability to customize it to fit our specific needs.
A simple solution to this problem could be to add more hardware and spend more money. However, with a limited budget in mind, we set out to find something better in late 2017 to solve the scaling up S3 challenge.
The goals were to accelerate data access on S3 via a tiered storage system. We realized that not all data access need to be fast: the access pattern of our workload is often selective and changing. As a result, we have the following goals in mind for this tiered storage:
- It should be nimble: it could expand and contract as our business grows without data movement and ETL;
- It should provide the same level of interoperability for most of our existing systems;
- It should allow us to simply “scale up” by having better hardware or when we have the budget for it. Thus, the system needs to be highly configurable and relatively cheap to reconfigure operational-wise.
We picked Alluxio for meeting many of our criteria. Initial micro benchmarks have shown great performance improvement over accessing directly on S3 in Spark and Hive. For example, some of our tasks that scanned the same dataset repeatedly would reduce their execution time by 10-15x on the second run.
Architecture With Alluxio
We integrated Alluxio with Hive Metastore as a basis for the tiered storage S3 acceleration layer. Clusters can be elastically provisioned with support for our tiered storage layer.
In each job cluster (Hive on Tez and Spark) or interactive cluster (Hive 3 on LLAP), a table could be altered to have their last weeks or months worth of data configured to use Alluxio, while the rest of the data still referenced data on S3 directly.
Since the data is not cached in Alluxio unless it is accessed via a Hive or Spark task, there’s no data movement and a table reconfiguration is extremely cheap. This has allowed us to be very nimble to adapt to the changing querying pattern.
For example, let us look at our page view dataset. A page view is an event recorded when an Internet user visited one of the pages of our client’s websites. The raw data is collected and converted into Parquet format for every new hour worth of data. This data is stored in S3 in a year, month, day and hour in a hierarchical structure:
s3://some-bucket/ |- pageview/ |- 2019/ |- 02/ |- 14/ |- 01/ |- 02/ |- 03/ |- pageview_file1.parquet |- pageview_file2.parquet
The data is registered to one of our clusters via Hive Metastore and available to be used in Hive or Spark for analytics, ETLs, and various purposes. An internal system takes care of this registration task. It updates the Hive Metastore directly for every new dataset that it detected.
# add partition from s3. equivalent to always referencing “cold” data ALTER TABLE pageview ADD PARTITION (year=2019, month=2, day=14, hour=3) LOCATION ‘s3://<bucket>/pageview/2019/02/14/03’
In addition, our system is also aware of the tiered storage and a set of configuration provided for each specific cluster and table.
For instance, in an interactive cluster where analysts are analyzing the last few weeks of trending data, the partitions from the table
pageview that are less than a month old are configured to read directly from Alluxio filesystem. An internal system reads this configuration to mount the S3 bucket to the cluster via Alluxio REST API. Then, it automatically takes care of promoting/demoting tables and partitions using the following Hive DDLs:
# promoting. repeating query will cache the data from cold->warm->hot ALTER TABLE pageview PARTITION (year=2019, month=2, day=14, hour=3) SET LOCATION ‘alluxio://<address>/mnt/s3/pageview/2019/02/14/03’ # demoting. 1 month older data goes back to the “cold” tier. # this protects our cache when people occasionally read older data ALTER TABLE pageview PARTITION (year=2019, month=1, day=14, hour=3) SET LOCATION ‘s3://<bucket>/pageview/2019/01/14/03’
This process happens asynchronously and continuously as newer data arrives, older data needs to be demoted, or configuration for the tiered storage has changed.
Micro- and Real-World Benchmark Results
Our Alluxio + ZFS + NVMe SSD read microbenchmark is performed on an i3.4xlarge AWS instance with up to 10 Gbit network, 128GB of RAM, and two 1.9TB NVMe SSD. We mount a real-world, production-S3 bucket to Alluxio and perform two read tests. The first test downloads about 5Gb of Parquet data recursively using AWS CLI to ramdisk (to measure only read performance).
# using AWS cli to copy recursively to RAM disk $ time aws s3 cp --recursive s3://<bucket>/dir /mnt/ramdisk
The second test uses Alluxio CLI to download the same Parquet data also to ramdisk. This time, we perform the test three times to get the cold, warm, and hot numbers, as shown in the chart above.
# using AWS cli to copy recursively to RAM disk $ time ./bin/alluxio fs copyToLocal /mnt/s3/dir /mnt/ramdisk
Alluxio v1.7 with ZFS and NVMe is taking about 78 percent longer to perform a cold read when compared to S3 (66s vs. 37s). The successive warm and hot read are 2.5 and 4.6 times faster than reading directly from S3.
However, microbenchmarks don’t always tell the whole story, thus we will take a look at a few real-world queries executed by actual analysts.
We rerun the queries on the tiered storage system (the default configuration) and S3 (by turning off the tired storage) on our production interactive cluster on a weekend. The production cluster is a 20 x i3.4xlarge nodes running Hadoop 3, Hive 3.0 on Alluxio 1.7 and ZFS v0.7.12-1.
Query 1 is a single table deduplication query with 6 columns to group by. Query 2 is a more complex 4 tables join query with aggregations and group by on 5 columns. Query 1 processes 95M input records and 8G of Parquet data from a 200G dataset. Query 2 processes 1.2B input records and 50G of Parquet data from several TB datasets.
The two queries simplified for readability are shown below. Compared to running the queries directly on S3, the tiered storage system has sped up the query 1 by 11x and query 2 by 6x.
# query 1 simplified SELECT dt, col1, col2, col3, col4, col5 FROM table_1 n WHERE lower(col6) = lower('<some_value>') AND month = 1 AND year = 2019 AND (col7 IS NOT NULL OR col8 IS NOT NULL) GROUP BY dt, col1, col2, col3, col4, col5 ORDER BY dt DESC LIMIT 100; # query 2 simplified SELECT c1, c2, c3, c4, c5, SUM(col3 / col4), SUM(col5 * col6) FROM table_2 s JOIN table_3 p ON s.col8 = p.col9 JOIN table_4 e ON s.col10 = s.col11 JOIN table_5 t ON s.col12 = s.col13 WHERE dt >= last_few_months() GROUP BY c1, c2, c3, c4, c5
Alluxio, ZFS, and the tiered storage architecture have helped us saved a significant amount of time for analysts. AWS S3 is easy to scale out and by augmenting it with a tiered storage configuration that is nimble and cheap to adapt, we can focus on growing our business and “scale up” the storage as needed.
Today, the current production config can handle about 35T of data in the cache and half a petabyte of data on S3. To grow tomorrow, we can simply add more nodes to enable more cached datasets, upgrade the hardware (i3.4xlarge -> i3.8xlarge -> i3.metal) for quicker localized access, or a combination of both.
Opinions expressed by DZone contributors are their own.