Easiest Way to Benchmark Spark + Alluxio + S3 Stack With TPC-DS Queries on AWS

Learn more about how you can benchmark Spark, Alluxio, and S3 Stack with TPC-DS queries on AWS.

by Rico Chiu · Mar. 15, 19 · Tutorial

The Alluxio sandbox is the easiest way to test drive the popular data analytics stack of Spark, Alluxio, and S3 deployed in a multi-node cluster in a public cloud environment. The sandbox cluster is fully configured and ready for users to run applications ranging from the hello-world example to the TPC-DS benchmark suite. Don’t take our word for it; kick off the benchmark yourself to see the performance benefits of running Spark jobs that interface through Alluxio on S3 compared to running Spark jobs directly on S3. It is extremely easy to request and launch a sandbox cluster as a playground for 24 hours at no cost to you.

Cluster Details

The sandbox cluster consists of 2 master nodes and 4 worker nodes, all running on r4.2xlarge EC2 instances. Alluxio, currently at version 1.8.1, is configured to use an S3 bucket as its root under file storage. It is deployed in high availability mode with a leading master node and a backup master node. To run TPC-DS, Apache Spark is deployed with its master on the first master node and a worker on each of the worker nodes. Note that the Spark workers are co-located with the Alluxio workers so that Spark can take advantage of the data locality provided by Alluxio's in-memory storage on the same node.

Performance Benchmarking

TPC-DS is an industry-standard benchmark suite that measures performance on workloads derived from real-world scenarios. We follow the experiment conducted by Databricks in this blog post and run 3 groups of queries (interactive, reporting, and deep analytics), reporting the cumulative runtime of each group as the result. The 26GB dataset is pre-generated and copied to a new S3 bucket as part of creating the sandbox cluster.
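As a rough illustration of what "cumulative runtime" means here, the sketch below times each query group and sums the results. It is not the harness the sandbox uses; the function names are hypothetical, and the three placeholder callables stand in for whatever actually submits the queries to Spark.

```python
import time
from typing import Callable, Dict


def time_query_groups(groups: Dict[str, Callable[[], None]]) -> Dict[str, float]:
    """Run each query group once and return its wall-clock runtime in seconds."""
    runtimes = {}
    for name, run_group in groups.items():
        start = time.perf_counter()
        run_group()  # hypothetical: would submit every query in the group
        runtimes[name] = time.perf_counter() - start
    return runtimes


def cumulative_runtime(runtimes: Dict[str, float]) -> float:
    """Sum the per-group runtimes into a single figure."""
    return sum(runtimes.values())


# Placeholder callables; a real run would execute the TPC-DS queries instead.
groups = {
    "interactive": lambda: None,
    "reporting": lambda: None,
    "deep analytics": lambda: None,
}

runtimes = time_query_groups(groups)
total = cumulative_runtime(runtimes)
for name, seconds in runtimes.items():
    print(f"{name}: {seconds:.4f} s")
print(f"total: {total:.4f} s")
```

Timing with a monotonic clock such as `time.perf_counter` avoids skew from system clock adjustments during a long run.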

To generate the baseline, we first execute a control run: the benchmark reads its input from, and writes its output to, an S3 bucket directly.

For the run with Alluxio, we preload the dataset into Alluxio, distributing a copy of the dataset amongst its workers. This simulates the scenario where the storage local to the compute provided by Alluxio is hydrated and data is warmed up. This approach is typical for Alluxio deployments.

The benchmark is run with an Alluxio URI as the input and output directories. Below is a sample output of the two iterations:

[Image: sample output of the two benchmark runs]
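From the point of view of the Spark job, the only difference between the two runs is the URI given for the input and output directories. The sketch below shows that difference. Every name in it is an assumption for illustration: the bucket name, the master hostname, the directory layout, and the `s3a://` scheme are hypothetical, and 19998 is Alluxio's default master RPC port, which a given deployment may override. A high availability deployment may also address the masters differently.

```python
# Hypothetical values; substitute the ones from your own cluster.
S3_BUCKET = "my-sandbox-bucket"
ALLUXIO_MASTER = "alluxio-master:19998"  # 19998 is Alluxio's default master RPC port


def benchmark_paths(use_alluxio: bool, dataset: str = "tpcds") -> dict:
    """Return the input and output directory URIs for one benchmark run.

    The baseline run points at S3 directly; the Alluxio run points at the
    same directory layout through an alluxio:// URI.
    """
    if use_alluxio:
        root = f"alluxio://{ALLUXIO_MASTER}"
    else:
        root = f"s3a://{S3_BUCKET}"
    return {
        "input": f"{root}/{dataset}/input",
        "output": f"{root}/{dataset}/output",
    }


baseline = benchmark_paths(use_alluxio=False)
with_alluxio = benchmark_paths(use_alluxio=True)
print(baseline["input"])
print(with_alluxio["input"])
```

Because the job itself is otherwise unchanged, any difference in runtime between the two runs can be attributed to where the data is read from and written to.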

The Results

We have seen that the run with Alluxio providing data locality to compute nodes typically shows a performance improvement ranging from 45 percent to 300 percent, depending on the time of day. This is with a stack running solely on AWS. For hybrid cloud scenarios, such as Spark and Alluxio in AWS using data from an on-premises HDFS cluster, the performance improvement can be greater, depending on the network speed between the on-premises storage cluster and the public cloud. Our open-source community users have reported a range of numbers, up to a 10x improvement.

Improving analytics and machine learning workloads even by 50 percent can have a huge impact: it decreases the cost of computing resources and increases the efficiency of data analysts, enabling them to run more reports or develop more models. As an example, in this run, the total time of the 3 runs was 281 seconds without Alluxio and 129.64 seconds with Alluxio. This equates to a total performance gain of 2.17x. This reduction in time means that computing resources can be taken down sooner to reduce costs.
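The gain quoted above follows directly from the two totals. A short check of the arithmetic, using only the figures reported for this run (the variable names are ours):

```python
baseline_seconds = 281.0   # total time without Alluxio
alluxio_seconds = 129.64   # total time with Alluxio

# Speedup factor: how many times faster the Alluxio run completed.
speedup = baseline_seconds / alluxio_seconds

# Share of the baseline runtime that was eliminated.
time_saved_pct = (1 - alluxio_seconds / baseline_seconds) * 100

print(f"speedup: {speedup:.2f}x")            # speedup: 2.17x
print(f"time saved: {time_saved_pct:.1f}%")  # time saved: 53.9%
```

Put another way, the Alluxio run finished in a little under half the time of the baseline.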

Come Get Your Sandbox!

After signing up on our website, you will receive an email with instructions to SSH into an EC2 instance. From this instance, you can freely manipulate the sandbox cluster via a binary that issues the commands to create, run tests on, and destroy the cluster. Feel free to SSH into the cluster and edit configurations or restart processes; you can always destroy and recreate the cluster to restore its initial state. Note that the upkeep costs for the EC2 instances are covered by Alluxio, which is why the sandbox cluster is only available for 24 hours.

Other Stack?

We hope that the sandbox stack of Spark and S3 is familiar to many of our users, but it is only one of many combinations that can work with Alluxio. If you would like to see a different sandbox stack using other compute frameworks and/or storage systems, we’d love to hear from you. Reach out to our mailing list or Slack channel. Any feedback about the sandbox is always appreciated.
