Impala 2.6 brings read/write support on Amazon S3, which provides cloud capabilities such as direct querying of data from S3, elastic scaling of compute, and seamless data portability and flexibility that are unique among cloud-based analytic databases. With more and more users looking to deploy and run in public-cloud environments, Impala brings the same advantages to the cloud that it brings vs. traditional on-premise databases — serving as a modern, high-performance analytic SQL engine that enables greater business agility, no matter where the data lives.
Our recent previous blog posts (here and here) covered the exciting new cloud capabilities and use cases that Impala uniquely unlocks on AWS. Along with these advances, we are excited to share our functional and performance analysis of Impala on the AWS cloud compared to Amazon’s own traditional analytic database, Redshift.
Supporting details are to follow, but to summarize the results here:
Impala unlocks a broad, cloud-native set of use cases not possible with cloud-based databases by enabling:
- Direct querying of data from S3.
- Elastic scaling of compute.
- Data portability and flexibility.
- Seamless data across multiple filesystems.
- Transient spin-up and spin-down of clusters.
- Impala not only unlocks these unique key capabilities, but also delivers better cost efficiency and performance compared to Redshift tuned for general use: Impala is 200% more cost-efficient than Redshift and offers 4-10x better throughput on either S3 or EBS.
- Even when Redshift is pre-tuned for the specific Fixed Reporting benchmark, Impala is:
- 28% more cost-efficient and has 42% greater throughput on EBS.
- 10% more cost-efficient and has 90% greater throughput on a larger S3 cluster.
- 8% more cost-efficient and has only 17% less throughput on a smaller S3 cluster.
How Does Redshift Compare to Impala?
Impala is a modern MPP analytic database built to work with data stored on open, shared data platforms like Apache Hadoop’s HDFS filesystem, Apache Kudu’s columnar storage, and object stores like S3. By being able to query data from multiple sources stored in different, open formats like Apache Parquet, Apache Avro, and text, Impala decouples data and compute and lets users query data without having to move/load data specifically into Impala clusters.
In the cloud, this capability is especially useful as you can create transient clusters with Impala to run your reports/analytics and shut down the cluster when you are done or elastically scale compute power to support peak demands, letting you save on cluster-hosting costs. Impala is designed to run efficiently on large datasets, and scales to hundreds of nodes and hundreds of users.
In contrast, Redshift is a traditional database with a monolithic architecture (tightly coupled separate storage, metadata, and compute) based on technology Amazon acquired from ParAccel. Redshift runs on dedicated AWS instances and provides an interface to launch, load, and query data. Thus, it would be reasonable to think of Redshift as an on-premise database system that has been pre-configured to run on AWS.
Redshift has its own tightly coupled storage manager and metadata store, and a query engine that queries data stored on its clusters in a proprietary data format. Thus, Redshift takes the “traditional” data ingestion approach of first defining a fixed schema, then ETL’ing a copy of the data into its internal data store, before querying the data. Furthermore, Redshift is built to run on datasets with a few users at a time (Redshift recommends <15 concurrent queries) and more static cluster sizing. (Redshift provisions new clusters with full data copies instead of adding/removing nodes incrementally.) Due to these limitations, it is a common practice among its customers to keep Redshift clusters up and running 24/7 on static data sizes to avoid re-copying data to a new or resized Redshift cluster.
The following table captures some of the notable differences between Redshift and Impala:
Next, let’s take a look at the benchmarks.
For this analysis, we used TPC-DS on a 3TB dataset and selected 70 out of 99 the queries that run without any modifications or uses variants on both Redshift and Impala. We wanted to use a larger dataset (similar to what we’ve used in previous benchmarks), but due to Redshift’s data load times, we had to reduce the data size. (Note: This benchmark is derived from the TPC-DS benchmark and, as such, is not directly comparable to published TPC-DS results.)
For Impala: General Purpose Schema as defined by TPC-DS with fact tables partitioned on the date column.
For Redshift, we ran the benchmark on two sets of schemas:
- General purpose schema: Exactly matches Impala General Purpose Schema above where fact tables are sorted on *_dates_sk columns and as defined by TPC-DS. The General Purpose Schema workload on Redshift is similar to exploratory, self-service, and/or ad-hoc workloads, in contrast to schema design for a repeated well-known set of canned reports like you’d see in the fixed reporting schema described below.
- Fixed Reporting Schema: Includes additional schema optimizations for Redshift tuned specifically for the known TPC-DS workload. Redshift allows admins to set Primary Key, Foreign Key dependencies, sort keys, and distribution keys on the schema. While this approach helps Redshift distribute data in an optimal way during load, if you need to change distribution or sort keys in the future, Redshift will require a full table reorganization—and hence downtime. For optimal performance, Amazon recommends that these keys be set.
- Redshift prefers fewer, bigger, and more expensive nodes. Impala’s scale-out design works faster and more cost-effectively on greater number of cheaper standard cloud instance types.
- To ensure a direct, fair comparison, we used the optimal number of instances to ensure that the CPU counts available for Redshift and Impala were the same.
- The RAM/raw data ratio was also kept consistent across the testing.
Here are the details about hardware configuration:
(Important note: These results reflect the least expensive US regions for Redshift; Redshift customers running in Northern California would pay an additional 33% in cost.)
Impala on S3 offers a unique capability to scale your query engine compute independently from the data to achieve greater throughput and performance. As demonstrated previously, with the combination of cloud pay-as-you-go pricing + S3 integration + cloud elasticity, scaling to larger nodes can enable greater cost-performance by providing more throughput to complete jobs, more quickly. Thus, to measure the impact of this throughput on performance, we included Impala running 64 nodes on S3 (in addition to running 32 nodes), which is naturally a 2x hourly cost.
Time to Resume
One main advantage of the public cloud is the pay-as-you go model, where users pay only for the resources they consume.
As noted above, because of its monolithic architecture, Redshift users cannot just keep data in S3 and spin up compute on demand like Impala users can via the direct-query capability to S3. The next best thing is to sleep Redshift clusters while they aren’t used and resume them as needed.
To ensure a fair, apples-to-apples comparison, in this analysis we comparatively measured how much it would cost a customer to start a Redshift cluster and Impala cluster from sleep and get the first results. For Redshift, time-to-resume is spent restoring the snapshot from S3, so the larger the snapshot, the longer it takes to get access to a runnable, usable cluster.
As seen in the results above, compared to Impala, Redshift takes 15x longer to restore a cluster from sleep and costs 11x more than Impala on EBS. In other words, one can restore an Impala cluster from sleep in about 4.5 minutes, while it takes more than an hour to do the same on Redshift.
ETL + Single-user Tests
Modern analytic databases in the cloud ingest data from multiple sources through an ETL process. One of the most common use cases in the cloud is to perform batch ETL with Apache Hive or Apache Spark on data coming in from multiple sources, and writing it back to S3 to be read by an analytic database like Impala or Redshift for reporting and analytics.
In this test, the data files were loaded from S3 followed by compute stats on both Redshift and Impala, followed by running targeted TPC-DS queries.
From the graph above, for the same workload:
- Redshift with General Purpose Schema was significantly less performant than the other options–by 122-200%.
- Redshift with Fixed Reporting Schema was about the same cost as either of the Impala on S3 setups, but still 27% more costly than Impala on EBS.
From a pure performance standpoint, Impala on EBS was the fastest to complete all the queries serially (at 32 minutes), with Redshift’s General Purpose Schema was the slowest (at over five hours to run the same queries). Redshift on the Fixed Reporting Schema was second place at 42 minutes, followed closely by the 64-node Impala cluster on S3 at 44 minutes and the 32-node Impala cluster on S3 at 80 minutes. (We’ll cover more details about performance and cost-performance in the multi-user section below.)
As you can see from the graphs, working with ad-hoc workloads can be very slow and expensive on Redshift.
ETL + Multi-user Tests
Almost without exception, real-world BI clusters are used by multiple users running concurrent queries. In the multi-user test, we ran the same queries above in four concurrent query streams. When looking at multi-user results there are a two metrics to consider:
- Cost: The cost to complete a given workload start to finish.
- Throughput: The number of queries the system can handle within a specified duration (e.g., queries per minute).
First, from a cost standpoint, you can see from the graphs below, Impala on EBS was the cost-efficiency leader with Redshift taking the two most expensive spots:
- Redshift General Purpose was the most expensive by 275% compared to Impala on EBS, and over 200% more costly compared to either of the Impala-on-S3 clusters.
- Redshift Fixed Reporting was 28% more expensive compared to Impala on EBS, and 8-10% more costly than either of the Impala-on-S3 setups.
For a performance perspective, in the chart below we measured throughput after a fully running system (that is, removed the data load and compute stats time) to measure performance that end-users will observe at runtime.
As you can see, this time, the 64-node Impala on S3 provided the greatest performance, followed by Impala on EBS (the cost-efficiency leader per above) and with Redshift General Purpose demonstrating the lowest throughput by over 10x compared to the 64-node Impala cluster on S3, and by 7.8x compared to Impala on EBS.
Redshift’s best throughput result (with Redshift Fixed Reporting) was worse than either Impala on EBS or the 64-node Impala cluster on S3:
- Impala on EBS was 42% more performant.
- 64-node Impala on S3 was 90% more performant.
These results demonstrate that if you are planning to deploy and run your analytic database in the public cloud and you are considering Impala vs. Redshift, the choice is clear and simple. With support for both read and write on S3, for scaling compute independently of data size and without downtime, and unmatched data agility, Impala simply provides a better cloud-native experience for a number of common use cases compared to Redshift. Impala not only provides these additional key capabilities, it also delivers this superiority at a better cost-performance ratio.
Advancing Impala’s performance as part of Cloudera’s analytic database solution has been a consistent area of focus, with the latest release delivering 12x better performance on secure workloads compared to just two versions ago. You can expect this price-performance advantage to grow in the future, as well as continued cloud-focused enhancements such as support for other public-cloud object stores.