Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Apache Impala in CDH 5.7: 4x Faster for BI Workloads on Apache Hadoop

DZone's Guide to

Apache Impala in CDH 5.7: 4x Faster for BI Workloads on Apache Hadoop

Impala 2.5, now shipping in CDH 5.7, brings significant performance improvements and some highly requested features.

· Big Data Zone
Free Resource

Learn how you can maximize big data in the cloud with Apache Hadoop. Download this eBook now. Brought to you in partnership with Hortonworks.

Impala has proven to be a high-performance analytics query engine since the beginning. Even as an initial production release in 2013, it demonstrated performance 2x faster than a traditional DBMS, and each subsequent release has continued to demonstrate the wide performance gap between Impala’s analytic-database architecture and SQL-on-Apache Hadoop alternatives. Today, we are excited to continue that track record via some important performance gains for Impala 2.5 (with more to come on the roadmap), summarized below.

Overall, compared to Impala 2.3, in Impala 2.5:

  • TPC-DS queries run on average 4.3x faster.
  • TPC-H queries run 2.2x faster on flat tables, and 1.71x faster on nested tables.

impala25-f1

Figure 1: Performance comparison of Impala 2.5 vs Impala 2.3 on TPC-DS and TPC-H. (See more details about workloads in Appendix.)

Next, let’s review some of the new performance enhancements, as well as some other significant features, in detail. (For a complete list, check out the release notes here.)

Runtime Filters and Dynamic Partition Pruning

One of the important strategies in SQL query optimization is to find ways to reduce the number of rows in leaf nodes in the execution-plan tree. Runtime filters in Impala 2.5 do exactly that by computing filters during query execution and propagating them from “upstream” to “downstream” operators. Doing that reduces the amount of work performed not only by the scanner but also by upstream operators. When applied on a partition column, runtime filters automatically prune partitions (this special type of runtime filtering is called dynamic partition pruning) before Impala even starts scanning the tuples of a partitioned table, significantly reducing the I/O cost of executing a query.

impala25-f2Figure 2: With runtime filters, Impala 2.5 is on average 2.2x faster than 2.3 on TPC-H benchmark queries (on a 1TB dataset), with TPC-H Q17 running 23x faster.

Learn more about runtime filters and dynamic partition pruning in the docs.

Faster Query Startup

In previous releases, when queries started execution, Impala would start individual fragments one “level” of the plan tree at a time to ensure that receivers of data were always ready when the senders started. This approach led to a long start-up delay, particularly for complex queries with many fragments (or in large clusters). In Impala 2.5, instead of starting fragments in wave after wave, the query start-up logic allows fragments to be started in any order, thereby increasing parallelism and reducing query start-up latencies.

Figure 3 shows the resulting performance boost in query-startup times, which often translate to a speedup in overall query execution.

impala25-f3Figure 3: Query start-up time in Impala 2.5 is significantly faster than in Impala 2.3.

Improved Cardinality Estimates and Join Order

Impala 2.5 brings a more robust scan cardinality estimation by mitigating correlated predicates with an exponential back-off. Furthermore, join cardinality estimation is also improved. These improvements, along with a better join order selection (broadcast vs. shuffle), deliver up to 8x gains for complex analytical queries like TPC-H Q8.

LLVM Codegen Coverage

Impala 2.5 adds LLVM codegen support for SORT and Top-N operations as well as arithmetic operations on DECIMALs. As both SORT and Top-N are commonly used functions by many third-party BI tools with which Impala integrates, these improvements in codegen will be readily noticeable by end-users.

impala25-f4

Figure 4: With codegen for Top-N, Impala 2.5 is 1.72x more efficient on this benchmark query.

Learn more about LLVM codegen support in Impala here.

Catalog Improvements: Incremental Metadata Update

Impala 2.5 brings several catalog improvements that allow incremental update of table metadata instead of force-reloading all table metadata during DDL/DML operations. By reloading metadata of only “dirty” partitions and reusing descriptors of HDFS files to avoid loading file/block metadata for files that haven’t been modified, Impala 2.5 significantly reduces the latency of DDL/DML operations (by up to 4x).

impala25-f5Figure 5: Impala 2.5 is on average 4x faster on catalog-update stress tests than 2.3.

Improvements to Admission Control

Since version 1.4, Impala has provided admission control capabilities to let users throttle workloads to avoid oversubscription and to maximize throughput based on number of concurrent queries. Impala 2.5 supports memory-based admission control that can be configured and monitored directly from Cloudera Manager; read more about managing the Admission Control feature in the docs.

Improved Scalability and Reliability

One of the top priorities for Cloudera Engineering over the past year has been to provide even better scalability and reliability. Impala 2.5 is huge step forward in that direction. Impala 2.5 went through rigorous stress, scale, fault-injection, and performance testing using a new in-house query generator specifically built for Impala.

Distributed Aggregations

Aggregations in Impala happens in two phases: pre-aggregation and merge. The pre-aggregation phase greatly reduces network traffic if there are many input rows per grouping value, but sometimes they are not effective and can spill to disk under memory pressure. Impala 2.5 improves aggregations by deciding at run time whether it is more efficient to do an initial aggregation phase and pass along a smaller set of intermediate data, or to pass raw intermediate data back to next phase of query processing to be aggregated there. Doing so, Impala 2.5 speeds up certain aggregations of up to 25% and reduces memory consumption up to 50% or more. In addition, streaming pre-aggregations don’t spill to disk.

impala25-f6Figure 6: In Impala 2.5, some aggregations are up to 25% faster.

DECIMAL Arithmetic Improvements

Starting in Impala 1.4, the DECIMAL data type lets you store fixed-precision values, which are commonly used for working with currency values where it is important to represent values exactly and avoid rounding errors. Impala 2.5 speeds up aggregations involving DECIMAL fields by automatically triggering native code generation.

As shown in Figure 7, on the benchmark query Impala 2.5 runs at least 3x faster on aggregations involving DECIMAL compared to Impala 2.3. Impala 2.5 bridges the performance gap between DECIMAL and FLOAT/DOUBLE, letting you run high-precision operations much faster.

impala25-f7 Figure 7: Impala 2.5 runs at least 3x faster on DECIMAL-related aggregations compared to 2.3.

Learn more about this feature in the docs.

Metadata-only Queries

Impala 2.5 speeds up min(), max(), ndv(), and aggregate functions with distinct keywords that involve only the partition-key columns of partitioned tables by using metadata to avoid table accesses for partition-key scans. These functions are commonly used by third-party BI tools, so most end-users will experience a noticeable improvement in performance.

impala25-f8Figure 8: Performance of metadata-only queries is significantly improved in Impala 2.5.

Conclusion

Performance has always been a top priority for the Impala team because interactive query latency is so important for BI and analytics users. Less than a year ago, the team outlined its roadmap for 2015-2016 and these performance improvements reflect a huge step forward on that plan. That step is just the first of several more this year, however: with several other performance optimizations (such as multi-core joins and aggregations) on the community’s roadmap, the goal of 20x-plus performance gains for Impala is in view.

If you’re interested in contributing to Impala to help bring this vision to fruition, those contributions are very welcome!

Devadutta Ghat is a Senior Product Manager at Cloudera.

Marcel Kornacker is a Tech Lead at Cloudera and the founder of Impala.

Mostafa Mokhtar is a Software Engineer at Cloudera.

Henry Robinson is a Software Engineer at Cloudera.

Appendix

TPC-DS and TPC-H benchmark details:

  • 24 unmodified TPC-DS queries {3, 7, 8, 19, 25, 27, 34, 42, 43, 46, 47, 52, 53, 55, 59, 61, 63, 68, 73, 79, 88, 89, 96, 98} on 15TB dataset
  • 22 TPC-H queries on 15TB dataset

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

Topics:
execution ,bi ,query ,runtime ,filters ,impala ,tools ,optimization ,logic ,time

Published at DZone with permission of Justin Kestelyn. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}