3x Faster Interactive Query With Apache Hive LLAP
3x Faster Interactive Query With Apache Hive LLAP
Learn about the massive step forward Hive has made in SQL compliance, enabling Hive to run all 99 TPC-DS queries with only trivial modifications to the original queries.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
One of the most exciting new features of HDP 2.6 from Hortonworks was the general availability of Apache Hive with LLAP. If you missed DataWorks Summit, you’ll want to look at some of the great LLAP experiences our users shared, including Geisinger who found that Hive LLAP outperforms their traditional EDW for most of their queries and Comcast who found Hive LLAP is faster than Presto for 75% of benchmark queries.
These great results are thanks to performance and stability improvements Hortonworks made to Hive LLAP resulting in 3x faster interactive query in HDP 2.6. This blog dives into the reasons HDP 2.6 is so much faster. We’ll also take a look at the massive step forward Hive has made in SQL compliance with HDP 2.6, enabling Hive to run all 99 TPC-DS queries with only trivial modifications to the original source queries.
3x Performance Gains in HDP 2.6 With Hive LLAP
Let’s start out with a summary of runtimes between Hive LLAP on HDP 2.5 versus HDP 2.6 on an identical nine-node cluster (details at the end of the doc), using queries from the TPC-DS suite as used in previous benchmarks. Because of SQL gaps in HDP 2.5 and older versions, these queries had re-writes versus the source TPC-DS queries, which are no longer needed in HDP 2.6 and are therefore referred to as “legacy queries.”
A few key points:
- HDP 2.6 runs this set of queries in 5155.3 seconds, compared to 14983.4 seconds for HDP 2.5 — almost a 3x improvement for HDP 2.6.
- HDP 2.6 is faster than HDP 2.5 on almost all queries.
- HDP 2.6 is more than 5x faster than HDP 2.5 on eight queries.
Now, let’s dive into the reasons that HDP 2.6 is so much faster than HDP 2.5.
Hive Working Smarter, Not Harder
Hive’s core execution engine and operators are highly optimized and are designed to work end-to-end with the ORCFile columnar format and LLAP’s in-memory cache. Given this level of optimization, the majority of Hive’s performance improvements in HDP 2.6 come from more sophisticated cost-based optimization along with dynamic runtime features that minimize the overall work Hive must do to satisfy a query. You can say that with HDP 2.6, Hive is working smarter and not harder, re-using its own work or avoiding it altogether when possible.
Hive LLAP Supports the Full Suite of 99 TPC-DS Queries
Past Hive benchmarks have focused on a subset of the TPC-DS queries. In large part, this was because older versions of Hive lacked SQL features used by TPC-DS queries. With HDP 2.6, Hive is able to run all 99 TPC-DS queries with only trivial modifications (defined as simple, mechanical rewrites such as changing column names/aliases, adding columns to the select list, and other simple transformations.) For reference, we’ve posted the full set of trivially-modified queries used in the remainder of this blog post in the updated Hortonworks testbench repository.
Comparing Apache Hive to Apache Impala
Apache Impala is a SQL-on-Hadoop engine built specifically for interactive queries. For this benchmark, we compared Hive to Impala using the identical set of 99 trivially modified TPC-DS queries, which can be found in our GitHub repository.
To summarize the results, the aggregate runtime for all queries is similar across the two engines, but Hive is able to run all 99 queries compared to only 60 in Impala.
This graph breaks down a full comparison of all runtimes across the 60 queries, which could be run in both engines.
When we drill into why Impala is only able to run 60 out of the full suite of 99 queries, this is what we find:
These missing features are not obscure SQL features. Instead, they are features offered by almost all commercial SQL products and an ever-growing list of open-source SQL tools like Apache Hive. Some of these limitations can be dealt with on a query-by-query basis (such as lack of a date datatype), but such changes are impractical across an entire SQL suite or workload. Other Impala limitations require extremely difficult workarounds, such as missing support for rollups or intersections. Overall, Hive offers much better workload portability if you are considering offloading workloads from a legacy EDW solution.
Hive LLAP Performance Across All 99 Queries
For reference, this graph shows Hive performance across all 99 trivially-modified TPC-DS queries.
And that's it!
Published at DZone with permission of Nita Dembla , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.