Ultra-Fast OLAP Analytics With Apache Hive and Druid (Part 2)
To help organizations understand more about the benefits of Apache Hive and Druid, this article discusses how to achieve subsecond analytics with Apache Hive and Druid.
Join the DZone community and get the full member experience.Join For Free
This is part 2 of a three-part series on doing ultra fast OLAP analytics with Apache Hive and Druid. Check out part 1 here.
Modern corporations are increasingly looking for near real-time analytics and insights to make actionable decisions. To help organizations understand more about the benefits of Apache Hive and Druid, we'll focus on how you can achieve subsecond analytics with Apache Hive and Druid.
Previously, we talked about the reasons for integrating Hive and Druid, including blazing fast queries, high-dimensional analytics, and support for real-time architectures. Today, we present results of an OLAP benchmark at 1TB scale that show an average query response time of less than one second, as well as truly interactive response times across the entire suite of queries.
SSB 1TB scale with Hive over 10 Druid nodes.
Taking Hive/Druid Out for a Spin: The SSB Benchmark and Queries
The Star-Schema Benchmark, or SSB Benchmark, is “designed to measure the performance of database products in support of classical data warehousing applications.” The SSB is based on TPC-H but differs in a number of ways: particularly in the way the largest tables are pre-joined to facilitate faster analytics. Overall, the SSB is meant to simulate the process of iteratively and interactively querying a data warehouse to play what-if scenarios, to drill down and better understand trends, as opposed to the pre-canned, batch-style reports used by TPC-H. It’s vital that these sorts of queries are answered within a few seconds.
The SSB Queries
The SSB queries are all spelled out in the specification document linked above, but let’s look at one query, Q4.2, to get the flavor of the benchmark:
SELECT d_year, s_nation, p_category, sum(lo_revenue – lo_supplycost) AS profit FROM dates, customer, supplier, part, lineorder WHERE lo_custkey = c_custkey AND lo_suppkey = s_suppkey AND lo_partkey = p_partkey AND lo_orderdate = d_datekey AND c_region = ‘america’ AND s_region = ‘america’ AND ( d_year = 1997 OR d_year = 1998) AND ( p_mfgr = ‘mfgr#1’ OR p_mfgr = ‘mfgr#2’) GROUP BY d_year, s_nation, p_category ORDER BY d_year, s_nation, p_category;
The query involves all tables in the star schema and places fairly tight dimensional filters on each dimension table. The SSB authors model the approximate number of rows each query will access and show that Query 4.2 requires access to less than 0.5% of the total rows, making it a great query to test whether Druid’s indexing delivers fast analytics.
Building the OLAP Index From Hive
Here is the Hive query used to build the OLAP index. This query and more is available on GitHub if you want to reproduce the entire benchmark.
CREATE DATABASE ssb_druid;USE ssb_druid;SET hive.druid.metadata.uri=JDBC:MYSQL://db.example.com/druid_benchmark;SET hive.druid.indexer.partition.size.max=9000000;SET hive.druid.indexer.memory.rownum.max=100000;SET hive.tez.container.size=16000;SET hive.tez.java.opts=-xmx10g -XX:maxdirectmemorysize=1024g -duser.timezone=”america/new_york”;SET hive.llap.execution.mode=none;CREATE TABLE ssb_druid_month stored BY ‘org.apache.hadoop.hive.druid.druidstoragehandler’ tblproperties ( “druid.datasource” = “ssb_druid_day”, “druid.segment.granularity” = “month”, “druid.query.granularity” = “day” ) AS SELECT cast(d_year || ‘-‘ || d_monthnuminyear || ‘-‘ || d_daynuminmonth AS timestamp) AS `__time`, cast(c_city AS string) c_city, cast(c_nation AS string) c_nation, cast(c_region AS string) c_region, cast(d_weeknuminyear AS string) d_weeknuminyear, cast(d_year AS string) d_year, cast(d_yearmonth AS string) d_yearmonth, cast(d_yearmonthnum AS string) d_yearmonthnum, cast(lo_discount AS string) lo_discount, cast(lo_quantity AS string) lo_quantity, cast(p_brand1 AS string) p_brand1, cast(p_category AS string) p_category, cast(p_mfgr
There are some important things to call out here:
- The segment granularity controls the granularity of physical partitions by time in Druid. Segments are atomic units in Druid. If you need to update data in Druid (for example, if you had bad upstream data), you will need to re-build the entire segment, so there is a trade-off between how many segments you have and the speed at which you can build a segment.
- The query granularity controls how aggressively to pre-aggregate data in Druid. Pre-aggregation is key to boosting performance. On the other hand, a time series query can never go finer-grained than the query granularity, so a balance needs to be struck between performance and flexibility. In our example, we use daily query granularity, meaning it would be impossible to query a sum of values during the hours of 5 to 7 PM.
- String values are automatically treated as dimensions and numeric values are automatically treated as measures. Aggregates like sum, min, max, etc. are automatically made for all measures. Building aggregates from Hive is extremely simple and lets you skip a lot of the routine work you often need to do in other tools.
Architecting the OLAP Index
How did we arrive at the query used to build the OLAP index? There is a systematic procedure:
- The union of all dimensions used by the SSB queries is included in the index.
- The union of all measures is included in the index. Notice that we pre-compute some products in the index.
- Druid requires a timestamp, so the date of the transaction is used as the timestamp.
You can see that building the index requires knowledge of the query patterns. Either an expert in the query patterns architects the index, or a tool is needed to analyze queries or to dynamically build indexes on the fly. A lot of time can be spent in this architecture phase gathering requirements, designing measures, and so on because changing your mind after the fact can be very difficult.
One important difference between reporting queries and BI queries is that reporting queries usually sort by some specific measure because they’re going to end up on a static medium like an email or a piece of paper. BI tools generally handle sorting on the client side where it is dynamic and flexible and generally don’t rely on the database to sort.
To better simulate the BI experience, a user would get from a BI tool like Tableau. We eliminated the
ORDER BY clauses from the 10/13 SSB queries that include them.
One other cosmetic adjustment was made to convert “between” predicates into two separate inequality predicates. Automatic handling of “between” clauses is an in-progress work item.
Let’s take a look at the re-written Q4.2 that targets our OLAP index:
SELECT d_year, s_nation, p_category, Sum(net_revenue) AS profit FROM ssb_druid WHERE c_region = ‘america’ AND s_region = ‘america’ AND ( d_year = ‘1997’ OR d_year = ‘1998’ ) AND ( p_mfgr = ‘mfgr#1’ OR p_mfgr = ‘mfgr#2’ ) GROUP BY d_year, s_nation, p_category;
The changes here are that we remove the ORDER BY and issue the query to the denormalized OLAP index directly. All the dimension filters remain the same. Although we issue this query to Hive, it is answered entirely from the Druid layer. The full text of the Hive SQL queries used for this test is available on GitHub.
Performance Results at Scale 1000
To benchmark the Hive/Druid integration, the SSB queries were run via JDBC through HiveServer2 and backed by Druid. We see interactive response times for all queries; the average runtime was 960 milliseconds with a minimum runtime of 481 milliseconds and a maximum of 2,700 milliseconds.
We encourage you to compare these results against independent numbers produced by AtScale in their document “The Business Intelligence for Hadoop Benchmark,” which also uses SSB at scale 1000.
For additional reference, here are the specifics of the cluster where these numbers were generated:
- 10 nodes
- 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz with 16 CPU threads each
- 256 GB RAM per node
- 6x WDC WD4000FYYZ-0 1K02 4TB SCSI disks per node
The GitHub repo also contains some additional tuning notes with detailed Java command line arguments.
It’s worth mentioning that the actual cube size on disk is around 200 GB, in large part due to pre-aggregation. The workload operates in-memory but could operate with a much smaller memory footprint.
Published at DZone with permission of Carter Shanklin, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Decoding ChatGPT: The Concerns We All Should Be Aware Of
Extending Java APIs: Add Missing Features Without the Hassle
A Complete Guide to Agile Software Development
Azure Virtual Machines