Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Implementing Subtotal Aggregations in ClickHouse

DZone's Guide to

Implementing Subtotal Aggregations in ClickHouse

Learn how to implement subtotal aggregations and sub-aggregations in ClickHouse, an open-source database for big data analytics.

Free Resource

Running out of memory? Learn how Redis Enterprise enables large dataset analysis with the highest throughput and lowest latency while reducing costs over 75%! 

ClickHouse is a blazing fast open-source analytical database for big data volumes. It was open-sourced by Yandex in middle of 2016 and immediately became popular in Russia and abroad.

Aggregating the Clicklog

One of the typical appliances for an analytical database is getting a dataset grouped by some dimensions. For example, we have a clicklog of users:

dt ts user_id country action
2017-12-01 2017-12-01 00:00:00 100500 UK view
2017-12-01 2017-12-01 00:03:00 100500 UK click

The common report that we could possibly create from this table asks: How is the user's activity spread by country here? It is pretty simple simple and can be executed on any RDBMS:

SELECT uniq(user_id) as user_count, country
FROM clicklog
WHERE dt BETWEEN '2017-12-01' AND '2017-12-10'
GROUP by country
user_count country
100 UK
200 FR

And what if we would like to get the total for all the countries? In ClickHouse, there is an instruction WITH TOTALS that allows getting the totals for each grouping dimension. The modified query looks like this:

SELECT uniq(user_id) as user_count, country
FROM clicklog
WHERE dt BETWEEN '2017-12-01' AND '2017-12-10'
GROUP by country
WITH TOTALS
user_count country
100 UK
200 FR
300 (empty string)

Diving Into Sub-Aggregations

What if we want to aggregate by one more dimension and get subtotals on each dimension? These aggregations are usually called rollups, cubes, or grouping sets and are supported by different analytical systems (Hive/Presto/MS SQL/Oracle, etc.). Unfortunately, ClickHouse currently does not support these instructions (they are planned to be implemented sometime later). But there are two workarounds that can be applied.

UNION ALL Clause

In general, this case can be applied to any RDBMS. The solution is to union results of different aggregation queries:

-- By two dimensions
SELECT uniq(user_id) as user_count, country, action
FROM clicklog
GROUP BY country, action
-- By country
UNION ALL
SELECT uniq(user_id) as user_count, country, '' as action
FROM clicklog
GROUP BY country, action
-- By action
UNION ALL
SELECT uniq(user_id) as user_count, '' as country, action
FROM clicklog
GROUP BY country, action
-- total
UNION ALL
SELECT uniq(user_id) as user_count, '' as country, '' as action
FROM clicklog
GROUP BY country, action

Looks a bit scary. It will perform four scans over the clicklog table. In ClickHouse, it will cause four parallel queries to the distributed table (a "view" over all of the cluster machines).

FlatMap of the Row

Aggregations in subtotals can be compared with MapReduce. In terms of MR, row operations look like this:

  1. FlatMap one table row in multiple keys ((country, action), (country), (action), ())

  2. Group by each key

In most analytic databases, you can only perform filter and one-to-one transformations and reduces (GROUP BY). But ClickHouse provides the FlatMap operation out-of-the-box!

This operator is called ARRAY JOIN. Basically, it works the following way:

SELECT arrayJoin([1, 2, 3]) AS value
value
1
2
3

This function gives us the possibility to create multiple rows from a single one.

When applied to the user table:

user_id gender country
100500 F UK
SELECT * 
FROM user
ARRAY JOIN ['one','two','three'] as alias

It will give the following resultset:

user_id gender country alias
100500
F
UK one
100500 F UK two
100500 F UK three

Final Step

To get all our sub-aggregations, we need to expose grouping keys in the ARRAY JOIN section:

SELECT 
    uniq(user_id) AS user_count, 
    key
FROM clicklog 
ARRAY JOIN [
  ('', ''), 
  (country, ''), 
  ('', action), 
  (country, action)
] AS key
GROUP BY key
user_count key
100 ['','']
50 ['UK', '']
50 ['FR', '']
50 ['','buy']
50 ['','view']
25 ['UK','buy']
25 ['UK','view']
25 ['FR', 'buy]

And, final polishing to get rid of Tuple(String, String) as a column type:

SELECT 
    user_count, 
    key[1] AS country,
    key[2] AS action
FROM 
(
    SELECT 
        uniq(user_id) AS user_count, 
        key
    FROM clicklog_str 
    ARRAY JOIN [
      ('', ''), 
      (country, ''), 
      ('', action), 
      (country, action)
    ] AS key
    GROUP BY key
) 

Conclusion

We got a solution for subtotals aggregation that performs a single scan over the data. Actually, the ClickHouse team has announced built-in support for grouping sets in future, but we can use it already now!

Running out of memory? Never run out of memory with Redis Enterprise databaseStart your free trial today.

Topics:
clickhouse ,sql ,database ,data analytics ,clicklog ,aggregation ,tutorial

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}