Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Nested Data Structures in ClickHouse

DZone's Guide to

Nested Data Structures in ClickHouse

Sometimes, you probably want to store unstructured information in structured database. Don't know how to do this? Never fear—nested data structures to the rescue!

· Database Zone
Free Resource

Find out how Database DevOps helps your team deliver value quicker while keeping your data safe and your organization compliant. Align DevOps for your applications with DevOps for your SQL Server databases to discover the advantages of true Database DevOps, brought to you in partnership with Redgate

In this blog post, we’ll look at nested data structures in ClickHouse for MySQL and how they can be used with PMM to look at queries.

Nested structures are not common in Relational Database Management Systems. Usually, it’s just flat tables. Sometimes, it would be convenient to store unstructured information in structured databases.

We are working to adapt ClickHouse as a long term storage for Percona Monitoring and Management (PMM) and particularly to store detailed information about queries. One of the problems we are trying to solve is to count the different errors that cause a particular query to fail.

For example, for date 2017-08-17, the query:

"SELECT foo FROM bar WHERE id=?"

...was executed 1,000 times. 25 times, it failed with error code 1212 and eight times, it failed with error code 1250. Of course, the traditional way to store this in relational data would be to have a table "Date, QueryID, ErrorCode, ErrorCnt" and then perform a JOIN to this table. Unfortunately, columnar databases don’t perform well with multiple joins, and often the recommendation is to have de-normalized tables.

We can create a column for each possible ErrorCode, but this is not an optimal solution. There could be thousands of them, and most of the time, they would be empty.

In this case, ClickHouse proposes Nested data structures. For our case, these can be defined as:

CREATE TABLE queries
(
    Period Date,
    QueryID UInt32,
    Fingerprint String,
    Errors Nested
    (
        ErrorCode String,
        ErrorCnt UInt32
    )
)Engine=MergeTree(Period,QueryID,8192);

This solution has obvious questions: How do we insert data into this table? How do we extract it?

Let’s start with INSERT. This can look like:

INSERT INTO queries VALUES ('2017-08-17',5,'SELECT foo FROM bar WHERE id=?',['1220','1230','1212'],[5,6,2])

...which means that the inserted query during 2017-08-17 gave error 1220 five times, error 1230 six times and error 1212 two times.

Now, during a different date, it might produce different errors:

INSERT INTO queries VALUES ('2017-08-18',5,'SELECT foo FROM bar WHERE id=?',['1220','1240','1258'],[3,2,1])

Let’s take a look at ways to SELECT data. A very basic SELECT:

SELECT *
FROM queries
┌─────Period─┬─QueryID─┬─Fingerprint─┬─Errors.ErrorCode───────┬─Errors.ErrorCnt─┐
│ 2017-08-17 │       5 │ SELECT foo  │ ['1220','1230','1212'] │ [5,6,2]         │
│ 2017-08-18 │       5 │ SELECT foo  │ ['1220','1240','1260'] │ [3,16,12]       │
└────────────┴─────────┴─────────────┴────────────────────────┴─────────────────┘

If we want to use a more familiar tabular output, we can use the ARRAY JOIN extension:

SELECT *
FROM queries
ARRAY JOIN Errors
┌─────Period─┬─QueryID─┬─Fingerprint─┬─Errors.ErrorCode─┬─Errors.ErrorCnt─┐
│ 2017-08-17 │       5 │ SELECT foo  │ 1220             │            5    │
│ 2017-08-17 │       5 │ SELECT foo  │ 1230             │            6    │
│ 2017-08-17 │       5 │ SELECT foo  │ 1212             │            2    │
│ 2017-08-18 │       5 │ SELECT foo  │ 1220             │            3    │
│ 2017-08-18 │       5 │ SELECT foo  │ 1240             │           16    │
│ 2017-08-18 │       5 │ SELECT foo  │ 1260             │           12    │
└────────────┴─────────┴─────────────┴──────────────────┴─────────────────┘

However, usually we want to see the aggregation over multiple periods, which can be done with traditional aggregation functions:

SELECT
    QueryID,
    Errors.ErrorCode,
    SUM(Errors.ErrorCnt)
FROM queries
ARRAY JOIN Errors
GROUP BY
    QueryID,
    Errors.ErrorCode
┌─QueryID─┬─Errors.ErrorCode─┬─SUM(Errors.ErrorCnt)─┐
│       5 │ 1212             │                 2    │
│       5 │ 1230             │                 6    │
│       5 │ 1260             │                12    │
│       5 │ 1240             │                16    │
│       5 │ 1220             │                 8    │
└─────────┴──────────────────┴──────────────────────┘

If we want to get really creative and return only one row per QueryID, we can do that, as well:

SELECT
    QueryID,
    groupArray((ecode, cnt))
FROM
(
    SELECT
        QueryID,
        ecode,
        sum(ecnt) AS cnt
    FROM queries
    ARRAY JOIN
        Errors.ErrorCode AS ecode,
        Errors.ErrorCnt AS ecnt
    GROUP BY
        QueryID,
        ecode
)
GROUP BY QueryID
┌─QueryID─┬─groupArray(tuple(ecode, cnt))──────────────────────────────┐
│       5 │ [('1230',6),('1212',2),('1260',12),('1220',8),('1240',16)] │
└─────────┴────────────────────────────────────────────────────────────┘

Conclusion

ClickHouse provides flexible ways to store data in a less structured manner and variety of functions to extract and aggregate it — despite being a columnar database.

Happy data warehousing!

Align DevOps for your applications with DevOps for your SQL Server databases to increase speed of delivery and keep data safe. Discover true Database DevOps, brought to you in partnership with Redgate

Topics:
nested data ,database ,data warehousing ,tutorial ,clickhouse ,rdbms

Published at DZone with permission of Vadim Tkachenko, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}