
What Is High Cardinality?


Look into what causes high cardinality and why it's a common problem when dealing with time series data.


If you are working with a database, especially with time series data, then you have likely faced the challenge of handling high cardinality data.

In particular, time series high cardinality is a common problem in industrial IoT (e.g., manufacturing, oil and gas, utilities, etc.), as well as some monitoring and event data workloads.

High cardinality is also a topic that developers end up discussing a lot, and there are often many questions around it.

To clear up one common point of confusion: high cardinality has become such a big issue in the time series world only because of the limitations of some popular time series databases. In reality, high cardinality data is a solved problem, provided you choose the right database.

Let’s back up for a minute and first define high cardinality.

[Image: a pile of M&Ms in six colors]

What Is High Cardinality?

Broadly defined, cardinality refers to the number of values in a set. Sometimes the cardinality of your set is small (low cardinality), and other times it can be large (high cardinality). For example, there are quite a few (delicious) M&Ms in our image above, but the cardinality of that dataset is quite small (6):

[Image: the six distinct M&M colors]

In the world of databases, cardinality refers to the number of unique values contained in a particular column, or field, of a database.
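In SQL, for example, you can measure a column's cardinality with a COUNT(DISTINCT ...) aggregate. A minimal sketch, using a hypothetical mms table that matches the M&M example:

```sql
-- Hypothetical table with one row per M&M.
-- The cardinality of the "color" column is its number of unique values.
SELECT COUNT(DISTINCT color) AS color_cardinality
FROM mms;  -- returns 6 for the dataset pictured above
```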

However, with time series data, things get a bit more complex.

Time series data tends to be paired with metadata (sometimes called “tags”) that describes that data. Often, that primary time series data or the metadata is indexed for faster query performance so that you can quickly find the values that match all of the specified tags.

The cardinality of a time series dataset is typically defined by the cross-product of the cardinality of each individual indexed column. So if there are 6 colors of M&Ms, but also 5 types of M&Ms (plain, peanut, almond, pretzel, and crispy), then our cardinality is now 6 × 5 = 30 total options for M&Ms. Having the right indexes would then allow us to efficiently find all blue, crispy M&Ms (which are objectively the best).

If you have multiple indexed columns, each with a large number of unique values, then the cardinality of that cross product can get really large. That’s what software developers typically mean when they talk about a time series dataset with “high cardinality.”
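As a sketch of the distinction, the maximum cardinality of the cross product is the product of the per-column cardinalities, while the actual cardinality is the number of distinct combinations present in the data (again using the hypothetical mms table):

```sql
-- Maximum cardinality: product of each indexed column's cardinality.
SELECT COUNT(DISTINCT color) * COUNT(DISTINCT type) AS max_cardinality
FROM mms;  -- 6 colors x 5 types = 30

-- Actual cardinality: distinct (color, type) combinations that occur.
SELECT COUNT(*) AS actual_cardinality
FROM (SELECT DISTINCT color, type FROM mms) AS combos;
```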

Let’s look at an example.

High Cardinality Example: Industrial IoT

Imagine an IoT scenario where large, heavy pieces of equipment are mining, crushing, and sorting rock at a quarry.

Say there are 10,000 pieces of equipment, each with 100 sensors, running 10 different firmware versions, spread across 100 sites:

| timestamp           | temperature | mem_free | equipment_id | sensor_id | firmware_version | site_id | (lat,long) |
|---------------------|-------------|----------|--------------|-----------|------------------|---------|------------|
| 2019-04-04 09:00:00 | 85.2        | 10.2     | 1            | 98        | 1.0              | 4       | (x,y)      |
| 2019-04-04 09:00:00 | 68.8        | 16.0     | 72           | 12        | 1.1              | 20      | (x1,y1)    |
| 2019-04-04 09:00:00 | 100.0       | 0.0      | 34           | 58        | 2.1              | 55      | (x2,y2)    |
| 2019-04-04 09:00:00 | 84.8        | 9.8      | 12           | 75        | 1.4              | 81      | (x3,y3)    |
| 2019-04-04 09:00:00 | 68.7        | 16.0     | 89           | 4         | 2.1              | 13      | (x4,y4)    |
| ...                 |             |          |              |           |                  |         |            |

The maximum cardinality of this dataset then becomes 1 billion (10,000 × 100 × 10 × 100).

Now imagine that the equipment can move as well, and we’d like to store the precise GPS location (lat, long) and use that as indexed metadata to query by. Because (lat, long) is a continuous field (as opposed to a discrete field like equipment_id), by indexing on location, the max cardinality of this dataset is now infinitely large (unbounded).
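To make this concrete, here is a minimal sketch of the schema above in a PostgreSQL-style relational database. The table definition is hypothetical; the column names simply match the example table:

```sql
-- Hypothetical schema for the quarry dataset above.
CREATE TABLE sensor_data (
    "timestamp"      TIMESTAMPTZ NOT NULL,
    temperature      DOUBLE PRECISION,
    mem_free         DOUBLE PRECISION,
    equipment_id     INTEGER,          -- 10,000 distinct values
    sensor_id        INTEGER,          -- 100 distinct values
    firmware_version TEXT,             -- 10 distinct values
    site_id          INTEGER,          -- 100 distinct values
    lat              DOUBLE PRECISION, -- continuous: unbounded cardinality
    long             DOUBLE PRECISION  -- continuous: unbounded cardinality
);

-- Max cardinality over the discrete metadata columns:
-- 10,000 * 100 * 10 * 100 = 1 billion.
-- Indexing the continuous (lat, long) columns makes it unbounded.
```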

How a Relational Database Designed for Time Series Would Handle High Cardinality

Different databases take various approaches to handling high cardinality. Ultimately, a database’s performance when working with a high cardinality dataset can be traced back to how it was engineered from the start.

If you are working with a lot of time series data and using a relational database, one proven data structure for indexing data is the B-tree.

Relying on the B-tree data structure for indexing data has several benefits for high cardinality datasets:

  • You can achieve a clear understanding of how the database performs. As long as the indexes and data for the dataset you want to query fit inside memory, which is something that can be tuned, cardinality becomes a non-issue.
  • You have control over which columns to index, including the ability to create compound indexes over multiple columns. You can also add or delete indexes anytime you want, for example, if your query workloads change.
  • You can create indexes on discrete and continuous fields, particularly because B-trees work well for comparisons using any of the following operators: <, <=, =, >=, >, BETWEEN, IN, IS NULL, IS NOT NULL. Queries like "SELECT * FROM sensor_data WHERE mem_free = 0" and "SELECT * FROM sensor_data WHERE temperature > 90" will run in logarithmic, or O(log n), time (see the sketch after this list).
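As a minimal sketch of what this index management might look like in a PostgreSQL-style relational database (the index names, and the hypothetical sensor_data schema from above, are illustrative only):

```sql
-- Single-column index on a discrete field.
CREATE INDEX idx_equipment ON sensor_data (equipment_id);

-- Compound index across multiple metadata columns.
CREATE INDEX idx_equip_sensor_fw
    ON sensor_data (equipment_id, sensor_id, firmware_version);

-- Indexes on continuous fields; B-trees support =, <, >, BETWEEN, etc.
CREATE INDEX idx_mem_free ON sensor_data (mem_free);
CREATE INDEX idx_temperature ON sensor_data (temperature);

-- Indexes can be dropped at any time if the query workload changes.
DROP INDEX idx_equipment;

-- With the indexes in place, these queries run in O(log n) time.
SELECT * FROM sensor_data WHERE mem_free = 0;
SELECT * FROM sensor_data WHERE temperature > 90;
```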

While there are other approaches that time series databases use for high cardinality, using a B-tree structure has proven to be reliable.

If you are having problems with high cardinality data, please feel free to leave a comment below.
