What Is High Cardinality?
Look into what causes high cardinality and why it's a common problem when dealing with time series data.
High cardinality refers to columns whose values are largely or entirely unique, such as identification numbers, email addresses, or user names. For example, a USERS table might have a column named USER_ID containing unique values from 1 to n: each time a new user is created, a new number is assigned in the USER_ID column to identify them uniquely. Because every value held in the USER_ID column is unique, the column is said to have high cardinality.
If you are working with a database, especially with time series data, then you have likely faced the challenge of handling high cardinality data.
In particular, time series high cardinality is a common problem in industrial IoT (e.g., manufacturing, oil and gas, utilities, etc.), as well as some monitoring and event data workloads.
High cardinality is also a topic that developers end up discussing a lot, and there are often many questions around it.
To clear up one common point of confusion: high cardinality has only become such a big issue in the time series world because of the limitations of some popular time series databases. In reality, high cardinality data is actually a solved problem, if one chooses the right database.
Let’s back up for a minute and first define high cardinality.
What Is High Cardinality?
Broadly defined, cardinality refers to the number of values in a set. Sometimes the cardinality of your set is small (low cardinality), and other times it can be large (high cardinality). For example, a bowl might hold quite a few (delicious) M&Ms, but the cardinality of that dataset is quite small: there are only 6 colors.
In the world of databases, cardinality refers to the number of unique values contained in a particular column, or field, of a database.
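As a minimal sketch of this definition (the table and column names here are made up for illustration), counting the distinct values in a column gives its cardinality directly:

```python
import sqlite3

# In-memory table with a high-cardinality column (user_id)
# and a low-cardinality column (status).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, "active" if i % 2 else "inactive") for i in range(1000)],
)

# Cardinality = number of unique values in the column.
high = conn.execute("SELECT COUNT(DISTINCT user_id) FROM users").fetchone()[0]
low = conn.execute("SELECT COUNT(DISTINCT status) FROM users").fetchone()[0]
print(high, low)  # 1000 2
```

The USER_ID column has as many distinct values as there are rows (high cardinality), while the status column has only two (low cardinality).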
However, with time series data, things get a bit more complex.
Time series data tends to be paired with metadata (sometimes called “tags”) that describes that data. Often, that primary time series data or the metadata is indexed for faster query performance so that you can quickly find the values that match all of the specified tags.
The cardinality of a time series dataset is typically defined by the cross-product of the cardinality of each individual indexed column. So if there are 6 colors of M&Ms, but also 5 types of M&Ms (plain, peanut, almond, pretzel, and crispy), then our cardinality is now 6x5 = 30 total options for M&Ms. Having the right indexes would then allow us to efficiently find all blue, crispy M&Ms (which are objectively the best).
If you have multiple indexed columns, each with a large number of unique values, then the cardinality of that cross product can get really large. That’s what software developers typically mean when they talk about a time series dataset with “high cardinality.”
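The cross-product above can be made concrete by enumerating every (color, type) combination; each distinct combination corresponds to one possible series:

```python
from itertools import product

colors = ["red", "orange", "yellow", "green", "blue", "brown"]
kinds = ["plain", "peanut", "almond", "pretzel", "crispy"]

# Each distinct (color, kind) pair is one possible series.
combos = list(product(colors, kinds))
print(len(combos))  # 30
```

With an index over both columns, locating the single ("blue", "crispy") combination among the 30 is cheap; the problem starts when each column has thousands of unique values and the product explodes.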
Let’s look at an example.
High Cardinality Example: Industrial IoT
Imagine an IoT scenario where there are large, heavy pieces of equipment mining rock, crushing rock, and sorting rock in a certain quarry.
Say there are 10,000 pieces of equipment, each with 100 sensors, running 10 different firmware versions, spread across 100 sites.
The maximum cardinality of this dataset then becomes 1 billion [10,000 x 100 x 10 x 100].
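The arithmetic is just the product of the distinct-value counts of each indexed column:

```python
from math import prod

# Distinct-value counts for each indexed column in the quarry example.
equipment = 10_000
sensors_per_machine = 100
firmware_versions = 10
sites = 100

# Worst-case (maximum) series cardinality is their cross-product.
max_cardinality = prod([equipment, sensors_per_machine, firmware_versions, sites])
print(max_cardinality)  # 1000000000
```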
Now imagine that the equipment can move as well, and we’d like to store the precise GPS location (lat, long) and use that as indexed metadata to query by. Because (lat, long) is a continuous field (as opposed to a discrete field like equipment_id), by indexing on location, the max cardinality of this dataset is now infinitely large (unbounded).
How a Relational Database Designed for Time Series Would Handle High Cardinality
Different databases take various approaches to handling high cardinality. Ultimately, a database’s performance when working with a high cardinality dataset can be traced back to how it was engineered from the start.
If you are working with a lot of time series data and using a relational database, one proven data structure for indexing data is the B-tree.
Relying on the B-tree data structure for indexing data has several benefits for high cardinality datasets:
- You can achieve a clear understanding of how the database performs. As long as the indexes and data for the dataset you want to query fit inside memory, which is something that can be tuned, cardinality becomes a non-issue.
- You have control over which columns to index, including the ability to create compound indexes over multiple columns. You can also add or delete indexes anytime you want, for example, if your query workloads change.
- You can create indexes on discrete and continuous fields, particularly because B-trees work well for a comparison using any of the following operators: <, <=, =, >=, >, BETWEEN, IN, IS NULL, IS NOT NULL. Queries such as "SELECT * FROM sensor_data WHERE mem_free = 0" and "SELECT * FROM sensor_data WHERE temperature > 90" will run in logarithmic, or O(log n), time.
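The points above can be sketched with SQLite (whose indexes are B-trees); the table and index names here are illustrative, not from any particular production schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_data ("
    " time TEXT, equipment_id INTEGER, temperature REAL, mem_free INTEGER)"
)

# A compound B-tree index over two columns; it can be removed later
# with DROP INDEX if query workloads change.
conn.execute("CREATE INDEX idx_equip_time ON sensor_data (equipment_id, time)")

# A single-column index supports range predicates like temperature > 90.
conn.execute("CREATE INDEX idx_temp ON sensor_data (temperature)")

# Ask the planner how it would execute the range query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sensor_data WHERE temperature > 90"
).fetchall()
print(plan)  # the plan detail should reference idx_temp
```

Running EXPLAIN QUERY PLAN shows the query being satisfied via a B-tree index search rather than a full table scan, which is what keeps these lookups logarithmic regardless of how many distinct values the indexed columns hold.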
While there are other approaches that time series databases use for high cardinality, using a B-tree structure has proven to be reliable.
If you are having problems with high cardinality data, please feel free to leave a comment below.
Published at DZone with permission of Ajay Kulkarni. See the original article here.