
What Is High Cardinality?

Look into what causes high cardinality and why it's a common problem when dealing with time series data.

By Ajay Kulkarni · Updated Jun. 21, 2022 · Opinion


High cardinality refers to columns whose values are very uncommon or unique. Typical high-cardinality values are identification numbers, email addresses, or user names. A classic example is a USERS table with a column named USER_ID, holding unique values from 1 to n: each time a new user is created in the USERS table, a new number is written to the USER_ID column to identify them uniquely. Since every value held in the USER_ID column is unique, the column is said to have high cardinality.
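
To make that concrete, here's a minimal SQL sketch (the table and column names are hypothetical, and the syntax follows PostgreSQL):

```sql
-- Hypothetical USERS table. user_id gets a new unique value per row,
-- so its cardinality equals the row count: high cardinality.
CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,  -- unique per user: high cardinality
    email   TEXT UNIQUE,         -- also effectively unique
    country TEXT                 -- a few dozen values: low cardinality
);

-- Measure the cardinality of each column directly:
SELECT COUNT(DISTINCT user_id) AS user_id_cardinality,
       COUNT(DISTINCT country) AS country_cardinality
FROM users;
```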

If you are working with a database, especially with time series data, then you have likely faced the challenge of handling high cardinality data.

In particular, time series high cardinality is a common problem in industrial IoT (e.g., manufacturing, oil and gas, utilities, etc.), as well as some monitoring and event data workloads.

High cardinality is also a topic that developers end up discussing a lot, and there are often many questions around it.

To clear up one common point of confusion: high cardinality has only become such a big issue in the time series world because of the limitations of some popular time series databases. In reality, high-cardinality data is a solved problem if you choose the right database.

Let’s back up for a minute and first define high cardinality.

[Image: a pile of colorful M&Ms]


What Is High Cardinality?

Broadly defined, cardinality refers to the number of values in a set. Sometimes the cardinality of your set is small (low cardinality), and other times it can be large (high cardinality). For example, there are quite a few (delicious) M&Ms in our image above, but the cardinality of that dataset is quite small (6):

[Image: the six M&M colors]


In the world of databases, cardinality refers to the number of unique values contained in a particular column, or field, of a database.

However, with time series data, things get a bit more complex.

Time series data tends to be paired with metadata (sometimes called “tags”) that describes that data. Often, that primary time series data or the metadata is indexed for faster query performance so that you can quickly find the values that match all of the specified tags.

The cardinality of a time series dataset is typically defined by the cross-product of the cardinalities of the individual indexed columns. So if there are 6 colors of M&Ms, but also 5 types of M&Ms (plain, peanut, almond, pretzel, and crispy), then our cardinality is now 6 × 5 = 30 total options for M&Ms. Having the right indexes would then allow us to efficiently find all blue, crispy M&Ms (which are objectively the best).

If you have multiple indexed columns, each with a large number of unique values, then the cardinality of that cross product can get really large. That’s what software developers typically mean when they talk about a time series dataset with “high cardinality.”
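
As a sketch of how you might measure that, assuming a hypothetical mms table with one row per M&M and indexed color and type columns:

```sql
-- The dataset's cardinality is the number of distinct combinations
-- of the indexed tag columns; with 6 colors and 5 types it is
-- bounded by 6 * 5 = 30.
SELECT COUNT(*) AS cross_product_cardinality
FROM (SELECT DISTINCT color, type FROM mms) AS combinations;
```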

Let’s look at an example.


High Cardinality Example: Industrial IoT

Imagine an IoT scenario where there are large, heavy pieces of equipment mining rock, crushing rock, and sorting rock in a certain quarry.

Say there are 10,000 pieces of equipment, each with 100 sensors, running 10 different firmware versions, spread across 100 sites:


| timestamp           | temperature | mem_free | equipment_id | sensor_id | firmware_version | site_id | (lat,long) |
|---------------------|-------------|----------|--------------|-----------|------------------|---------|------------|
| 2019-04-04 09:00:00 | 85.2        | 10.2     | 1            | 98        | 1.0              | 4       | (x,y)      |
| 2019-04-04 09:00:00 | 68.8        | 16.0     | 72           | 12        | 1.1              | 20      | (x1,y1)    |
| 2019-04-04 09:00:00 | 100.0       | 0.0      | 34           | 58        | 2.1              | 55      | (x2,y2)    |
| 2019-04-04 09:00:00 | 84.8        | 9.8      | 12           | 75        | 1.4              | 81      | (x3,y3)    |
| 2019-04-04 09:00:00 | 68.7        | 16.0     | 89           | 4         | 2.1              | 13      | (x4,y4)    |
| ...                 |             |          |              |           |                  |         |            |

The maximum cardinality of this dataset then becomes 1 billion (10,000 × 100 × 10 × 100).
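
As a sketch, the table above might be declared like this (the column names follow the headers; the types are assumptions):

```sql
CREATE TABLE sensor_data (
    "timestamp"      TIMESTAMPTZ NOT NULL,
    temperature      DOUBLE PRECISION,
    mem_free         DOUBLE PRECISION,
    equipment_id     INTEGER,  -- 10,000 distinct values
    sensor_id        INTEGER,  -- 100 sensors per piece of equipment
    firmware_version TEXT,     -- 10 distinct values
    site_id          INTEGER,  -- 100 distinct values
    location         POINT     -- (lat, long)
);
-- Max cardinality over the four discrete tag columns:
-- 10,000 * 100 * 10 * 100 = 1,000,000,000 combinations.
```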

Now imagine that the equipment can move as well, and we’d like to store the precise GPS location (lat, long) and use that as indexed metadata to query by. Because (lat, long) is a continuous field (as opposed to a discrete field like equipment_id), by indexing on location, the max cardinality of this dataset is now infinitely large (unbounded).


How a Relational Database Designed for Time Series Would Handle High Cardinality

Different databases take various approaches to handling high cardinality. Ultimately, a database’s performance when working with a high cardinality dataset can be traced back to how it was engineered from the start.

If you are working with a lot of time series data and using a relational database, one proven data structure for indexing data is the B-tree.


Relying on the B-tree data structure for indexing data has several benefits for high cardinality datasets:

  • You can achieve a clear understanding of how the database performs. As long as the indexes and data for the dataset you want to query fit inside memory, which is something that can be tuned, cardinality becomes a non-issue.
  • You have control over which columns to index, including the ability to create compound indexes over multiple columns. You can also add or delete indexes anytime you want, for example, if your query workloads change.
  • You can create indexes on discrete and continuous fields, particularly because B-trees work well for comparisons using any of the following operators: <, <=, =, >=, >, BETWEEN, IN, IS NULL, IS NOT NULL. Queries such as “SELECT * FROM sensor_data WHERE mem_free = 0” and “SELECT * FROM sensor_data WHERE temperature > 90” will run in logarithmic, or O(log n), time, as sketched in the example after this list.
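
Here's a minimal sketch of those points in SQL (PostgreSQL syntax, reusing the sensor_data table from above; CREATE INDEX builds a B-tree by default):

```sql
-- Indexes on single columns, discrete or continuous:
CREATE INDEX idx_mem_free    ON sensor_data (mem_free);
CREATE INDEX idx_temperature ON sensor_data (temperature);

-- A compound index over multiple columns:
CREATE INDEX idx_equipment_sensor ON sensor_data (equipment_id, sensor_id);

-- Both of these can be answered with an O(log n) B-tree lookup:
SELECT * FROM sensor_data WHERE mem_free = 0;
SELECT * FROM sensor_data WHERE temperature > 90;

-- Indexes can be dropped at any time if query workloads change:
DROP INDEX idx_mem_free;
```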

While there are other approaches that time series databases use for high cardinality, using a B-tree structure has proven to be reliable.

If you are having problems with high cardinality data, please feel free to leave a comment below.


Published at DZone with permission of Ajay Kulkarni.

Opinions expressed by DZone contributors are their own.
