DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workkloads.

Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Why Should Databases Go Natural?
  • SQL Interview Preparation Series: Mastering Questions and Answers Quickly
  • Keep Calm and Column Wise
  • Why SQL Isn’t the Right Fit for Graph Databases

Trending

  • Beyond Linguistics: Real-Time Domain Event Mapping with WebSocket and Spring Boot
  • Streamlining Event Data in Event-Driven Ansible
  • Unlocking AI Coding Assistants Part 1: Real-World Use Cases
  • How to Practice TDD With Kotlin
  1. DZone
  2. Data Engineering
  3. Databases
  4. Scalable Select of Random Rows in SQL

Scalable Select of Random Rows in SQL

Let's explore a scalable select of random rows in SQL.

By 
Pavel Tiunov user avatar
Pavel Tiunov
·
Apr. 22, 19 · Opinion
Likes (1)
Comment
Save
Tweet
Share
8.7K Views

Join the DZone community and get the full member experience.

Join For Free

If you’re new to the big data world and also migrating from tools like Google Analytics or Mixpanel for your web analytics, you probably noticed performance differences. Google Analytics can show you predefined reports in seconds, while the same query for the same data in your data warehouse can take several minutes or even more.

Such performance boosts are achieved by selecting random rows or the sampling technique. We see this technique is used quite frequently during deployments of Cube.js related to massive amounts of data. Let’s learn how to select random rows in SQL.

Sample and Population

Before we start to work on sampling implementation, it is worth mentioning some sampling fundamentals. Sampling is based on a subset selection of individuals from some population to describe this population’s properties. So if you have some event data, you can select a subset of unique users and their events to calculate metrics that describe all users’ behavior. On the other hand, if you select a subset of events, it won’t describe the behavior of the population precisely.

Given this fact, we assume all data sets to which sampling is applied have unique user identifiers. Moreover, if you have both anonymous user identifiers and authorized user identifiers, we assume using the last one as it delivers more accurate results.

Selecting Random Rows in SQL

There are plenty of different methods you can use to select a subset of users. We’ll consider only two of those as the most obvious and simple to implement: simple random sampling and systematic sampling.

Simple random sampling can be implemented as giving a unique number to each user in a range from 0 to N-1 and then selecting X random numbers from 0 to N-1. N denotes the total number of users here and X is the sample size. Although this method is very simple for understanding, it’s a little bit tricky to implement in an SQL environment, mostly because random number generator outputs don’t scale very well when sample sizes are equal to billions. Also, it isn’t very clear as to how to get evenly distributed samples over time.

Bearing this in mind, we’ll use systematic sampling which can overcome these obstacles from an SQL implementation perspective. Simple systematic sampling may be implemented as selecting one user from each M users at a specified interval. In this case, the sample size will be equal to N/M. Selecting one user out of M while preserving uniform distribution across sample buckets is the main challenge of this approach. Let’s see how this can be implemented in SQL.

Sequence Generated User Identifiers

You’re lucky if your user IDs are integers generated as a strict sequence without gaps like those generated by AUTO_INCREMENT primary key fields. In this case, you can implement systematic sampling as simple as:

select * from events where ABS(MOD(user_id, 10)) = 7

This SQL query and all SQL queries below are in Standard BigQuery SQL. In this example, we’re selecting one user out of 10, which is a 10 percent sample. 7 is the random number of the sampling bucket and it can be any number from 0 to 9. We use MOD operation to create sampling buckets which stand for the remainder of a division by 10 in this particular case. It’s really simple to show that if user_id is a strict integer sequence, then user counts are uniformly distributed across all sampling buckets when user count is high enough.

To estimate the event count, for example, you can write something like:

select count(*) * 10 as events_count from events where ABS(MOD(user_id, 10)) = 7

Please note the multiplication by 10 in this query. Given the fact that we use a 10 percent sample, all estimated additive measures should be scaled by 1/10 percent in order to match real values.

One can question how precise this sampling approach is for a specific dataset. You can estimate it by checking how uniform the distribution is within sampling buckets. To do that you can query something like:

select ABS(MOD(user_id, 10)), count(distinct user_id) from events GROUP BY 1

Let’s consider some example results of this query:

0 115617904
1 115639491
2 115635158
3 115631186
4 115614455
5 115636846
6 115644374
7 115652641
8 115648788
9 115654729

We can calculate the mean and confidence interval for an alpha coefficient equal to 0.01 for these sampling bucket sizes. The confidence interval will be equal to 0.01 percent of the sampling bucket average size. This means that with 99 percent probability the sampling bucket sizes differ not more than 0.01 percent. Different metrics calculated with these sampling buckets couple with this statistic but don’t inherit it. So, in order to calculate precision for event count estimation, you can calculate the event count for each sample as:

select ABS(MOD(user_id, 10)), count(*) * 10 from events GROUP BY 1

Then, calculate absolute and relative confidence intervals for these event counts as in the case of user count to get event count estimate precision.

String Identifiers and Other User Identifiers

In the case that you have string user identifiers or integers but not a strict sequence instead of integer sequence identifiers, you need a way to distribute all user IDs between different sampling buckets in a uniform manner. This can be done by hash functions. Not all hash functions can get you uniform distribution under different circumstances. You can check smhasher test suite results to check how good a particular hash function is at this.

For example, in BigQuery you can use the FARM_FINGERPRINT hash function to prepare a sample for select:

select * from events where ABS(MOD(FARM_FINGERPRINT(string_user_id), 10)) = 7

FARM_FINGERPRINT can be replaced with any suitable hash function, such as xxhash64 in Presto or even the combination of md5 and strtol in Redshift.

As in the case of sequence-generated user identifiers, you can check uniformity statistics as:

select ABS(MOD(FARM_FINGERPRINT(string_user_id), 10)), count(distinct string_user_id) from events GROUP BY 1

Common Pitfalls

Most often, sampling can reduce your SQL query execution time by more than 5-10 times without being harmful in terms of precision, but there are still cases where you should be cautious. This mostly belongs to a sampling bias problem because of sample size. It can occur when the dispersion of metrics you’re interested in between samples is too high even if the sample size is big enough.

For example, you can be interested in some rare event count such as an enterprise demo request being some B2C site with a huge amount of traffic. If the enterprise demo request count is about 100 events per month and your monthly active users is about 1M, then sampling can lead you in a situation where the enterprise demo request count estimate with a 10 percent sample produces significant errors. Generally speaking, sampling random rows in SQL should be avoided in this case, or more sophisticated methods should be used instead.

sql Database Relational database

Opinions expressed by DZone contributors are their own.

Related

  • Why Should Databases Go Natural?
  • SQL Interview Preparation Series: Mastering Questions and Answers Quickly
  • Keep Calm and Column Wise
  • Why SQL Isn’t the Right Fit for Graph Databases

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!