
SQL Test Data and Non-Evenly Distributed Randoms




Need to generate test data in your SQL database? The team over at Periscope has recently published a couple of blog posts reminding us that a uniform random distribution is not always the most useful choice.

As pointed out in their first post on the matter, Beyond Random() — Normal Distributions in SQL, uniform distributions rarely resemble real data. A more realistic choice is the normal distribution, for which the folks at Periscope recommend the Marsaglia Polar Method, which "converts a pair of uniformly distributed random numbers into a pair of normally distributed random numbers." In the post, they show how to use SQL to produce random numbers with generate_series and feed them into the Marsaglia formulas.
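
The method itself is short enough to sketch outside of SQL. The following is a minimal Python illustration of the Marsaglia polar method, not the SQL from Periscope's post; the function names `marsaglia_polar_pair` and `normal_samples`, and the `mean`, `stddev`, and `seed` parameters, are assumptions made for this example.

```python
import math
import random


def marsaglia_polar_pair(rng):
    """Turn uniform random numbers into two independent standard-normal values."""
    while True:
        # Draw a point uniformly from the square [-1, 1] x [-1, 1].
        u = rng.uniform(-1.0, 1.0)
        v = rng.uniform(-1.0, 1.0)
        s = u * u + v * v
        # Reject points outside the unit circle, and the origin itself.
        if 0.0 < s < 1.0:
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return u * factor, v * factor


def normal_samples(count, mean=0.0, stddev=1.0, seed=None):
    """Return `count` normally distributed values (hypothetical helper)."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < count:
        z0, z1 = marsaglia_polar_pair(rng)
        # Scale and shift the standard-normal pair to the requested parameters.
        samples.extend((mean + stddev * z0, mean + stddev * z1))
    return samples[:count]
```

Calling `normal_samples(1000, mean=10, stddev=2)` would give 1,000 values clustered around 10, which could then be written out as rows of test data.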


This formula creates a Gaussian bell curve; the original post includes a chart of it (credit: Periscope.io).

In a subsequent blog post, Periscope goes over another distribution type: the Poisson Distribution. They explain the Poisson distribution like this:

Let's say you typically sell 5 widgets per day. How likely is it that you'll sell 5 widgets tomorrow? What about between 4 and 6 widgets tomorrow? Obviously we can't just guess randomly. And the normal distribution won't help either.

Fortunately, this is what the Poisson Distribution is for. Its formula is: P(k) = (R^k × e^(-R)) / k!

Our Poisson Distribution formula takes 3 inputs:

  • R: Our known rate, in this case 5.
  • e: Euler's Number, 2.71828.
  • k: the number of widgets sold tomorrow whose probability we want.
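
To make the three inputs concrete, here is a small Python sketch that evaluates the standard Poisson probability, P(k) = R^k × e^(-R) / k!, for the widget example. It is an illustration rather than the SQL from Periscope's post, and the names `poisson_pmf` and `poisson_between` are assumptions made for this example.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events when the average is `rate` per interval."""
    if k < 0:
        return 0.0
    return rate ** k * math.exp(-rate) / math.factorial(k)


def poisson_between(low, high, rate):
    """Probability of seeing between low and high events, inclusive."""
    return sum(poisson_pmf(k, rate) for k in range(low, high + 1))


if __name__ == "__main__":
    # Chance of selling exactly 5 widgets tomorrow, given a typical 5 per day.
    print(round(poisson_pmf(5, 5), 4))
    # Chance of selling between 4 and 6 widgets tomorrow.
    print(round(poisson_between(4, 6, 5), 4))
```

With a rate of 5, the chance of selling exactly 5 widgets comes out at roughly 17.5%, and the chance of selling between 4 and 6 at about 49.7%.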

This creates a distribution that is charted in the original post (credit: Periscope.io).

Periscope's blog entries both give specific details on using these distributions for test data in SQL. It's worth a look; you can check out their full blog at https://periscope.io/blog.



