Originally Written by Justin Swanhart
There are a lot of tools that generate test data. Many of them have complex XML scripts or GUI interfaces that let you identify characteristics about the data. For testing query performance and many other applications, however, a simple quick and dirty data generator which can be constructed at the MySQL command line is useful.
First, let’s talk about what kind of data you can easily create with MySQL function calls:
You can generate a decimal number between zero and another number using the MySQL RAND() function like the following query (here between 0 and 10000):
SELECT RAND() * 10000;
Similarly, you can generate a random integer by adding the FLOOR() function:
SELECT FLOOR(RAND() * 10000)
You can generate a random string of 32 characters using MD5():
SELECT MD5(RAND() * 10000)
You can return a random integer between 500 and 1000 with the following:
SELECT FLOOR( 500 + RAND() * (1000 - 500))
You can return a random string from a list of strings by using a table to hold the list. A subselect can select a random name from the list of names.
create table names(id int auto_increment primary key, name varchar(20)); insert into names (name) values ('Justin','Jerry','James','Josh','Julien'); select (select name from names where id = 1 + rand() * 4);
Now we can generate a “fact” table with many rows using fairly simple SQL statements.
First create a table to generate data into:
CREATE TABLE fact ( dim1 int, dim2 int, name varchar(20), hash varchar(32), measure1 double );
Seed the table with one initial row:
INSERT INTO fact VALUES (1,1,'Justin',md5(''), .1);
Now grow the table by selecting from the table but providing new random values for the inserted rows:
INSERT INTO fact SELECT FLOOR(1+ rand()*9999), FLOOR(1 + rand()*499), (select name from names where id = 1 + rand() * 4), MD5(1+rand()*9999), rand() FROM fact;
As you repeat the INSERT … SELECT, the table will grow exponentially. You may want to add a LIMIT clause to the INSERT … SELECT to reduce the amount of data generated as the table grows.
You will create a table with an even data distribution for each column. You can then add some queries to add skew, either using INSERT … SELECT or UPDATE, for example:
INSERT INTO fact SELECT 1,1,'Justin',md5(''), .1 FROM fact LIMIT 10000;
That will skew the values by creating many rows with the same data as our initial row.
Using these simple tools, you can generate a data set that is great for testing purposes. For example, dim1 might be a customer_id and dim2 a product_id, and you would populate those tables with 10000 and 500 rows, respectively.