DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Because the DevOps movement has redefined engineering responsibilities, SREs now have to become stewards of observability strategy.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Related

  • A Beginner's Guide to Machine Learning: What Aspiring Data Scientists Should Know
  • Learning AI/ML: The Hard Way
  • Python Packages for Data Science
  • The Art of Separation in Data Science: Balancing Ego and Objectivity

Trending

  • The Ultimate Guide to Code Formatting: Prettier vs ESLint vs Biome
  • Introduction to Retrieval Augmented Generation (RAG)
  • Intro to RAG: Foundations of Retrieval Augmented Generation, Part 1
  • How to Format Articles for DZone
  1. DZone
  2. Data Engineering
  3. AI/ML
  4. Scikit-Learn and More for Synthetic Dataset Generation for Machine Learning

Scikit-Learn and More for Synthetic Dataset Generation for Machine Learning

Why do you need a synthetic dataset and what are the essential features of a synthetic dataset for machine learning?

By 
Kevin Vu user avatar
Kevin Vu
·
Aug. 30, 19 · Analysis
Likes (4)
Comment
Save
Tweet
Share
13.8K Views

Join the DZone community and get the full member experience.

Join For Free

Image title

Synthetic dataset generation for machine learning

Synthetic Dataset Generation Using Scikit-Learn and More

It is becoming increasingly clear that the big tech giants such as Google, Facebook, and Microsoft are extremely generous with their latest machine learning algorithms and packages (they give those away freely) because the entry barrier to the world of algorithms is pretty low right now.

The open source community and tools (such as scikit-earn) have come a long way, and plenty of open source initiatives are propelling the vehicles of data science, digital analytics, and machine learning. Standing in 2019, we can safely say that algorithms, programming frameworks, and machine learning packages (or even tutorials and courses how to learn these techniques) are not the scarce resource but high-quality data is.

This often becomes a thorny issue on the side of the practitioners in data science (DS) and machine learning (ML) when it comes to tweaking and fine-tuning those algorithms. It will also be wise to point out, at the very beginning, that the current article pertains to the scarcity of data for algorithmic investigation, pedagogical learning, and model prototyping. It's not for scaling and running a commercial operation.

It is not a discussion about how to get quality data for the cool travel or fashion app you are working on. That kind of consumer, social, or behavioral data collection presents its own issues. However, even something as simple as having access to quality datasets for testing out the limitations and vagaries of a particular algorithmic method, often turns out, not so simple.

Why Do You Need a Synthetic Dataset?

If you are learning from scratch, the most sound advice would be to start with simple, small-scale datasets that you can plot in two dimensions to understand the patterns visually and see for yourself the working of the ML algorithm in an intuitive fashion.

As the dimensions of the data explode, however, the visual judgment must extend to more complicated matters — concepts like learning and sample complexity, computational efficiency, class imbalance, etc.

At this point, the trade-off between experimental flexibility and the nature of the dataset comes into play. You can always find yourself a real-life large dataset to practice the algorithm on. But that is still a fixed dataset, with a fixed number of samples, a fixed underlying pattern, and a fixed degree of class separation between positive and negative samples. You must also investigate:

  • How the chosen fraction of test and train data affects the algorithm’s performance and robustness
  • How robust the metrics are in the face of varying degree of class imbalance
  • What kind of bias-variance trade-offs must be made
  • How the algorithm performs under various noise signature in the training as well as test data (i.e. noise in the label as well as in the feature set)
  • How do you experiment and tease out the weakness of your ML algorithm?

Turns out that these are quite difficult to do with a single real-life dataset; therefore, you must be willing to work with synthetic data that is random enough to capture all the vagaries of a real-life dataset but controllable enough to help you scientifically investigate the strength and weakness of the particular ML pipeline you are building.

Although we won’t discuss the matter in this article, the potential benefit of such synthetic datasets can easily be gauged for sensitive applications — medical classifications or financial modeling, where getting hands on a high-quality labeled dataset is often expensive and prohibitive.

Essential Features of a Synthetic Dataset for ML

It is understood, at this point, that a synthetic dataset is generated programmatically and not sourced from any kind of social or scientific experiment, business transactional data, sensor reading, or manual labeling of images. However, such datasets are definitely not completely random, and the generation and usage of synthetic data for ML must be guided by some overarching needs. In particular,

  • It can be numeric, binary, or categorical (ordinal or non-ordinal), and the number of features and length of the dataset could be arbitrary
  • There must be some degree of randomness to it, but at the same time, the user should be able to choose a wide variety of statistical distribution to base this data upon i.e. the underlying random process can be precisely controlled and tuned
  • If it is used for classification algorithms, then the degree of class separation should be controllable to make the learning problem easy or hard
  • Random noise can be interjected in a controllable manner
  • Speed of generation should be quite high to enable experimentation with a large variety of such datasets for any particular ML algorithms i.e. if the synthetic data is based on data augmentation on a real-life dataset, then the augmentation algorithm must be computationally efficient
  • For a regression problem, a complex, non-linear generative process can be used for sourcing the data – real physics models may come to aid in this endeavor

In the next section, we will show how you are able to generate suitable datasets using some of the most popular ML libraries and programmatic techniques.

Standard Regression, Classification, and Clustering Dataset Generation Using Scikit Learn and Numpy

Scikit-learn is the most popular ML library in the Python-based software stack for data science. Apart from the well-optimized ML routines and pipeline building methods, it also boasts of a solid collection of utility methods for synthetic data generation.

Regression With Scikit Learn

Scikit-learn’s dataset.make_regression function can create random regression problems with an arbitrary number of input features, output targets, and controllable degree of informative coupling between them.

three regression graphs, displayed horizontally, with greater regression to the right.

Classification With Scikit-Learn

Similar to the regression function above, dataset.make_classification generates a random multi-class classification problem with controllable class separation and added noise. You can also randomly flip any percentage of output signs to create a harder classification dataset if you want.

Three classification graphs, displayed horizontally in a line.

Clustering With Scikit-Learn

A variety of clustering problems can be generated by scikit-learn utility functions. The most straightforward is to use the datasets.make_blobs, which generates an arbitrary number of clusters with controllable distance parameters.

three clustering graphs, displaying horizontally, and generated by Scikit-learn utility functions

For testing an affinity-based clustering algorithm or Gaussian mixture models, it is useful to have clusters generated in a special shape. We can use the datasets.make_circles function to accomplish that.

circular graphcircular graph to test affinity algorithms

For testing non-linear kernel methods with support vector machine (SVM) algorithms, nearest-neighbor methods like k-NN or even testing out a simple neural network is often advisable to experiment with certain shaped data. We can generate such data using the dataset.make_moon function with controllable noise.

graph to test non-linear kernel methods

Gaussian Mixture Model With Scikit-Learn

Gaussian mixture models (GMM) are fascinating objects to study for unsupervised learning and topic modeling in the text processing/NLP tasks. Here is an illustration of a simple function to show how easy it is to generate synthetic data for such a model:

  1. import numpy as np
  2. import matplotlib.pyplot as plt
  3. import random
  4. def gen_GMM(N=1000,n_comp=3, mu=[-1,0,1],sigma=[1,1,1],mult=[1,1,1]):
  5. """
  6. Generates a Gaussian mixture model data, from a given list of Gaussian components
  7. N: Number of total samples (data points)
  8. n_comp: Number of Gaussian components
  9. mu: List of mean values of the Gaussian components
  10. sigma: List of sigma (std. dev) values of the Gaussian components
  11. mult: (Optional) list of multiplier for the Gaussian components
  12. ) """
  13. assert n_comp == len(mu), "The length of the list of mean values does not match number of Gaussian components"
  14. assert n_comp == len(sigma), "The length of the list of sigma values does not match number of Gaussian components"
  15. assert n_comp == len(mult), "The length of the list of multiplier values does not match number of Gaussian components"
  16. rand_samples = []
  17. for i in range(N):
  18. pivot = random.uniform(0,n_comp)
  19. j = int(pivot)
  20. rand_samples.append(mult[j]*random.gauss(mu[j],sigma[j]))
  21. return np.array(rand_samples)

Beyond Scikit-Learn: Synthetic Data From Symbolic Input

While the aforementioned functions may be sufficient for many problems, the data generated is truly random, and the user has less control over the actual mechanics of the generation process. In many situations, one may require a controllable way to generate regression or classification problems based on a well-defined analytical function (involving linear, nonlinear, rational, or even transcendental terms).

The following article shows how you can combine the symbolic mathematics package SymPy and functions from SciPy to generate synthetic regression and classification problems from given symbolic expressions.

Random regression and classification problem generation with symbolic expression

Regression dataset generated from a given symbolic expression.

Regression dataset generated from a given symbolic expression.

Classification dataset generated from a given symbolic expression.

Classification dataset generated from a given symbolic expression.

Image Data Augmentation Using Scikit-Image

Deep learning systems and algorithms are voracious consumers of data. However, to test the limitations and robustness of a deep learning algorithm, one often needs to feed the algorithm with subtle variations of similar images. Scikit-image is an amazing image processing library built on the same design principle and API pattern as that of scikit-learn, offering hundreds of cool functions to accomplish this image data augmentation task.

We show some chosen examples of this augmentation process, starting with a single image and creating tens of variations on the same to effectively multiply the dataset manyfold and create a synthetic dataset of gigantic size to train deep learning models in a robust manner.


Hue, Saturation, Value Channels

four images, one showing an RGB image, and the others comparing changes applied to the image of hue, saturation, and value, respectively.

Cropping

Series of 6 images showing variations in cropping.

Random Noise

Series of 6 images showing 5 variations of image noise generation.

Rotation

A series of 6 images showing differences in image rotation

Swirl

series of four images, with variations in image swirling effects

Random Image Synthesizer With Segmentation

NVIDIA offers a UE4 plugin called NDDS to empower computer vision researchers to export high-quality synthetic images with metadata. It supports images, segmentation, depth, object pose, bounding box, keypoints, and custom stencils.

In addition to the exporter, the plugin includes various components enabling generation of randomized images for data augmentation and object detection algorithm training. The randomization utilities includes lighting, objects, camera position, poses, textures, and distractors. Together, these components allow deep learning engineers to easily create randomized scenes for training their CNN. Here is the Github link.

Categorical Data Generation Using pydbgen

Pydbgen is a lightweight, pure-python library to generate random useful entries (e.g. name, address, credit card number, date, time, company name, job title, license plate number, etc.) and save them in either Pandas dataframe object or as a SQLite table in a database file or in an MS Excel file. You can read the documentation here.

Here are a few illustrative examples:

illustrative example of random name generation

Synthesizing Time Series Dataset

There are quite a few papers and code repositories for generating synthetic time-series data using special functions and patterns observed in real-life multivariate time series. A simple example is given in this GitHub link.

Synthetic time series, shown over 9 images.

Synthetic Audio Signal Dataset

Audio/speech processing is a domain of particular interest for deep learning practitioners and ML enthusiasts. Google’s NSynth dataset is a synthetically generated (using neural autoencoders and a combination of human and heuristic labeling) library of short audio files sound made by musical instruments of various kinds. Here is the detailed description of the dataset.

Synthetic Environments for Reinforcement Learning

OpenAI Gym

The greatest repository for synthetic learning environment for reinforcement ML is OpenAI Gym. It consists of a large number of pre-programmed environments onto which users can implement their own reinforcement learning algorithms for benchmarking the performance or troubleshooting hidden weaknesses.

Series of 6 Images of OpenAI Gym

Random Grid World

For beginners in reinforcement learning, it often helps to practice and experiment with a simple grid world where an agent must navigate through a maze to reach a terminal state with a given reward/penalty for each step and the terminal states.

With a few simple lines of code, one can synthesize grid world environments with arbitrary size and complexity (with a user-specified distribution of terminal states and reward vectors).

Take a look at this GitHub repo for ideas and code examples.

Scikit-Learn and More for Synthetic Data Generation: Summary and Conclusions

In this article, we went over a few examples of synthetic data generation for machine learning. It should be clear to the reader that by no means do these represent the exhaustive list of data generating techniques. In fact, many commercial apps other than scikit-learn are offering the same service, as the need for training your ML model with a variety of data is increasing at a fast pace.

However, if you create your own programmatic method of synthetic data generation as a data scientist or ML engineer, it saves your organization money and resources to invest in a third-party app and also lets you plan the development of your ML pipeline in a holistic and organic fashion.

I hope you enjoyed this article and can start using some of the techniques described here in your own projects soon.

Further Reading

Using Scikit-Learn for Machine Learning Application Development in Python

Scikit-Learn vs. Machine Learning in R (mlr)

Machine learning Scikit-learn Data science Synthetic data Algorithm Deep learning Gaussian (software) Open source clustering

Published at DZone with permission of Kevin Vu. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • A Beginner's Guide to Machine Learning: What Aspiring Data Scientists Should Know
  • Learning AI/ML: The Hard Way
  • Python Packages for Data Science
  • The Art of Separation in Data Science: Balancing Ego and Objectivity

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!