Flexible Data Generation With Datafaker Gen

In this article, explore Datafaker Gen, a command-line tool designed to generate realistic data in various formats and sink to different destinations.

By Roman Rybak · Feb. 22, 24 · Tutorial

Introduction to Datafaker

Datafaker is a modern framework that enables JVM programmers to efficiently generate fake data for their projects, using over 200 data providers that allow for quick setup and usage. Custom providers can be written when you need domain-specific data. In addition to providers, the generated data can be exported to popular formats such as CSV, JSON, SQL, XML, and YAML.
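
At its most basic, using Datafaker looks like the following minimal, illustrative snippet (the class name is made up for the example, and the printed values are random draws from the named providers):

Java
 
import net.datafaker.Faker;

public class DatafakerQuickStart {
    public static void main(String[] args) {
        // Faker is the entry point; each provider method returns a domain-specific generator
        Faker faker = new Faker();
        System.out.println(faker.name().fullName());          // e.g. "Jane Doe"
        System.out.println(faker.internet().emailAddress());  // e.g. "jane.doe@example.com"
    }
}
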

For a good introduction to the basic features, see "Datafaker: An Alternative to Using Production Data."

Datafaker offers many features, such as working with sequences and collections and generating custom objects based on schemas (see "Datafaker 2.0").

Bulk Data Generation

In software development and testing, the need to generate data frequently arises, whether to conduct non-functional tests or to simulate burst loads. Let's consider a straightforward scenario: we have the task of generating 10,000 messages in JSON format and sending them to RabbitMQ.

From my perspective, these options are worth considering:

  1. Developing your own tool: One option is to write a custom application from scratch to generate these records (messages). If the generated data needs to be more realistic, it makes sense to use Datafaker or JavaFaker.
  2. Using specific tools: Alternatively, we could choose tools designed for particular databases or message brokers. For example, voluble provides specialized functionality for generating and publishing messages to Kafka topics, while a more modern tool like ShadowTraffic, currently under development, takes a container-based approach, which may not always be necessary.
  3. Datafaker Gen: Finally, we have the option to use Datafaker Gen, which is what I want to consider in this article.

Datafaker Gen Overview

Datafaker Gen offers a command-line generator based on the Datafaker library, which allows for continuous generation of data in various formats and integration with different storage systems, message brokers, and backend services. Since the tool builds on Datafaker, the generated data can be made realistic. The schema, format type, and sink can be configured without rebuilding the project.

Datafake Gen consists of the following main components that can be configured:

1. Schema Definition

Users can define the schema for their records in the config.yaml file. The schema specifies the field definitions of the record based on Datafaker providers. It also allows for the definition of embedded fields.

YAML
 
default_locale: en-EN
fields:
  - name: lastname
    generators: [ Name#lastName ]
  - name: firstname
    generators: [ Name#firstName ]


2. Format

Datafake Gen allows users to specify the format in which records will be generated. Currently, there are basic implementations for CSV, JSON, SQL, XML, and YAML formats. Additionally, formats can be extended with custom implementations. The configuration for formats is specified in the output.yaml file.

YAML
 
formats:
  csv:
    quote: "@"
    separator: $$$$$$$
  json:
    formattedAs: "[]"
  yaml:
  xml:
    pretty: true


3. Sink

The sink component determines where the generated data will be stored or published. The basic implementation includes command-line output and text file sinks. Additionally, sinks can be extended with custom implementations such as RabbitMQ, as demonstrated in the current article. The configuration for sinks is specified in the output.yaml file.

YAML
 
sinks:
  rabbitmq:
    batchsize: 1 # when 1, each message contains a single document; when greater than 1, each message contains a batch of documents
    host: localhost
    port: 5672
    username: guest
    password: guest
    exchange: test.direct.exchange
    routingkey: products.key


Extensibility via Java SPI

Datafake Gen uses the Java SPI (Service Provider Interface) to make it easy to add new formats or sinks. This extensibility allows for customization of Datafake Gen according to specific requirements.
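
For readers less familiar with the SPI mechanism, discovery of this kind relies on Java's standard ServiceLoader. The sketch below is not Datafake Gen's actual bootstrap code; it only illustrates how such a lookup works, and the class name SinkDiscoveryExample is purely illustrative:

Java
 
import java.util.ServiceLoader;

import net.datafaker.datafaker_gen.sink.Sink;

public class SinkDiscoveryExample {
    public static void main(String[] args) {
        // ServiceLoader scans META-INF/services/net.datafaker.datafaker_gen.sink.Sink
        // on the classpath and instantiates every implementation listed there.
        ServiceLoader<Sink> sinks = ServiceLoader.load(Sink.class);
        for (Sink sink : sinks) {
            System.out.println("Discovered sink: " + sink.getName());
        }
    }
}
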

How To Add a New Sink in Datafake Gen

Before adding a new sink, you may want to check whether it already exists in the datafaker-gen-examples repository. If it does not, you can use the sinks there as examples of how to add a new one.

When it comes to extending Datafake Gen with new sink implementations, developers have two primary options to consider:

  1. Use this parent project directly: developers can implement the sink interfaces for their extensions alongside the sinks already available in the datafaker-gen-examples repository.
  2. Include the dependencies from the Maven repository to access the required interfaces. For this approach, Datafake Gen must first be built and installed into the local Maven repository. This approach provides more flexibility in project structure and requirements.

1. Implementing RabbitMQ Sink

To add a new RabbitMQ sink, one simply needs to implement the net.datafaker.datafaker_gen.sink.Sink interface.

This interface contains two methods:

  1. getName - This method defines the sink name.
  2. run - This method triggers the generation of records and then sends or saves all the generated records to the specified destination. The method parameters include the configuration specific to this sink, retrieved from the output.yaml file, as well as the data generation function and the desired number of lines to be generated.
Java
 
import com.google.gson.JsonArray;
import com.google.gson.JsonParser;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import net.datafaker.datafaker_gen.sink.Sink;

import java.util.Map;
import java.util.function.Function;

public class RabbitMqSink implements Sink {

    @Override
    public String getName() {
        return "rabbitmq";
    }

    @Override
    public void run(Map<String, ?> config, Function<Integer, ?> function, int numberOfLines) {
        // Read the sink configuration defined in output.yaml
        int numberOfLinesToPrint = numberOfLines;
        String host = (String) config.get("host");
        int port = (Integer) config.get("port");
        String username = (String) config.get("username");
        String password = (String) config.get("password");
        String exchange = (String) config.get("exchange");
        String routingKey = (String) config.get("routingkey");

        // Generate the requested number of records in the configured format (JSON here)
        String lines = (String) function.apply(numberOfLinesToPrint);

        // Send the results to the expected resource:
        // in this case, connect to RabbitMQ and publish one message per generated document.
        ConnectionFactory factory = getConnectionFactory(host, port, username, password);
        try (Connection connection = factory.newConnection()) {
            Channel channel = connection.createChannel();
            JsonArray jsonArray = JsonParser.parseString(lines).getAsJsonArray();
            jsonArray.forEach(jsonElement -> {
                try {
                    channel.basicPublish(exchange, routingKey, null, jsonElement.toString().getBytes());
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private static ConnectionFactory getConnectionFactory(String host, int port, String username, String password) {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost(host);
        factory.setPort(port);
        factory.setUsername(username);
        factory.setPassword(password);
        return factory;
    }
}


2. Adding Configuration for the New RabbitMQ Sink

As previously mentioned, the configuration for sinks or formats can be added to the output.yaml file. The specific fields may vary depending on your custom sink. Below is an example configuration for a RabbitMQ sink:

YAML
 
sinks:
  rabbitmq:
    batchsize: 1 # when 1, each message contains a single document; when greater than 1, each message contains a batch of documents
    host: localhost
    port: 5672
    username: guest
    password: guest
    exchange: test.direct.exchange
    routingkey: products.key


3. Adding Custom Sink via SPI

Adding a custom sink via SPI (Service Provider Interface) involves including the provider configuration in the ./resources/META-INF/services/net.datafaker.datafaker_gen.sink.Sink file. This file lists the fully qualified class names of the sink implementations:

Properties files
 
net.datafaker.datafaker_gen.sink.RabbitMqSink


Those are the three simple steps needed to extend Datafake Gen. This example does not show the complete implementation of the sink or how to wire in the additional libraries; for the complete implementation, refer to the datafaker-gen-rabbitmq module in the example repository.

How To Run

Step 1

Build a JAR file based on the new implementation:

Shell
 
./mvnw clean verify


Step 2

Define the schema for records in the config.yaml file and place this file in the appropriate location where the generator should run. Additionally, define the sinks and formats in the output.yaml file, as demonstrated previously.

Step 3

Datafake Gen can be executed in two ways:

  1. Use the bash script from the bin folder of the parent project:

Shell
 
# Format json, number of lines 10000 and new RabbitMq Sink
bin/datafaker_gen -f json -n 10000 -sink rabbitmq


  2. Execute the JAR directly:

Shell
 
java -cp [path_to_jar] net.datafaker.datafaker_gen.DatafakerGen -f json -n 10000 -sink rabbitmq


How Fast Is It?

The test was based on the schema described above, which means that each document consists of two fields. Documents are published one by one to the RabbitMQ queue in JSON format. The table below shows the timings for 10,000, 100,000, and 1,000,000 records on my local machine:

Records     Time
10,000      401 ms
100,000     11,613 ms
1,000,000   121,601 ms
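
In rough terms, that corresponds to about 25,000 messages per second for the smallest run and a sustained rate of roughly 8,000-8,600 messages per second for the larger ones.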

Conclusion

The Datafake Gen tool enables the creation of flexible and fast data generators for various types of destinations. Built on Datafaker, it facilitates realistic data generation. Developers can easily configure the content of records, the formats, and the sinks to suit their needs. As a simple Java application, it can be deployed anywhere you want, whether in Docker or on on-premises machines.

  • The full source code is available here.
  • I would like to thank Sergey Nuyanzin for reviewing this article.

Thank you for reading, and I am glad to be of help.
