DataFlow — An Open-Source Data Preparation System Accelerating LLM Training

As data preparation becomes critical to LLM training, DataFlow emerges as an open-source system designed to automatically and systematically produce AI-ready data.

By Nan Xiang · Mar. 11, 26 · Presentation

Competition among large language models (LLMs) has intensified over the past two years, and many assume that the core advantage lies in algorithms. It does not. The open-source ecosystem has made mainstream architectures increasingly transparent: model structures such as Llama, GPT, and Gemma can all be publicly reproduced, and the competitive edge at the algorithmic level is eroding quickly. The real competitive barrier sits at a more fundamental level: data.

Data is the sole source of knowledge for LLMs, and data quality determines a model's "emotional intelligence" and "intelligence quotient." As a result, LLM development relies heavily on large-scale, high-quality training data. However, most mainstream training datasets and their processing workflows remain undisclosed, and the scale and quality of publicly available data resources are still limited. This makes it hard for the community to build and optimize training data for LLMs.

Additionally, although there are already a large number of open-source datasets, making them AI-ready remains an obstacle for both the community and industry due to a lack of systematic and efficient tool support. Existing data processing tools, such as Hadoop and Spark, mostly support operators oriented toward traditional methods rather than effectively integrating intelligent operators based on the latest LLMs. Moreover, they provide limited support for constructing training data for advanced large models. How can we address this dilemma?

DataFlow: A Data Preparation Engine for LLMs

As data preparation becomes the main battlefield of competition, the open-source technology ecosystem is becoming the key to breaking the deadlock. That’s why we created DataFlow, a data-centric AI system that transforms “black-boxed” data preparation engineering capabilities into reusable and scalable open-source AI infrastructure.

[Figure: DataFlow, a data-centric AI system]

DataFlow fully supports text-modality data governance and also supports extracting and translating text content from PDFs, web pages, and audio. The processed data can be used for pre-training, supervised fine-tuning (SFT), and reinforcement fine-tuning of LLMs. It can effectively improve the inference and retrieval capabilities of LLMs in both general domains and specific domains such as healthcare, finance, and law.

DataFlow Technical Framework

When the complexity of LLM data preparation becomes the biggest bottleneck for model evolution, the traditional pattern of “isolated tools + manual orchestration” is clearly not the optimal solution. The technical framework of DataFlow follows a streaming architecture of “input → processing → output,” covering the entire journey from raw data processing to application implementation. Its core is divided into three major layers:

[Figure: DataFlow system]


Data Input Layer

The input layer is designed for multimodal machine learning data, such as JSON, PDFs, images, and videos.

Key Design:

  • Unified Data Carrier: A pandas DataFrame carries multimodal machine learning data in a structured format.
  • Scalability: A reserved multimodal processing interface (the current version focuses on text; image and video support are under development).

Core Processing Layer

The core functionality of DataFlow lies in the processing layer, which comprises three modules: Operator, Pipeline, and Agent.

DataFlow Operator System

An operator is a basic data processing unit that typically executes logic based on rules, deep learning models, or LLMs.

| Operator Type | Use Cases |
|---|---|
| Multimodal operators | PNG → OCR; MP4 → automatic speech recognition; image → text description |
| General-purpose operators | Data filtering, deduplication, and diversity control |
| Domain-specific operators | Medical entity identification, financial compliance testing |
| Evaluation operators | Scoring on security, complexity, and inference difficulty |
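To make the operator idea concrete, here is a minimal sketch in plain Python. It does not use DataFlow's actual operator API: `Record`, `Operator`, `length_filter`, and `dedup_exact` are hypothetical names, and records are modeled as plain dictionaries with a `text` field rather than as rows of a DataFrame.

```python
from typing import Callable

# Hypothetical types for illustration; DataFlow's real operator classes differ.
Record = dict[str, str]
Operator = Callable[[list[Record]], list[Record]]


def length_filter(min_chars: int) -> Operator:
    """Rule-based operator: keep records whose text has at least min_chars characters."""
    def apply(records: list[Record]) -> list[Record]:
        return [r for r in records if len(r["text"].strip()) >= min_chars]
    return apply


def dedup_exact(records: list[Record]) -> list[Record]:
    """General-purpose operator: drop records whose normalized text was already seen."""
    seen: set[str] = set()
    unique: list[Record] = []
    for record in records:
        # Normalize case and whitespace so trivial variants count as duplicates.
        key = " ".join(record["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


if __name__ == "__main__":
    sample = [
        {"text": "Large language models learn from data."},
        {"text": "large language models   learn from data."},
        {"text": "ok"},
    ]
    cleaned = dedup_exact(length_filter(10)(sample))
    print(len(cleaned))
```

Each operator takes a batch of records and returns a new batch, which is what lets operators be chained, tested, and replaced one at a time.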


DataFlow Pipeline

A pipeline is the logical orchestration of multiple DataFlow operators, designed to complete a full data processing task. DataFlow currently provides eight pipelines as references, and they can also be customized or modified.

Preinstalled Pipelines (Out of the Box):

  • Strong Reasoning Synthesis: Generate mathematical or code reasoning chain data
  • Agentic RAG Optimization: Build a high-quality knowledge base for Retrieval-Augmented Generation
  • Text2SQL: Precise mapping from natural language to SQL
  • Knowledge Base Cleaning: Extract information from PDFs, web pages, and audio, and construct RAG knowledge fragments or question–answer pairs
  • ……

Customized Pipelines:

  • Graphical Drag-and-Drop: Connect operators to build a DAG (no code required)
  • YAML Configuration: Supports versioned management and reuse
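The orchestration idea can be sketched in a few lines of plain Python. This is an illustration only, not DataFlow's pipeline API: the `REGISTRY` mapping, `build_pipeline`, and the two toy operators are hypothetical, and the sketch chains operators linearly rather than as a general DAG.

```python
from functools import reduce
from typing import Callable

Record = dict[str, str]
Operator = Callable[[list[Record]], list[Record]]


def strip_whitespace(records: list[Record]) -> list[Record]:
    """Toy operator: trim surrounding whitespace from each text field."""
    return [{**r, "text": r["text"].strip()} for r in records]


def drop_empty(records: list[Record]) -> list[Record]:
    """Toy operator: remove records whose text is empty."""
    return [r for r in records if r["text"]]


# Hypothetical registry mapping operator names (as a config file might list them)
# to Python callables.
REGISTRY: dict[str, Operator] = {
    "strip_whitespace": strip_whitespace,
    "drop_empty": drop_empty,
}


def build_pipeline(operator_names: list[str]) -> Operator:
    """Chain the named operators so each one consumes the previous one's output."""
    steps = [REGISTRY[name] for name in operator_names]

    def run(records: list[Record]) -> list[Record]:
        return reduce(lambda data, step: step(data), steps, records)

    return run


if __name__ == "__main__":
    pipeline = build_pipeline(["strip_whitespace", "drop_empty"])
    print(pipeline([{"text": "  hello  "}, {"text": "   "}]))
```

Keeping the operator order as a list of names is one way a configuration file could drive the pipeline without code changes.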

DataFlow Agent

The DataFlow Agent is an automated task-processing system based on multi-agent collaboration. It covers the entire workflow of “task decomposition → tool registration → scheduling and execution → result verification → report generation,” and is designed for the intelligent management and execution of complex tasks.


Some agent capabilities include:

  • Automatically arranging operators according to user queries to form new pipelines
  • Automatically writing new operators based on user queries
  • Automatically resolving data analysis tasks

Output Layer

The generated high-quality data can meet the requirements of LLM training and industry scenarios. Examples include:

  • Multi-dimensional Assessment Reports: Visual displays of quality improvements in cleaned or synthesized data
  • Downstream Scenario Support, including:
    • Model Training: High-quality data for all stages of pre-training, SFT, and RLHF
    • Vector Databases: Output of <Question, Evidence Fragment, Answer> triples adapted for RAG
    • Domain-specific Knowledge Bases: Knowledge Q&A and decision support for the medical and financial industries
    • ……
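As an illustration of the triple format mentioned above, the following sketch models one <Question, Evidence Fragment, Answer> triple and writes it as JSON Lines. The `RagTriple` class, its field names, and the JSON Lines layout are assumptions made for this example; the article does not specify DataFlow's actual output schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RagTriple:
    """Hypothetical container for one <Question, Evidence Fragment, Answer> triple."""
    question: str
    evidence: str
    answer: str


def to_jsonl(triples: list[RagTriple]) -> str:
    """Serialize triples as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(asdict(t), ensure_ascii=False) for t in triples)


if __name__ == "__main__":
    example = RagTriple(
        question="What does SFT stand for?",
        evidence="The processed data can be used for supervised fine-tuning (SFT).",
        answer="Supervised fine-tuning.",
    )
    print(to_jsonl([example]))
```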

DataFlow Quick Start Guide

Next, let’s review best practices for installing and deploying DataFlow.

Environment Preparation

System Requirements:

  • Operating System: Linux / macOS / Windows (Linux recommended)
  • Python: Version 3.10 or higher
  • Conda: For environment isolation and dependency management
  • IDE: VSCode or PyCharm

Recommended Directory Structure:

Plain Text
 
workspace/
 ├── dataflow_env/
 ├── pipelines/
 ├── data/
 ├── cache_local/
 └── logs/


(Note: Simply prepare an empty folder named workspace. The subdirectories, such as pipelines, can be automatically generated by subsequent commands.)

Environment Configuration

Create a Conda Environment

Shell
 
conda create -n dataflow python=3.10 -y
conda activate dataflow


Tip: The -y flag automatically confirms the installation prompts. Without it, you will be asked to enter y manually.

Install DataFlow

Shell
 
pip install open-dataflow
# or
pip install "open-dataflow[vllm]"

We recommend using pip install open-dataflow initially. If you have a GPU, you can later install the version with vllm.

Verify Installation

Shell
 
dataflow -v

If the following message appears, the installation was successful:

[Figure: Successful installation]


Plain Text
 
You are using the latest version: x.x.x


Note: The success message has the same form for every version; only the version number (x.x.x) differs.
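The installation can also be checked from Python with the standard library. The sketch below only reports which version of the open-dataflow distribution is installed in the current environment; unlike `dataflow -v`, it does not check whether that version is the latest. The helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "open-dataflow") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"open-dataflow: {found or 'not installed'}")
```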

Project Initialization and Operation Verification

Initialize the Project Directory

Shell
 
dataflow init

After execution, a default pipeline example and configuration file will be generated in the current directory (as shown in the figure below).

Run the Example Pipeline

  1. Locate the target pipeline file in the working directory.
  2. Configure the data source (sample data can be found in the example data directory).
  3. Run python followed by the path to the target pipeline file, for example:

Shell
 
python example_data/example_pipeline.py


The result file will be generated in the cache_local/ directory.
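To locate the output after a run, a small helper can pick the most recently modified file under cache_local/. This is a convenience sketch using only the standard library; `newest_file` is a hypothetical helper, and it assumes nothing about the names or formats of the files DataFlow writes.

```python
from pathlib import Path
from typing import Optional


def newest_file(directory: str = "cache_local") -> Optional[Path]:
    """Return the most recently modified file under directory, or None if there is none."""
    root = Path(directory)
    if not root.is_dir():
        return None
    # Search recursively in case results are written into subfolders.
    files = [p for p in root.rglob("*") if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    latest = newest_file()
    print(latest if latest else "no result files found in cache_local/")
```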

Advanced Deployment Practice

Build from Source

Building from source suits developers who need to modify the underlying logic or debug the framework.

Shell
 
git clone https://github.com/OpenDCAI/DataFlow.git
conda create -n dataflow_diy python=3.10
conda activate dataflow_diy
cd DataFlow
pip install -e .


Verify installation:

Shell
 
dataflow -v


Download a Dataset from Hugging Face

Install the package:

Shell
 
pip install huggingface_hub


Create hf_download.sh:

Shell
 
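# Route requests through a mirror endpoint
export HF_ENDPOINT=https://hf-mirror.com
rep="<huggingface dataset name>"
local_dir="./data"
# Download the dataset repository into the local directory
huggingface-cli download "$rep" \
  --repo-type dataset \
  --local-dir "$local_dir" \
  --force-download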


Run the script (the figure below shows the layout of the downloaded dataset):

Shell
 
bash hf_download.sh
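The same download can be done from Python with the `snapshot_download` function from huggingface_hub. The sketch below is one way to do it, assuming the mirror endpoint and the ./data target used in the shell script above; `download_kwargs` and `download_dataset` are hypothetical helper names, and the dataset name remains a placeholder for you to fill in.

```python
import os

# Assumption: same mirror endpoint as the shell script. It must be set before
# huggingface_hub is imported so the library picks it up.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")


def download_kwargs(repo_id: str, local_dir: str = "./data") -> dict:
    """Build keyword arguments that correspond to the CLI options in hf_download.sh."""
    return {
        "repo_id": repo_id,
        "repo_type": "dataset",   # same as --repo-type dataset
        "local_dir": local_dir,   # same as --local-dir ./data
        "force_download": True,   # re-fetch files even if they are cached
    }


def download_dataset(repo_id: str, local_dir: str = "./data") -> str:
    """Download a dataset snapshot and return the local folder path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(repo_id, local_dir))


if __name__ == "__main__":
    # Replace the placeholder with a real dataset name before running.
    print(download_dataset("<huggingface dataset name>"))
```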


[Figure: Run the script]

Run a Custom Pipeline

The steps are similar to those above:

Shell
 
python pipelines/custom_pipeline.py --config config/custom.json

The input source, operator order, and output path can be flexibly controlled through the configuration file.
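A custom pipeline script might read its configuration along the following lines. This is a sketch under stated assumptions: the article says the config controls the input source, operator order, and output path, but it does not give the key names, so `input_path`, `operators`, and `output_path` are hypothetical.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the --config flag used in the command above."""
    parser = argparse.ArgumentParser(description="Run a custom pipeline (sketch).")
    parser.add_argument("--config", required=True, help="Path to a JSON config file")
    return parser.parse_args(argv)


def load_config(path: str) -> dict:
    """Load the JSON config and check the keys this sketch assumes."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    # Hypothetical keys: the article names these three settings but not their spelling.
    required = ("input_path", "operators", "output_path")
    missing = [key for key in required if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {', '.join(missing)}")
    return config


if __name__ == "__main__":
    args = parse_args()
    settings = load_config(args.config)
    print("operator order:", settings["operators"])
```

Validating the config up front means a misspelled key fails immediately instead of partway through a long processing run.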

That concludes the quick start guide for DataFlow. Technical documentation is also available, and the community is welcome to share insights and contribute.

Conclusion: A New Paradigm for Data Engineering

As the open-source LLM ecosystem continues to grow, one pattern is becoming clear: models evolve quickly, but data challenges remain difficult. DataFlow reframes data as a first-class, evolving system. It introduces operators for each stage of data processing — parsing, generation, filtering, evaluation, and feedback — that can be versioned, debugged, and improved independently, just like model code.

For developers building, training, and maintaining open-source LLM systems, this shared structure transforms isolated efforts into collective progress.
