DataFlow — An Open-Source Data Preparation System Accelerating LLM Training

As data preparation becomes critical to LLM training, DataFlow emerges as an open-source system designed to automatically and systematically produce AI-ready data.

Nan Xiang

Mar. 11, 26 · Presentation

Competition among large language models (LLMs) has intensified over the past two years, and many believe that the core competitive advantage lies in algorithms. That is not the case. The open-source ecosystem has made mainstream architectures increasingly transparent: model structures such as Llama, GPT, and Gemma can all be publicly reproduced, so the competitive edge at the algorithmic level is eroding rapidly. The real competitive barrier lies at a more fundamental level: data.

Data is the sole source of knowledge for LLMs, and data quality determines a model's "emotional intelligence" and "intelligence quotient." LLM development has therefore relied heavily on large-scale, high-quality training data. However, most mainstream training datasets and their processing workflows remain undisclosed, and publicly available data resources are still limited in scale and quality. This makes it difficult for the community to build and optimize training data for LLMs.

Moreover, although many open-source datasets already exist, making them AI-ready remains difficult for both the community and industry because systematic, efficient tooling is lacking. Existing data processing tools such as Hadoop and Spark mostly provide operators built around traditional methods and do not effectively integrate intelligent operators based on the latest LLMs. They also offer limited support for constructing training data for advanced large models. How can we address this dilemma?

DataFlow: A Data Preparation Engine for LLMs

As data preparation becomes the main battlefield of competition, the open-source technology ecosystem is becoming the key to breaking the deadlock. That’s why we created DataFlow, a data-centric AI system that transforms “black-boxed” data preparation engineering capabilities into reusable and scalable open-source AI infrastructure.

DataFlow fully supports text-modality data governance and also supports extracting and translating text content from PDFs, web pages, and audio. The processed data can be used for pre-training, supervised fine-tuning (SFT), and reinforcement fine-tuning of LLMs. It can effectively improve the inference and retrieval capabilities of LLMs in both general domains and specific domains such as healthcare, finance, and law.

DataFlow Technical Framework

When the complexity of LLM data preparation becomes the biggest bottleneck for model evolution, the traditional pattern of “isolated tools + manual orchestration” is clearly not the optimal solution. The technical framework of DataFlow follows a streaming architecture of “input → processing → output,” covering the entire journey from raw data processing to application implementation. Its core is divided into three major layers:

Data Input Layer

DataFlow is designed to accept machine learning data from multiple sources and modalities, such as JSON, PDFs, images, and videos.

Key Design:

Unified Data Carrier: A pandas DataFrame carries multimodal machine learning data in a structured format.
Scalability: A reserved multimodal processing interface (the current version focuses on text; image and video support are under development).
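
To illustrate the unified data carrier idea, the sketch below loads a few JSON records into a pandas DataFrame. It uses only pandas itself, not DataFlow's API, and the column names (source, modality, text) are hypothetical choices for this example rather than DataFlow's actual schema.

```python
import io

import pandas as pd

# Hypothetical JSON Lines input; the field names are assumptions for illustration.
RAW_JSONL = """\
{"source": "paper.pdf", "modality": "text", "text": "Transformers use attention."}
{"source": "page.html", "modality": "text", "text": "LLMs need clean data."}
"""


def load_records(jsonl: str) -> pd.DataFrame:
    """Parse JSON Lines into a DataFrame with one row per record."""
    return pd.read_json(io.StringIO(jsonl), lines=True)


df = load_records(RAW_JSONL)
print(df[["source", "modality"]])
```

Keeping every record in one tabular structure means that downstream steps can add, filter, or score columns without caring where a row originally came from.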

Core Processing Layer

The core functionality of DataFlow lies in the processing layer, which comprises three modules: Operator, Pipeline, and Agent.

DataFlow Operator System

An operator is a basic data processing unit that typically executes logic based on rules, deep learning models, or LLMs.

Operator Type	Use Cases
Multimodal operators	PNG → OCR; MP4 → automatic speech recognition; image → text description
General-purpose operators	Data filtering, deduplication, diversity control
Domain-specific operators	Medical entity identification, financial compliance testing
Evaluation operators	Scoring on security, complexity, and inference difficulty
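
The operator idea can be sketched in a few lines of plain Python. This is a minimal illustration, not DataFlow's operator interface: the function names are hypothetical, a list of dictionaries stands in for the DataFrame to keep the example dependency-free, and the "complexity" score is a toy heuristic (mean word length).

```python
from typing import Callable

Row = dict[str, object]
Operator = Callable[[list[Row]], list[Row]]  # rows in, rows out


def min_length_filter(min_chars: int) -> Operator:
    """Rule-based operator: keep rows whose text has at least min_chars characters."""
    def run(rows: list[Row]) -> list[Row]:
        return [r for r in rows if len(str(r["text"])) >= min_chars]
    return run


def deduplicate(rows: list[Row]) -> list[Row]:
    """General-purpose operator: drop rows whose normalized text was already seen."""
    seen: set[str] = set()
    kept = []
    for r in rows:
        key = " ".join(str(r["text"]).lower().split())  # ignore case and spacing
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


def score_complexity(rows: list[Row]) -> list[Row]:
    """Evaluation operator: attach a toy complexity score (mean word length)."""
    out = []
    for r in rows:
        words = str(r["text"]).split()
        score = sum(len(w) for w in words) / len(words) if words else 0.0
        out.append({**r, "complexity": round(score, 2)})
    return out


rows = [{"text": "Hello world"}, {"text": "hello   WORLD"}, {"text": "Hi"}]
cleaned = score_complexity(deduplicate(min_length_filter(5)(rows)))
print(cleaned)
```

Because every operator shares the same "rows in, rows out" shape, rule-based, model-based, and LLM-based steps can be swapped without changing the code around them.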

DataFlow Pipeline

A pipeline is the logical orchestration of multiple DataFlow operators, designed to complete a full data processing task. DataFlow currently provides eight pipelines as references, and they can also be customized or modified.
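
As a rough sketch of what "logical orchestration" means, the code below chains operators so that each one receives the output of the previous one. The helper make_pipeline and the two stand-in operators are hypothetical and are not part of DataFlow.

```python
from functools import reduce
from typing import Callable

Rows = list[dict]
Operator = Callable[[Rows], Rows]


def make_pipeline(*operators: Operator) -> Operator:
    """Compose operators so each one receives the previous operator's output."""
    def run(rows: Rows) -> Rows:
        return reduce(lambda data, op: op(data), operators, rows)
    return run


# Two stand-in operators (hypothetical, not DataFlow built-ins).
def strip_whitespace(rows: Rows) -> Rows:
    return [{**r, "text": r["text"].strip()} for r in rows]


def drop_empty(rows: Rows) -> Rows:
    return [r for r in rows if r["text"]]


pipeline = make_pipeline(strip_whitespace, drop_empty)
result = pipeline([{"text": "  keep me "}, {"text": "   "}])
print(result)
```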

Preinstalled Pipelines (Out of the Box):

Strong Reasoning Synthesis: Generate mathematical or code reasoning chain data
Agentic RAG Optimization: Build a high-quality knowledge base for Retrieval-Augmented Generation
Text2SQL: Precise mapping from natural language to SQL
Knowledge Base Cleaning: Extract information from PDFs, web pages, and audio, and construct RAG knowledge fragments or question–answer pairs
……

Customized Pipelines:

Graphical Drag-and-Drop: Connect operators to build a DAG (no code required)
YAML Configuration: Supports versioned management and reuse
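
When operators are connected as a DAG, the execution order has to respect the dependencies between them. The sketch below shows one way to derive such an order with Python's standard graphlib module. The operator names and the graph are invented for illustration, and this is not how DataFlow itself is confirmed to schedule operators.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline graph: each operator maps to the operators it depends on.
dag = {
    "parse_pdf": set(),
    "filter_short": {"parse_pdf"},
    "deduplicate": {"filter_short"},
    "score_quality": {"deduplicate"},
    "generate_qa": {"deduplicate"},
    "export": {"score_quality", "generate_qa"},
}


def execution_order(graph: dict[str, set[str]]) -> list[str]:
    """Return an order in which every operator runs after its dependencies.

    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dag)
print(order)
```

A graph like this could equally be stored in a versioned configuration file, since it is plain data.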

DataFlow Agent

The DataFlow Agent is an automated task-processing system based on multi-agent collaboration. It covers the entire workflow of “task decomposition → tool registration → scheduling and execution → result verification → report generation,” and is designed for the intelligent management and execution of complex tasks.

Some agent capabilities include:

Automatically arranging operators according to user queries to form new pipelines
Automatically writing new operators based on user queries
Automatically resolving data analysis tasks
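
The workflow stages can be made concrete with a deliberately small toy. Everything here is an assumption for illustration: a keyword match stands in for an LLM planner, the registry and function names are hypothetical, and the real DataFlow Agent is a multi-agent system that this sketch does not reproduce.

```python
from typing import Callable

TOOLS: dict[str, Callable[[list[str]], list[str]]] = {}


def register(name: str):
    """Tool registration: make an operator discoverable by name."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap


@register("deduplicate")
def deduplicate(texts: list[str]) -> list[str]:
    return list(dict.fromkeys(texts))  # keeps first occurrence, preserves order


@register("filter")
def filter_short(texts: list[str]) -> list[str]:
    return [t for t in texts if len(t) >= 5]


def decompose(query: str) -> list[str]:
    """Task decomposition: pick registered tools whose name appears in the query.

    The plan follows registration order, not the word order of the query.
    """
    return [name for name in TOOLS if name in query.lower()]


def run_agent(query: str, texts: list[str]) -> dict:
    plan = decompose(query)
    for step in plan:  # scheduling and execution
        texts = TOOLS[step](texts)
    # Result verification: if deduplication was planned, no duplicates may remain.
    verified = len(texts) == len(set(texts)) if "deduplicate" in plan else True
    return {"plan": plan, "output": texts, "verified": verified}  # report generation


report = run_agent("Please filter and deduplicate my data", ["hello", "hello", "hi", "world"])
print(report)
```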

Output Layer

The generated high-quality data can meet the requirements of LLM training and industry scenarios. Examples include:

Multi-dimensional Assessment Reports: Visual displays of quality improvements in cleaned or synthesized data
Downstream Scenario Support, including:
- Model Training: High-quality data for all stages of pre-training, SFT, and RLHF
- Vector Databases: Output of <Question, Evidence Fragment, Answer> triples adapted for RAG
- Domain-specific Knowledge Bases: Knowledge Q&A and decision support for the medical and financial industries
- ……
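
As an example of the triple output, the sketch below models a <Question, Evidence Fragment, Answer> record and serializes it as JSON Lines. The class name, field names, and file format are assumptions made for illustration. The source does not specify how DataFlow stores these triples.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RagTriple:
    question: str
    evidence: str  # the evidence fragment retrieved from the knowledge base
    answer: str


def to_jsonl(triples: list[RagTriple]) -> str:
    """Serialize triples as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(t), ensure_ascii=False) for t in triples)


triples = [
    RagTriple(
        question="What carries data between operators?",
        evidence="A pandas DataFrame carries multimodal machine learning data in a structured format.",
        answer="A pandas DataFrame.",
    )
]
jsonl = to_jsonl(triples)
print(jsonl)
```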

DataFlow Quick Start Guide

Next, let’s review best practices for installing and deploying DataFlow.

Environment Preparation

System Requirements:

Operating System: Linux / macOS / Windows (Linux recommended)
Python: Version 3.10 or higher
Conda: For environment isolation and dependency management
IDE: VSCode or PyCharm
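
Before installing, it can help to confirm that the machine meets these requirements. The following is a small optional preflight check written for this guide. It is not part of DataFlow, and the function names are hypothetical.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version from the requirements above


def python_ok(version: tuple = tuple(sys.version_info[:2])) -> bool:
    """True if the given (major, minor) version meets the minimum."""
    return tuple(version[:2]) >= MIN_PYTHON


def conda_available() -> bool:
    """True if a conda executable is found on PATH."""
    return shutil.which("conda") is not None


print(f"Python {sys.version_info.major}.{sys.version_info.minor} supported: {python_ok()}")
print(f"conda on PATH: {conda_available()}")
```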

Recommended Directory Structure:

    workspace/
    ├── dataflow_env/
    ├── pipelines/
    ├── data/
    ├── cache_local/
    └── logs/

(Note: Simply prepare an empty folder named workspace. The subdirectories, such as pipelines, can be automatically generated by subsequent commands.)

Environment Configuration

Create a Conda Environment

    conda create -n dataflow python=3.10 -y
    conda activate dataflow

Tip: The -y parameter automatically confirms installation. Without it, you will be prompted to enter y manually.

Install DataFlow

    pip install open-dataflow
    # or
    pip install "open-dataflow[vllm]"

We recommend starting with pip install open-dataflow. If you have a GPU, you can install the vllm variant later.

Verify Installation

    dataflow -v

If the following message appears, the installation was successful:

    You are using the latest version: x.x.x

Note: The success message has the same form for every version; x.x.x stands for the installed version number.
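
The installed version can also be checked from Python with the standard importlib.metadata module. This sketch assumes the distribution name is open-dataflow, as in the pip command used earlier, and the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "open-dataflow") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"open-dataflow: {found or 'not installed'}")
```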

Project Initialization and Operation Verification

Initialize the Project Directory

    dataflow init

After execution, a default pipeline example and configuration file will be generated in the current directory (as shown in the figure below).

Run the Example Pipeline

Locate the target pipeline file in the working directory.
Configure the data source (sample data can be found in the example data directory).
Run the pipeline with python followed by the path to the target pipeline file.

Run:

    python example_data/example_pipeline.py

The result file will be generated in the cache_local/ directory.
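
To find what a run produced, you can look for the most recently modified file under cache_local/. The helper below is a convenience sketch written for this guide, not a DataFlow command, and it makes no assumption about the name or format of the result file.

```python
from pathlib import Path
from typing import Optional


def latest_result(cache_dir: str = "cache_local") -> Optional[Path]:
    """Return the most recently modified file under cache_dir, or None."""
    root = Path(cache_dir)
    if not root.is_dir():
        return None
    files = [p for p in root.rglob("*") if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


print(latest_result() or "No result files found in cache_local/")
```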

Advanced Deployment Practice

Build from Source

Suitable for developers who need to modify underlying logic or debug the framework.

    git clone https://github.com/OpenDCAI/DataFlow.git
    conda create -n dataflow_diy python=3.10
    conda activate dataflow_diy
    cd DataFlow
    pip install -e .

Verify installation:

    dataflow -v

Download a Dataset from Hugging Face

Install the package:

    pip install huggingface_hub

Create hf_download.sh:

    # Use the hf-mirror.com endpoint (omit this line to use the default Hugging Face Hub)
    export HF_ENDPOINT=https://hf-mirror.com
    rep="<huggingface dataset name>"
    local_dir="./data"
    huggingface-cli download "$rep" \
      --repo-type dataset \
      --local-dir "$local_dir" \
      --force-download

Run the script (the figure below shows the layout of the downloaded dataset):

    bash hf_download.sh

Run a Custom Pipeline

The steps are similar to those above:

    python pipelines/custom_pipeline.py --config config/custom.json

The input source, operator order, and output path can be flexibly controlled through the configuration file.
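
As a rough idea of what such a configuration might contain, the sketch below parses a JSON document with an input path, an operator order, and an output path. The schema, key names, and file paths are hypothetical. The actual keys depend on how your custom pipeline script reads its configuration.

```python
import json

# Hypothetical config schema; the real keys depend on your pipeline script.
EXAMPLE_CONFIG = """
{
  "input": "data/raw.jsonl",
  "operators": ["filter_short", "deduplicate", "score_quality"],
  "output": "cache_local/custom_result.jsonl"
}
"""

REQUIRED_KEYS = {"input", "operators", "output"}


def load_config(text: str) -> dict:
    """Parse the JSON config and fail early if a required key is missing."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError(f"Missing config keys: {sorted(missing)}")
    return config


config = load_config(EXAMPLE_CONFIG)
print(config["operators"])
```

Validating the configuration before any operator runs surfaces a misspelled key immediately rather than partway through a long job.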

That concludes the quick start guide for DataFlow. Technical documentation is also available, and the community is welcome to share insights and contribute.

Conclusion: A New Paradigm for Data Engineering

As the open-source LLM ecosystem continues to grow, one pattern is becoming clear: models evolve quickly, but data challenges remain difficult. DataFlow reframes data as a first-class, evolving system. It introduces operators for each stage of data processing — parsing, generation, filtering, evaluation, and feedback — that can be versioned, debugged, and improved independently, just like model code.

For developers building, training, and maintaining open-source LLM systems, this shared structure transforms isolated efforts into collective progress.

Opinions expressed by DZone contributors are their own.
