DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Papers with Code Trending 12/18/25, 03:46 PM Papers

llm-driven data-preparation workflow-automation data-centric-ai open-source pipeline-framework

Summary

DataFlow is an LLM-driven framework for automated data preparation and workflow engineering, featuring nearly 200 reusable operators and six domain-general pipelines that improve LLM performance across tasks like math, code, and Text-to-SQL.

The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1 to 3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.
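To make the idea of a "PyTorch-style pipeline construction API" over reusable operators more concrete, here is a minimal Python sketch. It is not DataFlow's actual API: the names `Operator`, `MinLengthFilter`, `GenerateAnswer`, `Pipeline`, and `fake_llm` are hypothetical and chosen only for illustration. The sketch assumes that operators are declared in a constructor and wired together in a `forward()` method, in the spirit of `torch.nn.Module`, and that a model-in-the-loop step can be represented by any callable mapping a prompt to text.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class Operator:
    """Hypothetical base class: one reusable transformation over records."""

    def run(self, records: List[Record]) -> List[Record]:
        raise NotImplementedError


class MinLengthFilter(Operator):
    """Drops records whose text field is shorter than `min_chars`."""

    def __init__(self, key: str, min_chars: int) -> None:
        self.key = key
        self.min_chars = min_chars

    def run(self, records: List[Record]) -> List[Record]:
        return [r for r in records if len(r.get(self.key, "")) >= self.min_chars]


class GenerateAnswer(Operator):
    """Model-in-the-loop step; `generate` stands in for an LLM call."""

    def __init__(self, generate: Callable[[str], str],
                 input_key: str, output_key: str) -> None:
        self.generate = generate
        self.input_key = input_key
        self.output_key = output_key

    def run(self, records: List[Record]) -> List[Record]:
        # Build new dicts so the input records are left unmodified.
        return [{**r, self.output_key: self.generate(r[self.input_key])}
                for r in records]


class Pipeline:
    """Declare operators in __init__, compose them in forward()."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self.length_filter = MinLengthFilter(key="question", min_chars=10)
        self.answerer = GenerateAnswer(generate, "question", "answer")

    def forward(self, records: List[Record]) -> List[Record]:
        records = self.length_filter.run(records)
        return self.answerer.run(records)


def fake_llm(prompt: str) -> str:
    # Placeholder for a real model call (assumption for this sketch).
    return f"[draft answer to: {prompt}]"


raw = [{"question": "What is 2 + 2?"}, {"question": "Hi"}]
cleaned = Pipeline(fake_llm).forward(raw)
```

Because each operator only sees and returns a list of records, a step can be swapped, reordered, or tested in isolation, which is the kind of modularity and debuggability the abstract attributes to operator-based pipelines.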

Original Article

Source: https://huggingface.co/papers/2512.16676 Published on Dec 18, 2025

#1 Paper of the day

Abstract

DataFlow is an LLM-driven data preparation framework that enhances data quality and reproducibility for various tasks, improving LLM performance with automatically generated pipelines.

Datasets citing this paper: 9

OpenDCAI/dataflow-demo-Text2SQL • Updated Feb 4 • 1.11k • 1
OpenDCAI/dataflow-mm-context_vqa • Updated Jan 30 • 206k • 372
OpenDCAI/dataflow-instruct-10k • Updated Dec 29, 2025 • 10k • 259 • 6
OpenDCAI/Infinity-Instruct-Curated-1M • Updated Mar 16 • 1M • 113

Similar Articles

@tom_doerr: Generates LLM-ready datasets from raw data https://github.com/OpenDCAI/DataFlow…

@llama_index: Most AI pipelines are only as good as the data we provide them with, and that usually means PDFs or other unstructured …

langflow-ai/langflow

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

FlowCompile: An Optimizing Compiler for Structured LLM Workflows
