Exploring Autonomous Agentic Data Engineering for Model Specialization

Summary

This paper formalizes Autonomous Agentic Data Engineering, where LLMs act as autonomous data engineers to curate and optimize training data for specialized domains, showing a 57.29% improvement in student model performance using GPT-5.2.

arXiv:2605.30407v1 (announce type: new)

# Exploring Autonomous Agentic Data Engineering for Model Specialization
Source: [https://arxiv.org/html/2605.30407](https://arxiv.org/html/2605.30407)

Yujie Luo♠♡, Xiangyuan Ru♠, Jingsheng Zheng♠, Jingjing Wang♠, Yuqi Zhu♠, Jintian Zhang♠, Runnan Fang♠, Kewei Xu♠, Ye Liu♡, Zheng Wei♡, Jiang Bian♡, Zang Li♡, Shumin Deng♠

♠Zhejiang University ♡Platform and Content Group, Tencent

{luo.yj,231sm}@zju.edu.cn

###### Abstract

Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods primarily rely on human-designed workflows, leaving it unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize **Autonomous Agentic Data Engineering**, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains, as GPT-5.2 constructs a training curriculum that improves a student model by **57.29%**, entirely through iterative, agent-driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization. (Footnote 1: Code will be released at [https://github.com/zjunlp/DataAgent](https://github.com/zjunlp/DataAgent).)

## 1 Introduction

Large Language Models (LLMs) have acquired emergent capabilities through training on massive amounts of data in recent years Guha et al. ([2025](https://arxiv.org/html/2605.30407#bib.bib5)); Zhou et al. ([2025](https://arxiv.org/html/2605.30407#bib.bib39)). Despite strong performance on general tasks, even the most advanced LLMs often struggle to adapt when their training data do not adequately reflect specialized downstream tasks Li et al. ([2024](https://arxiv.org/html/2605.30407#bib.bib12)); Mishra et al. ([2022](https://arxiv.org/html/2605.30407#bib.bib18)).

Adapting a general-purpose model to a target specialized domain typically necessitates post-training on domain-specific instruction data, as exemplified by curated corpora Zhang et al. ([2024](https://arxiv.org/html/2605.30407#bib.bib36)); Yang et al. ([2023](https://arxiv.org/html/2605.30407#bib.bib33)). Given the complexity of data processing and the scarcity of high-quality domain data, researchers have increasingly turned to LLM-based methods Qiao et al. ([2024](https://arxiv.org/html/2605.30407#bib.bib20)); Liang et al. ([2025](https://arxiv.org/html/2605.30407#bib.bib15)), utilizing LLMs as data generators within human-designed workflows. As adapting these handcrafted recipes to new domains requires extensive configuration, modern LLM agents offer a more promising alternative through their remarkable advances in complex reasoning DeepSeek-AI ([2025](https://arxiv.org/html/2605.30407#bib.bib4)), code generation Ni et al. ([2023](https://arxiv.org/html/2605.30407#bib.bib19)); Hong et al. ([2024](https://arxiv.org/html/2605.30407#bib.bib7)), and tool use Qin et al. ([2024](https://arxiv.org/html/2605.30407#bib.bib21)). These advances raise a natural question: *Can LLM agents autonomously perform end-to-end data engineering for model specialization?*

![Refer to caption](https://arxiv.org/html/2605.30407v1/x1.png)

Figure 1: Paradigm of Agentic Data Engineering. The LLM data engineer independently executes the entire data curation loop to drive model specialization, iteratively optimizing data guided by post-training student model performance feedback.

To investigate this question, we formalize the task of Autonomous Agentic Data Engineering (Figure [1](https://arxiv.org/html/2605.30407#S1.F1)), where LLMs are tasked with completing the entire training data curation pipeline independently, including strategy planning, domain specification, prompt design, data synthesis, data validation, and iterative data optimization. By holding both the teacher model for data synthesis and the student model for data training fixed, we isolate the end-to-end data engineering capability of LLMs, which is ultimately evaluated by the post-training performance improvement of the student model.

We conduct a comprehensive analysis of the performance of mainstream LLMs across three specialized domains: Science, Code, and Finance. LLM capabilities are evaluated under a single-turn completion agent setting (One-Shot) and a closed-loop, self-optimizing agent setting (Iterative Agent), both from scratch and with initial seed data. Experiments show that modern LLM agents possess substantial data engineering capabilities, enabling them to infer missing supervision signals and synthesize task-aligned instances even from scratch. Notably, GPT-5.2 achieves an average relative performance gain of 57.29% through iterative optimization, surpassing human-crafted data synthesis pipelines. Despite these encouraging findings, we also identify significant failure modes, suggesting that LLMs still lack robust post-generation mechanisms for reliable quality assurance.

Overall, we summarize our contributions as:

- We formalize the task of Agentic Data Engineering, an autonomous paradigm in which LLMs independently manage the entire training data curation lifecycle. This provides a controlled setting for studying end-to-end data engineering as a measurable capability of LLM agents.
- We develop an end-to-end execution and evaluation environment that covers the full data curation pipeline for model specialization, enabling isolated and budget-controlled agent execution, along with external feedback and a performance-based evaluation protocol.
- We instantiate two representative settings, One-Shot and Iterative Agent, and evaluate mainstream LLMs across diverse domains. We further analyze iterative optimization, data quality, and failure modes in specialization.

## 2 Agentic Data Engineering

![Refer to caption](https://arxiv.org/html/2605.30407v1/x2.png)

Figure 2: Overall framework of our study. (a) Environment: an overview of the covered domains, the agent input containing task settings and procedural feedback, and the final evaluation method. (b) Agent Workflow: an example workflow in which agents develop strategies to curate data and output a `submission.json` for specialization. In the (ii) One-Shot setting, the submission is produced in a single pass, whereas in the (i) Iterative Agent setting, the agent iteratively improves its data curation strategy with feedback and reports the best submission.

### 2.1 Problem Formulation

We formalize Agentic Data Engineering (Figure [1](https://arxiv.org/html/2605.30407#S1.F1)) as an end-to-end closed-loop paradigm in which an LLM agent $\mathcal{A}$ autonomously curates training data to specialize a *fixed* student model $\mathcal{M}_S$ with a *fixed* teacher model $\mathcal{M}_T$ for data synthesis.

For a target task $\mathcal{T}$, the agent designs a data-curation program $\mathcal{P}_{\mathcal{A}}$ that calls $\mathcal{M}_T$ to synthesize a candidate dataset

$$\widehat{\mathcal{D}} \;=\; \mathcal{P}_{\mathcal{A}}(\mathcal{T};\,\mathcal{M}_T). \tag{1}$$

The student model is then specialized on $\widehat{\mathcal{D}}$ via supervised fine-tuning, denoted $\mathrm{Spec}(\cdot)$, and scored by a deterministic rule-based evaluator $\mathcal{E}$, producing the environmental feedback signal

$$f \;=\; \mathcal{E}\bigl(\mathrm{Spec}(\mathcal{M}_S,\,\widehat{\mathcal{D}})\bigr). \tag{2}$$

Given the synthesis data $\widehat{\mathcal{D}}$ and the feedback signal $f$, the entire agentic data engineering process can be cast as a closed-loop objective in which agent $\mathcal{A}$ searches over curation strategies to maximize the student's post-training performance:

$$\mathcal{P}_{\mathcal{A}}^{\star} \;=\; \arg\max_{\mathcal{P}_{\mathcal{A}}} \; \mathcal{E}\!\left(\mathrm{Spec}\!\left(\mathcal{M}_S,\; \mathcal{P}_{\mathcal{A}}(\mathcal{T};\,\mathcal{M}_T)\right)\right). \tag{3}$$

Under this formulation, both $\mathcal{M}_T$ and $\mathcal{M}_S$ are fixed across tasks, enabling controlled analysis of the contribution of agent-driven data curation to student model specialization.
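
As a rough illustration of Equations (1) to (3), the sketch below treats the objective as a search over a finite list of candidate curation programs. The function name `select_best_program` and the toy stand-ins for the program, `Spec`, and the evaluator are hypothetical and not taken from the paper's environment; in the actual setting the agent proposes programs one at a time rather than enumerating them up front.

```python
from typing import Callable, Iterable, List, Tuple

Dataset = List[dict]


def select_best_program(
    programs: Iterable[Callable[[], Dataset]],
    specialize: Callable[[Dataset], object],
    evaluate: Callable[[object], float],
) -> Tuple[int, float]:
    """Return (index, score) of the candidate program with the highest feedback f."""
    best_index, best_score = -1, float("-inf")
    for index, program in enumerate(programs):
        dataset = program()            # Eq. (1): D_hat = P_A(T; M_T)
        student = specialize(dataset)  # Spec(M_S, D_hat)
        score = evaluate(student)      # Eq. (2): f = E(Spec(...))
        if score > best_score:         # Eq. (3): keep the arg max
            best_index, best_score = index, score
    return best_index, best_score


# Toy stand-ins (hypothetical): the "student" is just the dataset size,
# and the evaluator rewards sizes close to 3.
toy_programs = [lambda: [{}] * 1, lambda: [{}] * 3, lambda: [{}] * 5]
toy_specialize = len
toy_evaluate = lambda student: -abs(student - 3)

best = select_best_program(toy_programs, toy_specialize, toy_evaluate)
```

Because the teacher and student are held fixed, the only thing that varies between candidates in this sketch is the program that produces the data, which mirrors the controlled comparison described above.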

### 2.2 Task Protocol

##### Task Input

As shown in Figure [2](https://arxiv.org/html/2605.30407#S2.F2)(a), for each task the agent is provided with: (1) a brief introduction of the evaluation setting; (2) a basic overview of the target dataset, including dataset description, submission format, optional seed pool, and the public test set for validation; (3) a fixed budget of teacher model API calls that the agent can use to synthesize data; and (4) a fixed student model for domain specialization, together with corresponding standardized fine-tuning and inference parameters. (Footnote 2: By default, we adopt Qwen3-30B-A3B as the teacher model and LLaMA-3.1-8B-Instruct as the student model.)

##### Task Output

The agent is tasked to produce training data $\widehat{\mathcal{D}}$ as a `submission.json` file that conforms to the required format. The submission must be produced by the agent's generated code, with all instances generated via teacher-model API calls rather than directly written into the file.

##### Task Evaluation

We evaluate the agent by the end-to-end performance improvement of the student model. Specifically, the student model is fine-tuned on the submission data $\widehat{\mathcal{D}}$ and then evaluated on the hidden private set. The resulting private-set performance gain (Section [3.1](https://arxiv.org/html/2605.30407#S3.SS1)) serves as a measure of the agent's end-to-end data engineering capability.

##### Task Environment

Our running environment enforces fixed budgets on teacher\-model API calls and wall\-clock time, and provides standardized interfaces for teacher API calls, student model fine\-tuning, and public set evaluation, as detailed in Appendix[D](https://arxiv.org/html/2605.30407#A4)\. In this setting, the agent focuses solely on the data engineering task by implementing the data curation logic through code generation\.

### 2.3 Dataset Preparation

We collect QA reasoning tasks from three representative domains: Science, Code, and Finance, evaluating how agents adapt and improve through autonomous data engineering within each domain.

##### Dataset Selection

We select task domains that satisfy: (i) *specialized tasks* that are not adequately covered by general-purpose pretraining, where targeted specialization is essential to unlock the model's full potential; (ii) *direct evaluation*, enabling deterministic rule-based scoring to serve as environment feedback without execution environments or LLM judgment; and (iii) *broad reasoning patterns* across representative domains. Based on these criteria, we adopt SciBench (Wang et al., [2024b](https://arxiv.org/html/2605.30407#bib.bib29)), LiveCodeBench (Jain et al., [2025](https://arxiv.org/html/2605.30407#bib.bib9)) Test Output Prediction (LCB-TOP), and FinanceReasoning (Tang et al., [2025](https://arxiv.org/html/2605.30407#bib.bib25)) for final evaluation.

##### Dataset Standardization

We derive task descriptions from their official documentation and redesign the original evaluation logic to be fully rule\-based by removing subjective or LLM judgment components\. In addition, we provide a standardized sample submission file for each task that defines the required format for generated training data\. Ultimately, we normalize each task as:

- Dataset Description: an overview of the dataset, component illustration, and data examples.
- Evaluation Script: a script extracting answers from responses and computing dataset scores.
- Seed Data: standardized raw materials for domain specialization, where agent visibility depends on the experiment setting.
- Public Test Set: the visible data split for procedural feedback during iterative optimization.
- Private Test Set: the hidden data split reserved exclusively for final performance evaluation.
- Sample Submission: the required task-specific data generation format.

##### Dataset Partition

For seed data construction, we fix a budget of 1,000 instances per task and ensure that all seeds contain only raw questions and associated context without reference answers (examples in Appendix [H](https://arxiv.org/html/2605.30407#A8)). Specifically, for the Science task, we filter SciInstruct Zhang et al. ([2024](https://arxiv.org/html/2605.30407#bib.bib36)) to retain instances with deterministic numeric answers, and then apply data selection strategies for quality. For the Code task, we draw seeds from LiveCodeBench releases v1–v6 via stratified sampling, further augmented with stratified samples from TACO (Li et al., [2023](https://arxiv.org/html/2605.30407#bib.bib13)). For the Finance task, due to limited related resources, we sample half of FinanceReasoning as seed data. We then construct the public and private splits from SciBench, LCB-TOP, and the remaining portion of FinanceReasoning. The resulting Public Test Set and Private Test Set follow a 1:3 split ratio. Throughout seed construction and test-set splitting, we enforce strict stratified sampling and rigorously ensure zero overlap in problems and contexts to prevent data leakage.
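
The partitioning described above can be sketched as follows. This is only a minimal illustration: the paper does not state which attribute defines a stratum or how overlap is detected, so the `domain` and `question` fields, the helper names, and the exact-match overlap check are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(items, key, public_fraction=0.25, seed=0):
    """Split items into (public, private) so each stratum keeps roughly a 1:3 ratio."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    public, private = [], []
    for stratum in sorted(groups):
        members = groups[stratum][:]
        rng.shuffle(members)
        cut = round(len(members) * public_fraction)
        public.extend(members[:cut])
        private.extend(members[cut:])
    return public, private


def assert_no_overlap(split_a, split_b, field="question"):
    """Raise if any value of `field` appears in both splits (exact-match check only)."""
    shared = {x[field] for x in split_a} & {x[field] for x in split_b}
    if shared:
        raise ValueError(f"{len(shared)} overlapping items, e.g. {sorted(shared)[0]!r}")


# Hypothetical toy pool: 16 questions in two strata of 8 each.
pool = [
    {"question": f"q{i}", "domain": "physics" if i < 8 else "chemistry"}
    for i in range(16)
]
public, private = stratified_split(pool, key=lambda x: x["domain"])
assert_no_overlap(public, private)
```

With a public fraction of 0.25, each stratum contributes one public item for every three private items. A real leakage check would likely also need near-duplicate detection, which this sketch omits.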

### 2.4 Automatic Data Engineering Agent

We investigate agentic data engineering under two representative scenarios: a single\-turn completion setting \(One\-Shot\) and a closed\-loop, self\-optimizing setting \(Iterative Agent\), both illustrated in Figure[2](https://arxiv.org/html/2605.30407#S2.F2)\(b\)\.

##### One\-Shot

In this setting, the agent generates the final submission in a single pass. We provide the agent with a comprehensive prompt containing the necessary task input. The agent then drafts a strategy plan, implements it via `code.py`, and produces `submission.json` (Figure [2](https://arxiv.org/html/2605.30407#S2.F2)(b-ii)). We allow up to 8 independent attempts to mitigate generation failure. Once a valid submission is generated, the process terminates, and the submission is used to fine-tune the student model.

##### Iterative Agent

In this setting, the agent is tasked with continuously enhancing model performance through a closed-loop data engineering process. Inspired by recent advances in self-improving agents (Madaan et al., [2023](https://arxiv.org/html/2605.30407#bib.bib17); Jiang et al., [2025](https://arxiv.org/html/2605.30407#bib.bib10)), we investigate whether LLMs can apply such capabilities to data engineering by leveraging environmental feedback signals. To this end, we design the Iterative Agent, as illustrated in Figure [2](https://arxiv.org/html/2605.30407#S2.F2)(b-i), incorporating four operations:

- Draft. Guided by the task settings and the dataset description, the agent formulates a new data synthesis strategy by outlining a plan and implementing it via executable code.
- Debug. When the generated code throws an error during execution, the agent analyzes the traceback to diagnose and fix errors, ensuring the script executes successfully.
- Repair. When the code executes successfully but the generated `submission.json` fails validation, the agent either refines the synthesis strategy to regenerate data or post-processes existing instances in the raw data, ensuring the submission meets the required quantity and format.
- Improve. Leveraging environmental feedback, the agent employs iterative improvement: it applies a greedy strategy to select the solution with the highest public score from the iteration history (consisting of the plan, code, and submission data), and optimizes it to evolve the synthesis strategy and enhance data quality.

Specifically, the process starts with the Draft operation. The generated code for data curation first undergoes an execution check, and any failure triggers the Debug operation. Upon successful execution, if the output fails the submission validation check (i.e., fewer than 1,000 samples remain after format filtering), the process shifts to the Repair operation. We cap Debug and Repair operations at 3 consecutive attempts, restarting from Draft if this limit is exceeded. If the data validation check passes, the agent submits the curated data, receives feedback from the environment, and proceeds to the Improve operation accordingly. This iterative process enables the agent to simultaneously optimize synthesis strategies, prompt designs, and data distributions, continually driving student model specialization (see Appendix [F](https://arxiv.org/html/2605.30407#A6) for a running example).
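
The transition logic between the four operations can be sketched as a small decision function. This is an illustrative reading of the workflow, not the authors' implementation: the `Attempt` record and `next_operation` are hypothetical names, the sketch assumes a single shared counter for consecutive Debug and Repair attempts, and it uses the at-least-1,000-instances success criterion from Section 3.1.

```python
from dataclasses import dataclass

MAX_FIX_ATTEMPTS = 3      # cap on consecutive Debug/Repair operations (from the text)
MIN_VALID_SAMPLES = 1000  # instances required after format filtering (Section 3.1)


@dataclass
class Attempt:
    """Outcome of running one piece of agent-written curation code (hypothetical)."""
    ran_ok: bool   # did the code execute without raising?
    n_valid: int   # instances left in submission.json after format filtering


def next_operation(attempt: Attempt, consecutive_fixes: int) -> str:
    """Choose the operation that follows an attempt.

    `consecutive_fixes` counts Debug/Repair operations already spent on the
    current draft; sharing one counter across both is an assumption.
    """
    if not attempt.ran_ok:
        return "debug" if consecutive_fixes < MAX_FIX_ATTEMPTS else "draft"
    if attempt.n_valid < MIN_VALID_SAMPLES:
        return "repair" if consecutive_fixes < MAX_FIX_ATTEMPTS else "draft"
    return "improve"  # valid submission: train, evaluate, then refine


transitions = [
    next_operation(Attempt(ran_ok=False, n_valid=0), consecutive_fixes=0),
    next_operation(Attempt(ran_ok=True, n_valid=400), consecutive_fixes=2),
    next_operation(Attempt(ran_ok=True, n_valid=400), consecutive_fixes=3),
    next_operation(Attempt(ran_ok=True, n_valid=1000), consecutive_fixes=0),
]
```

Keeping the transition rule separate from the LLM calls makes the cap on consecutive fixes easy to check in isolation.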

Table 1: Main Results. We report MATS (Mean Attempts to Successful Submission; lower is better) and relative performance Gain (%) over the base Llama-3.1-8B-Instruct model (higher is better), using Qwen3-30B-A3B as the unified teacher model. Results are averaged over two runs, with the raw accuracy scores reported in Table [B](https://arxiv.org/html/2605.30407#A2).

## 3 Experiments

### 3.1 Metric Definition

We assess the agentic data engineering capability in a training\-based setting, where the student model’s post\-training performance gain directly reflects the agent’s effectiveness\.

##### Relative Performance Gain \(%\)

To enable consistent comparison across tasks, we report the *relative performance gain* of the student model:

$$\mathrm{Gain}(\%) \;=\; \frac{\mathrm{Score}(\mathcal{M}_S^{\star}) - \mathrm{Score}(\mathcal{M}_S)}{\mathrm{Score}(\mathcal{M}_S)} \times 100 \tag{4}$$

where $\mathcal{M}_S$ denotes the initial student model and $\mathcal{M}_S^{\star}$ denotes the specialized student model fine-tuned on the agent's final data submission. Positive values indicate successful model specialization, whereas negative values indicate performance degradation. We follow each source benchmark's official evaluation metric, with all tasks evaluated by accuracy. We also report the absolute accuracy of each run in Table [B](https://arxiv.org/html/2605.30407#A2) as a complementary view.
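
As a quick numeric check of Equation (4), the helper below computes the relative gain from two accuracy values. The function name and the example accuracies are made up for illustration and are not results from the paper.

```python
def relative_gain(base_score: float, specialized_score: float) -> float:
    """Relative performance gain in percent, following Eq. (4)."""
    if base_score <= 0:
        raise ValueError("base_score must be positive for a relative gain")
    return (specialized_score - base_score) / base_score * 100.0


# Hypothetical accuracies: a base student at 40.0 that reaches 50.0 after fine-tuning.
improved = relative_gain(40.0, 50.0)   # positive: successful specialization
degraded = relative_gain(40.0, 38.0)   # negative: performance degradation
```

Because the gain is normalized by the base score, the same absolute improvement counts for more on a task where the base student starts low, which is worth keeping in mind when gains are averaged across domains.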

##### Mean Attempts to Success \(MATS\)

MATS measures the average number of trial attempts to obtain a *successful* data submission. Given a run with $N$ attempts, an attempt is marked successful if it generates a `submission.json` and the filtered submission retains at least 1,000 instances after format validation filtering. We report

$$\mathrm{MATS} \;=\; \frac{N}{\sum_{i=1}^{N} \mathbb{I}\left[\mathrm{succ}(i)\right]} \tag{5}$$

where $\mathrm{succ}(i)=1$ if the $i$-th attempt yields a successful submission and $\mathrm{succ}(i)=0$ otherwise.
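
Equation (5) translates directly into a few lines of code. In this sketch the function name is hypothetical, and returning infinity for a run with no successful attempt is an assumption, since the formula is undefined in that case and the text does not say how such runs are reported.

```python
from typing import Sequence


def mats(successes: Sequence[bool]) -> float:
    """Mean Attempts to Success: N divided by the number of successful attempts."""
    n_success = sum(1 for ok in successes if ok)
    if n_success == 0:
        return float("inf")  # assumption: Eq. (5) is undefined without any success
    return len(successes) / n_success


# Hypothetical run: 4 attempts, of which the 2nd and 4th produced valid submissions.
example_mats = mats([False, True, False, True])
```

A value of 1.0 means every attempt yielded a valid submission, and larger values mean more attempts were spent per valid submission.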

### 3.2 Experiment Setup

##### Execution Details

We run experiments under the following budgets: 50,000 total teacher API calls per task (at most 5,000 per attempt), a 3-hour limit per code execution (i.e., data synthesis), and a 12-hour timeout per run, terminating once any budget is exhausted. For the One-Shot scenario, we allow up to 8 attempts; Iterative Agents run for at most 30 iterations. We fine-tune Llama-3.1-8B-Instruct as the student model on 2× H100 GPUs and deploy Qwen3-30B-A3B as the teacher model on 2× H100 GPUs via vLLM with a maximum concurrency of 80 (details in Appendix [C](https://arxiv.org/html/2605.30407#A3)). To verify generalization, we also evaluate alternative teacher-student configurations (see Appendix [E](https://arxiv.org/html/2605.30407#A5)). Each complete iteration cycle (synthesis → training → evaluation) takes 1–2 hours under this setting. We conduct two independent runs and report the final mean performance gain (raw accuracy scores in Table [B](https://arxiv.org/html/2605.30407#A2)).

##### Data Initialization Settings

We evaluate both agents under two distinct settings: (1) From Scratch: The agent must synthesize the entire dataset relying solely on the task description and teacher model API. (2) With Seed: The agent is additionally provided with a seed pool of 1,000 raw questions (as described in Section [2.3](https://arxiv.org/html/2605.30407#S2.SS3)) to guide the data synthesis and exploration process.

![Refer to caption](https://arxiv.org/html/2605.30407v1/x3.png)

Figure 3: Iteration analysis of performance across successful submissions produced by the Iterative Agent.

![Refer to caption](https://arxiv.org/html/2605.30407v1/x4.png)

Figure 4: Quality evaluation of synthesized instructions.

### 3.3 Main Results

##### Iterative optimization drives gains, while seed data ensures stability\.

As shown in Table [1](https://arxiv.org/html/2605.30407#S2.SS4.SSS0.Px2), *Iterative Agents* consistently outperform *One-Shot Agents*. Specifically, in the *from scratch* regime, GPT-5.2 improves its average relative gain from 40.73% to 57.29%, demonstrating the efficacy of LLMs in leveraging environment feedback for self-improvement. Compared with *One-Shot* generation, where a single error can corrupt the process (e.g., DeepSeek-V3.1's 4.58% drop on Code), iterative mechanisms mitigate this risk by repeatedly steering exploration toward higher-quality solutions. Furthermore, adding a 1k seed pool consistently improves performance in both settings. This effect is strongest in the more fragile *One-Shot from scratch* scenario, where most models obtain more than 30% additional relative gain after seeds are introduced. These results suggest that seed data broadens the agent's coverage of the target task distribution while injecting essential domain-relevant knowledge, thereby reducing off-target generation and low-quality instances.

##### LLMs have emerged as independent data engineers for end\-to\-end model specialization\.

Even in the most fragile *One-Shot from scratch* setting, most agents still deliver positive average gains, and GPT-5.2 attains roughly a 40% improvement over the base model. Without external knowledge, environment feedback, or any human-designed workflow, the submission must be produced by agent-written code independently. Under these constraints, the observed gains provide concrete evidence of non-trivial data engineering ability. The LLM agents can autonomously infer what supervision the model lacks, synthesize task-aligned instances, and curate a training set that generalizes to the hidden private task distribution.

##### Compared to stronger models, weaker models benefit more from sophisticated agent frameworks to unlock capabilities\.

Table [1](https://arxiv.org/html/2605.30407#S2.SS4.SSS0.Px2) reveals a consistent interaction between base model capability and the effectiveness of complex agent frameworks. Weaker models experience substantially larger improvements from these advanced designs: DeepSeek-V3.1 surges from 12.50% in the *one-shot from scratch* baseline to 57.65% with iterative optimization and seed data, while stronger models such as GPT-5.2 and the Claude family show relatively modest gains under the same conditions. This pattern suggests that feedback-driven iterative optimization and seed data injection serve as critical guide rails for weaker models.

## 4 Further Analysis

Table 2: Comparison with human-involved settings. For detailed configurations, see Appendix [C.5](https://arxiv.org/html/2605.30407#A3.SS5).

### 4.1 Iterative Data Optimization Analysis

We conduct a controlled analysis to investigate how Iterative Agents improve and what specific aspects are optimized during iteration. Specifically, we increase the iteration and API-call limits and extend the time budget to 48 hours (Figure [3](https://arxiv.org/html/2605.30407#S3.F3)), with all runs generated using GPT-5.2. Since the *public* score is already recorded during the execution loop, we re-evaluate the student model on the *private* test set for every *successful* submission. We also report a *final* score, which is the private score of the submission with the best public score up to time $t$, reflecting what the agent would actually select under the greedy strategy.

##### Iterative Agent demonstrates steady overall improvement across iterations\.

Figure [3](https://arxiv.org/html/2605.30407#S3.F3) shows that *public*, *private*, and *final* scores exhibit a clear upward trend despite some fluctuations across iterations. Substantial gains typically emerge within the first 8 to 15 iterations, beyond which performance plateaus, indicating diminishing returns as the agent reaches the boundaries of its data awareness and cognitive capacity (Shinn et al., [2023](https://arxiv.org/html/2605.30407#bib.bib23)).

##### Greedy public-score selection ensures robust performance.

The fluctuations in Figure [3](https://arxiv.org/html/2605.30407#S3.F3) reflect the intrinsic variance of synthetic data curation, where minor changes to prompts or generation pipelines can substantially shift answer correctness and the data distribution. For instance, in Figure [3](https://arxiv.org/html/2605.30407#S3.F3)(b) (Round 6), a pipeline mistake causes a sharp drop in both public and private performance. The greedy selection rule mitigates such failures by retaining the best historical submission based on *public* score (Chen et al., [2021](https://arxiv.org/html/2605.30407#bib.bib3)). Given that *public* and *private* performance are largely aligned across iterations, this strategy yields a *final* curve that is noticeably more stable and thus less susceptible to occasional catastrophic regressions.
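
The *final* curve described in this section can be reproduced from per-iteration public and private scores with a running arg max. The sketch below is illustrative: the function name and the score values are invented, and breaking ties in favor of the earliest submission is an assumption the text does not specify.

```python
from typing import List, Sequence


def final_curve(public: Sequence[float], private: Sequence[float]) -> List[float]:
    """Private score of the best-public-score submission seen so far, per iteration."""
    best_public = float("-inf")
    best_private = None
    curve = []
    for pub, priv in zip(public, private):
        if pub > best_public:  # strict '>' keeps the earliest submission on ties
            best_public, best_private = pub, priv
        curve.append(best_private)
    return curve


# Hypothetical scores with a regression at the third submission.
public_scores = [0.30, 0.35, 0.10, 0.40]
private_scores = [0.28, 0.33, 0.12, 0.37]
selected = final_curve(public_scores, private_scores)
```

In this toy run the third submission collapses on both splits, but the selected private score stays at 0.33 because the earlier submission still holds the best public score. The curve is only as reliable as the alignment between public and private scores, which the text reports to be largely consistent.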

##### Iteration primarily drives improvements in data diversity\.

Following prior work (Kim et al., [2025](https://arxiv.org/html/2605.30407#bib.bib11)), we conduct quality analysis on the generated submission data using six intrinsic metrics: instruction difficulty (GPT-4o assessment), instruction diversity (embedding similarity), response quality (GPT-4o assessment and Skywork reward model), response diversity (embedding similarity), and response perplexity (LLaMA-3.1-8B). The diagnostics reveal that both instruction and response diversity consistently increase over iterations, while response quality improves only marginally (example in Appendix [F](https://arxiv.org/html/2605.30407#A6)). Consistent with prior work (Yu et al., [2024](https://arxiv.org/html/2605.30407#bib.bib35)), this indicates that iterations primarily expand and diversify synthesized questions rather than enhance quality for existing items.

### 4.2 Human Involvement Influence Analysis
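
The text names embedding similarity as the basis for the diversity metrics but does not give the formula. One common choice, shown here purely as an assumption, is one minus the mean pairwise cosine similarity of the embeddings. The function names and toy vectors are hypothetical, and a real analysis would obtain embeddings from a sentence-embedding model.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def diversity(embeddings: List[Sequence[float]]) -> float:
    """1 - mean pairwise cosine similarity (assumed definition, not the paper's)."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "embeddings": duplicated items versus orthogonal items.
repetitive = diversity([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
spread_out = diversity([[1.0, 0.0], [0.0, 1.0]])
```

Under this definition a set of near-identical instructions scores close to 0, while mutually orthogonal embeddings score 1.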

To systematically examine how varying degrees of human involvement influence data curation, we study several settings as follows:

- Human. We build a 2k training set by sampling from SciInstruct with output-length filtering and diversity-aware clustering (Guha et al., [2025](https://arxiv.org/html/2605.30407#bib.bib5)). We then use LLMs to rewrite instruction pairs into the target evaluation format. In this setting, both the data source and the synthesis pipeline are fully specified by humans.
- DataFlow. DataFlow provides a general from-scratch synthesis pipeline with predefined strategies for generation, filtering, and refinement. We adopt it as a strong method representing a human-designed synthesis pipeline without relying on an external data source.
- Iterative Agent (with seed / from scratch). We report the best-performing submission across all iterative rounds to approximate the current upper bound of the LLM data engineer without any human-designed strategies or recipes.

##### Fully autonomous data engineering shows potential to outperform human\-involved methods\.

Under the same constraint, the pipeline designed by GPT-5.2 surpasses the human-designed DataFlow framework (Table [2](https://arxiv.org/html/2605.30407#S4.T2)). First, LLMs can flexibly adapt their pipeline design strategy to the target task, automatically aligning the synthesized data to the appropriate domain, difficulty, and output format, rather than relying on rigid human-designed logic, which is consistent with ORPO (Yang et al., [2024](https://arxiv.org/html/2605.30407#bib.bib32)). Second, environmental feedback acts as a closed-loop signal that mimics the self-reflection process (Shinn et al., [2023](https://arxiv.org/html/2605.30407#bib.bib23)), enabling LLMs to continuously improve their data curation strategies and progressively shape the data distribution toward the specialized domain.

##### LLMs can match human\-level data complexity, while falling short in generating diverse data\.

As illustrated in Figure [4](https://arxiv.org/html/2605.30407#S3.F4)(b), the from-scratch agent successfully approaches the human baseline in Instruction Difficulty, demonstrating its strong capability to self-curate challenging data. Nevertheless, it exhibits a performance drop in Instruction Diversity and Response Diversity compared to human-involved settings. This result reveals a basic limit of purely LLM-driven data engineering: the generated examples are high-quality but too repetitive.

### 4.3 Failure Mode Analysis

![Refer to caption](https://arxiv.org/html/2605.30407v1/x5.png)

Figure 5: Error type analysis of valid submission generation failure.

##### Data Submission Failure

As shown in Table [1](https://arxiv.org/html/2605.30407#S2.SS4.SSS0.Px2), LLMs often fail to generate valid data in a single round. We perform detailed analysis based on the error-type breakdown (Figure [5](https://arxiv.org/html/2605.30407#S4.F5)).

- •Lack of Quantity Assurance Awareness\. As shown in Figure[5](https://arxiv.org/html/2605.30407#S4.F5), Insufficient Valid Samples dominates errors across most models \(e\.g\., GPT\-5\.2 and DeepSeek\-R1\)\. While agents aggressively filter generated data, they lack the awareness to validate the final dataset size and dynamically replenish discarded samples, ultimately failing the 1,000\-instance quantity check\.
- •Weak Format Handling in Complex Domains\. Error distribution is domain\-dependent\. Insufficient Valid Samples is generally milder in text\-based Finance tasks\. Conversely, Science \(requiring LaTeX\) and Code \(requiring executable logic\) impose formatting constraints, causing massive data rejection and extraction failures\.
- •Potential Over\-Engineering Trap\. We observe an anomalously high rate of LLM Output Truncated specifically for Claude\-4\-Sonnet \(e\.g\., 52\.63% in Code and 59\.31% in Finance\), revealing a tendency to design overly complex, verbose curation pipelines that exceed task requirements\.
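The quantity-assurance gap described above can be illustrated with a filter-then-replenish loop. The following is a minimal sketch, not the pipeline used by any agent in the study: `curate_with_replenishment`, `toy_generate`, and `toy_is_valid` are hypothetical names, and the 40% rejection rate is an arbitrary assumption chosen only to exercise the loop.

```python
import random
from typing import Callable, List


def curate_with_replenishment(
    generate: Callable[[int], List[dict]],
    is_valid: Callable[[dict], bool],
    target_size: int,
    max_rounds: int = 50,
) -> List[dict]:
    """Filter generated samples and regenerate until target_size valid ones exist."""
    kept: List[dict] = []
    for _ in range(max_rounds):
        shortfall = target_size - len(kept)
        if shortfall <= 0:
            break
        # Request only what is still missing; a real pipeline might over-sample.
        batch = generate(shortfall)
        kept.extend(sample for sample in batch if is_valid(sample))
    # Explicit final size check: the step the failing agents tend to omit.
    if len(kept) < target_size:
        raise ValueError(f"only {len(kept)} valid samples, need {target_size}")
    return kept[:target_size]


# Toy stand-ins (hypothetical): roughly 40% of generated samples are rejected.
rng = random.Random(0)


def toy_generate(n: int) -> List[dict]:
    return [
        {"instruction": f"q{i}", "output": "42" if rng.random() > 0.4 else ""}
        for i in range(n)
    ]


def toy_is_valid(sample: dict) -> bool:
    return bool(sample["output"])


dataset = curate_with_replenishment(toy_generate, toy_is_valid, target_size=1000)
print(len(dataset))
```

The point of the sketch is the ordering: filtering happens inside the loop, and the size check happens after it, so aggressive filtering can never silently shrink the submission below the required count.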

##### Model Specialization Failure

Agents demonstrate capabilities for data engineering in most cases, but also encounter typical failure instances\.

- •Distribution Shift of Data\. In a from\-scratch Science failure case \(Appendix[G\.1](https://arxiv.org/html/2605.30407#A7.SS1)\), the agent hard\-coded logic forcing 50% of the data budget into just five narrow topics \(e\.g\., Boltzmann distribution\)\. Instead of broad scientific sampling, this skewed generation caused a severe distribution shift\. Consequently, it induced catastrophic forgetting \(Luo et al\.,[2025](https://arxiv.org/html/2605.30407#bib.bib16)\), causing the student model to overfit to these specific sub\-domains while losing broader competency\.
- •Naive Rule\-Based Augmentation\. In a from\-seed Code failure case \(Appendix[G\.2](https://arxiv.org/html/2605.30407#A7.SS2)\), the agent employed a naive regex strategy to indiscriminately perturb numerical values, ignoring their distinct roles in control flow\. This severed the semantic link between instructions and executable logic, directly violating the SECON principle \(Zhang et al\.,[2025](https://arxiv.org/html/2605.30407#bib.bib38)\)\. Consequently, instead of valid data expansion, the agent injected syntactically broken noise, severely degrading the student model’s performance\.
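A simple post-generation guard could surface the kind of topic concentration seen in the Science failure case. The sketch below is illustrative only: the `topic` field, the function names, and the 5% per-topic cap are assumptions, and the toy data merely mirrors the reported pattern of half the budget going to five topics.

```python
from collections import Counter
from typing import Dict, List


def topic_shares(samples: List[dict]) -> Dict[str, float]:
    """Return the fraction of samples that falls under each topic label."""
    counts = Counter(sample["topic"] for sample in samples)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}


def overrepresented_topics(samples: List[dict], max_share: float = 0.05) -> List[str]:
    """List topics whose share of the dataset exceeds max_share (assumed cap)."""
    shares = topic_shares(samples)
    return sorted(topic for topic, share in shares.items() if share > max_share)


# Toy dataset: 50% of 2,000 samples forced into five narrow topics,
# the remaining 50% spread evenly over fifty broader topics.
samples: List[dict] = []
for i in range(5):
    samples += [{"topic": f"narrow_{i}"}] * 200
for i in range(50):
    samples += [{"topic": f"broad_{i}"}] * 20

flagged = overrepresented_topics(samples)
print(flagged)
```

In this toy setting each narrow topic holds 10% of the data and each broad topic 1%, so only the five narrow topics are flagged. An agent could run such a check before submission and rebalance or resample when anything is flagged.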

Overall, despite demonstrated capability to guide end\-to\-end data curation, LLMs still lack robust post\-generation safeguards for stringent quality assurance and reliable quantity control\.

## 5 Related Work

LLM Agent\. LLM agents \(Wang et al\.,[2024a](https://arxiv.org/html/2605.30407#bib.bib28); Xi et al\.,[2025](https://arxiv.org/html/2605.30407#bib.bib31)\) leverage the reasoning capabilities of foundation models, integrated with external tools \(Schick et al\.,[2023](https://arxiv.org/html/2605.30407#bib.bib22)\) and environmental feedback \(Yao et al\.,[2023](https://arxiv.org/html/2605.30407#bib.bib34)\), to complete tasks\. This revolution has catalyzed the emergence of specialized agents across diverse domains, ranging from autonomous data analysis \(Zhang et al\.,[2023](https://arxiv.org/html/2605.30407#bib.bib37)\) to scientific discovery \(Boiko et al\.,[2023](https://arxiv.org/html/2605.30407#bib.bib2)\)\. These agents primarily consume data to complete tasks\. In contrast, we focus on agent\-driven data production and optimization\.

Data Centric AI\. In the early stages of LLM development, training heavily relied on high\-quality human\-annotated data \(Stiennon et al\.,[2020](https://arxiv.org/html/2605.30407#bib.bib24)\)\. With the rapid depletion of naturally occurring human text, a crisis of data scarcity has become imminent \(Villalobos et al\.,[2024](https://arxiv.org/html/2605.30407#bib.bib27)\)\. Pioneering work such as Self\-Instruct \(Wang et al\.,[2023](https://arxiv.org/html/2605.30407#bib.bib30)\) introduced a paradigm shift, demonstrating that LLMs can synthesize training data from a small seed set to optimize themselves \(Taori et al\.,[2023](https://arxiv.org/html/2605.30407#bib.bib26); Gunasekar et al\.,[2023](https://arxiv.org/html/2605.30407#bib.bib6)\)\. More recent studies further improve the automated generation of synthetic data \(Huang et al\.,[2025](https://arxiv.org/html/2605.30407#bib.bib8); Liang et al\.,[2025](https://arxiv.org/html/2605.30407#bib.bib15)\), yet remain fundamentally dependent on data synthesis pipelines or recipes designed by humans\. Concurrent to our work, DataPrep\-Bench \(Liang et al\.,[2026](https://arxiv.org/html/2605.30407#bib.bib14)\) proposes a benchmark that evaluates LLM\-based data construction and quality scoring methods\. Rather than benchmarking existing methods, our focus lies in the behavior of individual LLM agents, specifically whether they can autonomously execute the full end\-to\-end data engineering loop driven by downstream feedback for model specialization\.

## 6 Conclusion

We present a systematic analysis of *Agentic Data Engineering for Model Specialization* by requiring LLM agents to conduct end\-to\-end data engineering in a closed loop\. Our results across *Science*, *Code*, and *Finance* show that iterative agents consistently yield stronger and more stable specialization, with feedback\-driven iteration improving both data strategy and alignment\. In particular, GPT\-5\.2 reaches a 57\.29% average gain, demonstrating that LLM agents can autonomously author data curricula that drive substantial student model specialization\. Our failure analysis, covering the prevalence of invalid submissions and failed specializations, further exposes the lack of data assurance awareness in current LLMs\.

## Limitations

We acknowledge several limitations in our work\. First, although focusing on QA tasks allows us to efficiently obtain reliable environmental feedback for closed\-loop optimization, this design restricts our evaluation on open\-ended generation tasks where automated evaluation is difficult to achieve\. Second, despite implementing strict budget caps, the Iterative Agent still demands considerable computational resources for model inference and fine\-tuning\. Finally, while we average the results across multiple runs to mitigate fluctuations, coupling complex end\-to\-end data engineering tasks still introduces unavoidable run\-to\-run variance\. We leave broader task coverage and more cost\-efficient strategies to future work\.

## References

- Barres et al\. \(2025\) Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan\. 2025\.[τ²\-bench: Evaluating conversational agents in a dual\-control environment](https://doi.org/10.48550/ARXIV.2506.07982)\.*CoRR*, abs/2506\.07982\.
- Boiko et al\. \(2023\)Daniil A\. Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes\. 2023\.[Autonomous chemical research with large language models](https://doi.org/10.1038/S41586-023-06792-0)\.*Nat\.*, 624\(7992\):570–578\.
- Chen et al\. \(2021\)Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, and 39 others\. 2021\.[Evaluating large language models trained on code](https://arxiv.org/abs/2107.03374)\.*Preprint*, arXiv:2107\.03374\.
- DeepSeek\-AI \(2025\)DeepSeek\-AI\. 2025\.[Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning](https://doi.org/10.48550/ARXIV.2501.12948)\.*CoRR*, abs/2501\.12948\.
- Guha et al\. \(2025\)Etash Kumar Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, and 31 others\. 2025\.[Openthoughts: Data recipes for reasoning models](https://doi.org/10.48550/ARXIV.2506.04178)\.*CoRR*, abs/2506\.04178\.
- Gunasekar et al\. \(2023\)Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li\. 2023\.[Textbooks are all you need](https://doi.org/10.48550/ARXIV.2306.11644)\.*CoRR*, abs/2306\.11644\.
- Hong et al\. \(2024\)Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber\. 2024\.[Metagpt: Meta programming for A multi\-agent collaborative framework](https://openreview.net/forum?id=VtmBAGCN7o)\.In*The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024*\. OpenReview\.net\.
- Huang et al\. \(2025\)Yue Huang, Siyuan Wu, Chujie Gao, Dongping Chen, Qihui Zhang, Yao Wan, Tianyi Zhou, Chaowei Xiao, Jianfeng Gao, Lichao Sun, and Xiangliang Zhang\. 2025\.[Datagen: Unified synthetic dataset generation via large language models](https://openreview.net/forum?id=F5R0lG74Tu)\.In*The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025*\. OpenReview\.net\.
- Jain et al\. \(2025\)Naman Jain, King Han, Alex Gu, Wen\-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar\-Lezama, Koushik Sen, and Ion Stoica\. 2025\.[Livecodebench: Holistic and contamination free evaluation of large language models for code](https://openreview.net/forum?id=chfJJYC3iL)\.In*The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025*\. OpenReview\.net\.
- Jiang et al\. \(2025\)Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu\. 2025\.[AIDE: ai\-driven exploration in the space of code](https://doi.org/10.48550/ARXIV.2502.13138)\.*CoRR*, abs/2502\.13138\.
- Kim et al\. \(2025\)Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, and Graham Neubig\. 2025\.[Evaluating language models as synthetic data generators](https://aclanthology.org/2025.acl-long.320/)\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025*, pages 6385–6403\. Association for Computational Linguistics\.
- Li et al\. \(2024\)Chenxi Li, Yuanhe Tian, Zhaxi Zerong, Yan Song, and Fei Xia\. 2024\.[Challenging large language models with new tasks: A study on their adaptability and robustness](https://doi.org/10.18653/V1/2024.FINDINGS-ACL.485)\.In*Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11\-16, 2024*, Findings of ACL, pages 8140–8162\. Association for Computational Linguistics\.
- Li et al\. \(2023\)Rongao Li, Jie Fu, Bo\-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li\. 2023\.[TACO: topics in algorithmic code generation dataset](https://doi.org/10.48550/ARXIV.2312.14852)\.*CoRR*, abs/2312\.14852\.
- Liang et al\. \(2026\)Hao Liang, Qifeng Cai, Yibo Lin, Jianzhuo Du, Qifeng Xia, Sizhe Qiu, Linzhuang Sun, Meiyi Qiang, Zhaoyang Han, Xiaochen Ma, Bohan Zeng, Ruichuan An, Conghui He, and Wentao Zhang\. 2026\.DataPrep\-Bench: Benchmarking LLMs as Training Data Preparators\.[https://datapreparationbench\.github\.io/assets/DataPrep\-Bench\.pdf](https://datapreparationbench.github.io/assets/DataPrep-Bench.pdf)\.Preprint\.
- Liang et al\. \(2025\)Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, and 1 others\. 2025\.Dataflow: An llm\-driven framework for unified data preparation and workflow automation in the era of data\-centric ai\.*arXiv preprint arXiv:2512\.16676*\.
- Luo et al\. \(2025\)Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang\. 2025\.An empirical study of catastrophic forgetting in large language models during continual fine\-tuning\.*IEEE Transactions on Audio, Speech and Language Processing*\.
- Madaan et al\. \(2023\)Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark\. 2023\.[Self\-refine: Iterative refinement with self\-feedback](http://papers.nips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html)\.In*Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023*\.
- Mishra et al\. \(2022\)Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi\. 2022\.[Cross\-task generalization via natural language crowdsourcing instructions](https://doi.org/10.18653/V1/2022.ACL-LONG.244)\.In*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2022, Dublin, Ireland, May 22\-27, 2022*, pages 3470–3487\. Association for Computational Linguistics\.
- Ni et al\. \(2023\)Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen\-Tau Yih, Sida I\. Wang, and Xi Victoria Lin\. 2023\.[LEVER: learning to verify language\-to\-code generation with execution](https://proceedings.mlr.press/v202/ni23b.html)\.In*International Conference on Machine Learning, ICML 2023, 23\-29 July 2023, Honolulu, Hawaii, USA*, Proceedings of Machine Learning Research, pages 26106–26128\. PMLR\.
- Qiao et al\. \(2024\)Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, and Huajun Chen\. 2024\.[Autoact: Automatic agent learning from scratch for QA via self\-planning](https://doi.org/10.18653/V1/2024.ACL-LONG.165)\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024*, pages 3003–3021\. Association for Computational Linguistics\.
- Qin et al\. \(2024\)Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun\. 2024\.[Toolllm: Facilitating large language models to master 16000\+ real\-world apis](https://openreview.net/forum?id=dHng2O0Jjr)\.In*The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024*\. OpenReview\.net\.
- Schick et al\. \(2023\)Timo Schick, Jane Dwivedi\-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom\. 2023\.[Toolformer: Language models can teach themselves to use tools](http://papers.nips.cc/paper_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html)\.In*Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023*\.
- Shinn et al\. \(2023\)Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao\. 2023\.[Reflexion: language agents with verbal reinforcement learning](http://papers.nips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html)\.In*Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023*\.
- Stiennon et al\. \(2020\)Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel M\. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F\. Christiano\. 2020\.[Learning to summarize with human feedback](https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html)\.In*Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6\-12, 2020, virtual*\.
- Tang et al\. \(2025\)Zichen Tang, Haihong E, Ziyan Ma, Haoyang He, Jiacheng Liu, Zhongjun Yang, Zihua Rong, Rongjin Li, Kun Ji, Qing Huang, Xinyang Hu, Yang Liu, and Qianhe Zheng\. 2025\.[Financereasoning: Benchmarking financial numerical reasoning more credible, comprehensive and challenging](https://aclanthology.org/2025.acl-long.766/)\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025*, pages 15721–15749\. Association for Computational Linguistics\.
- Taori et al\. \(2023\)Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B\. Hashimoto\. 2023\.[Alpaca: A Strong, Replicable Instruction\-Following Model](https://crfm.stanford.edu/2023/03/13/alpaca.html)\.
- Villalobos et al\. \(2024\)Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn\. 2024\.[Position: Will we run out of data? limits of LLM scaling based on human\-generated data](https://openreview.net/forum?id=ViZcgDQjyG)\.In*Forty\-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21\-27, 2024*\. OpenReview\.net\.
- Wang et al\. \(2024a\)Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen\. 2024a\.[A survey on large language model based autonomous agents](https://doi.org/10.1007/S11704-024-40231-1)\.*Frontiers Comput\. Sci\.*, 18\(6\):186345\.
- Wang et al\. \(2024b\)Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R\. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang\. 2024b\.[Scibench: Evaluating college\-level scientific problem\-solving abilities of large language models](https://openreview.net/forum?id=bq1JEgioLr)\.In*Forty\-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21\-27, 2024*\. OpenReview\.net\.
- Wang et al\. \(2023\)Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A\. Smith, Daniel Khashabi, and Hannaneh Hajishirzi\. 2023\.[Self\-instruct: Aligning language models with self\-generated instructions](https://doi.org/10.18653/V1/2023.ACL-LONG.754)\.In*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2023, Toronto, Canada, July 9\-14, 2023*, pages 13484–13508\. Association for Computational Linguistics\.
- Xi et al\. \(2025\)Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, and 9 others\. 2025\.[The rise and potential of large language model based agents: a survey](https://doi.org/10.1007/S11432-024-4222-0)\.*Sci\. China Inf\. Sci\.*, 68\(2\)\.
- Yang et al\. \(2024\)Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V\. Le, Denny Zhou, and Xinyun Chen\. 2024\.[Large language models as optimizers](https://arxiv.org/abs/2309.03409)\.*Preprint*, arXiv:2309\.03409\.
- Yang et al\. \(2023\)Hongyang Yang, Xiao\-Yang Liu, and Christina Dan Wang\. 2023\.[Fingpt: Open\-source financial large language models](https://doi.org/10.48550/ARXIV.2306.06031)\.*CoRR*, abs/2306\.06031\.
- Yao et al\. \(2023\)Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R\. Narasimhan, and Yuan Cao\. 2023\.[React: Synergizing reasoning and acting in language models](https://openreview.net/forum?id=WE_vluYUL-X)\.In*The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023*\. OpenReview\.net\.
- Yu et al\. \(2024\)Simon Yu, Liangyu Chen, Sara Ahmadian, and Marzieh Fadaee\. 2024\.[Diversify and conquer: Diversity\-centric data selection with iterative refinement](https://arxiv.org/abs/2409.11378)\.*Preprint*, arXiv:2409\.11378\.
- Zhang et al\. \(2024\)Dan Zhang, Ziniu Hu, Sining Zhoubian, Zhengxiao Du, Kaiyu Yang, Zihan Wang, Yisong Yue, Yuxiao Dong, and Jie Tang\. 2024\.[Sciinstruct: a self\-reflective instruction annotated dataset for training scientific language models](http://papers.nips.cc/paper_files/paper/2024/hash/02ee6b7295f720407b56c457b34c54d5-Abstract-Datasets_and_Benchmarks_Track.html)\.In*Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024*\.
- Zhang et al\. \(2023\)Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang\. 2023\.[Data\-copilot: Bridging billions of data and humans with autonomous workflow](https://doi.org/10.48550/ARXIV.2306.07209)\.*CoRR*, abs/2306\.07209\.
- Zhang et al\. \(2025\)Xu Zhang, Zexu Lin, Xiaoyu Hu, Jianlei Wang, Wenpeng Lu, and De\-Yu Zhou\. 2025\.[Secon: Maintaining semantic consistency in data augmentation for code search](https://doi.org/10.1145/3686151)\.*ACM Trans\. Inf\. Syst\.*, 43\(2\)\.
- Zhou et al\. \(2025\)Qiannan Zhou, Fei Xu, Lingxuan Weng, Ruixing Li, Xudong Wu, Li Chen, Zhi Zhou, and Fangming Liu\. 2025\.[Espresso: Cost\-efficient large model training by exploiting GPU heterogeneity in the cloud](https://doi.org/10.1109/INFOCOM55648.2025.11044693)\.In*IEEE INFOCOM 2025 \- IEEE Conference on Computer Communications, London, United Kingdom, May 19\-22, 2025*, pages 1–10\. IEEE\.

## Appendix A Dataset Details

To systematically analyze the capability of LLM agents in end\-to\-end data engineering, we curate datasets across three specialized domains: Science, Code, and Finance\. We elaborate on the source datasets and the rationale behind our construction choices below; the standardization protocol and partition rules follow Section[2\.3](https://arxiv.org/html/2605.30407#S2.SS3)\.

##### Science\.

We build the Science task upon SciBench \(Wang et al\.,[2024b](https://arxiv.org/html/2605.30407#bib.bib29)\) and SciInstruct \(Zhang et al\.,[2024](https://arxiv.org/html/2605.30407#bib.bib36)\)\. SciBench evaluates college\-level scientific reasoning across physics, chemistry, and mathematics, providing a rigorous testbed for model specialization, while SciInstruct serves as a diverse instruction\-tuning corpus suitable for seed construction\. Since our environment requires deterministic rule\-based scoring, we filter SciInstruct to retain instances with definitive numeric answers before applying quality\-aware selection to form the seed pool\. SciBench is used exclusively for evaluation, with zero overlap in problems and contexts against the seed pool\.

##### Code\.

For the programming domain, we adopt LiveCodeBench \(LCB\) \(Jain et al\.,[2025](https://arxiv.org/html/2605.30407#bib.bib9)\) and TACO \(Li et al\.,[2023](https://arxiv.org/html/2605.30407#bib.bib13)\)\. We focus on the Test Output Prediction sub\-task of LCB \(LCB\-TOP\), which requires predicting program execution outputs and thus demands deep algorithmic understanding\. Seed instances are drawn from LCB releases v1–v6 via stratified sampling, augmented with stratified samples from TACO to broaden diversity\. The evaluation sets are constructed exclusively from LCB\-TOP under the same zero\-overlap constraint\.

##### Finance\.

For the financial domain, we adopt FinanceReasoning \(Tang et al\.,[2025](https://arxiv.org/html/2605.30407#bib.bib25)\), which targets deep financial logic, numerical reasoning, and domain\-specific text comprehension\. As high\-quality open\-source resources for complex financial reasoning are scarce, both the seed and the evaluation sets are derived from FinanceReasoning via disjoint stratified splits, ensuring no contextual leakage between the seed pool and the test sets\.
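A disjoint stratified split of the kind described for Finance can be sketched as follows. This is an illustration under stated assumptions, not the authors' partition script: the `difficulty` stratum key, the 50/50 ratio, and the function name are hypothetical, and disjointness is enforced only at the item level, whereas the paper additionally rules out contextual leakage.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Tuple


def stratified_disjoint_split(
    items: List[dict],
    key: Callable[[dict], Hashable],
    seed_fraction: float = 0.5,
    seed: int = 0,
) -> Tuple[List[dict], List[dict]]:
    """Split items into a seed pool and an evaluation set, stratum by stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    seed_pool: List[dict] = []
    eval_set: List[dict] = []
    for stratum in sorted(strata):  # sorted for a reproducible order
        group = strata[stratum]
        rng.shuffle(group)
        cut = int(len(group) * seed_fraction)
        seed_pool.extend(group[:cut])  # each item goes to exactly one side
        eval_set.extend(group[cut:])
    return seed_pool, eval_set


# Toy corpus with three equally sized difficulty strata (hypothetical labels).
levels = ["easy", "medium", "hard"]
corpus = [{"id": i, "difficulty": levels[i % 3]} for i in range(300)]
seed_pool, eval_set = stratified_disjoint_split(corpus, key=lambda x: x["difficulty"])
print(len(seed_pool), len(eval_set))
```

Because every item is assigned to exactly one side within its stratum, the two sets cannot share items, and each stratum keeps the same proportion on both sides.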

## Appendix B Main Result Details

Table 3: Base\-model performance on each task\. We use LLaMA\-3\.1\-8B\-Instruct as the unified base model\.

Table[3](https://arxiv.org/html/2605.30407#A2.T3) details the baseline performance of our unified backbone model, LLaMA\-3\.1\-8B\-Instruct, across the three target domains\. These scores reflect the model’s capabilities on the private test sets prior to any instruction tuning or distillation\. Specifically, the base model achieves accuracy scores of 16\.74 on science domain \(SciBench\), 21\.18 on code domain \(LCB\-TOP\), and 39\.93 on finance domain \(FinanceReasoning\)\. These results establish a performance baseline, highlighting the challenges of these specialized tasks for the off\-the\-shelf model and serving as a reference for quantifying the improvements gained through our synthetic data generation methods\.

Table[B](https://arxiv.org/html/2605.30407#A2) provides the comprehensive raw data from our main experiments\. We report two key metrics: MATS \(Mean Attempts to Success Submission\), which quantifies the efficiency of the agent in generating valid datasets, and Accuracy \(%\), which measures the performance of the student model fine\-tuned on the synthesized data\. The results cover both Specialization from Scratch and Specialization with Seed settings across all three domains\. To demonstrate the stability of our approach, we report results from at least two independent runs for each agent configuration\.

Table 4: Raw scores of the main experiment\. We report MATS \(Mean Attempts to Success Submission\) and the Accuracy \(Acc, %\) of the fine\-tuned student model LLaMA\-3\.1\-8B\-Instruct\. Scores of the original student model are reported in Table[3](https://arxiv.org/html/2605.30407#A2.T3)\. The teacher model is set to Qwen3\-30B\-A3B globally\.

## Appendix C Experiment Configuration Details

In this section, we provide a comprehensive breakdown of the experimental configurations used in our study\. This includes the hyperparameter settings for our data synthesis agents \(both One\-Shot and Iterative\), the specific configurations for model training via LoRA, and the inference parameters employed for both the teacher and student models\.

### C\.1 One\-Shot Agent Configuration

The One\-Shot Agent represents a baseline approach where the synthetic dataset is generated in a single pass without iterative refinement\. The configuration is designed to balance generation speed with robustness against potential code execution failures\.

Key configuration details include:

- •Resource Constraints: We limit the total runtime to 12 hours \(Max Time Hours\) and the dataset size to 2,000 samples \(Dataset Size\)\. We enforce a 3\-hour limit per code execution \(data synthesis\) and a strict budget of 50,000 total teacher API calls per task \(≤5,000 per attempt\)\.
- •Teacher Model: We utilize Qwen3\-30B\-A3B as the teacher model\. To maximize throughput during the data generation phase, we set the Api Concurrency to 80, allowing parallel processing of multiple data points via the vLLM deployment\.
- •Robustness Mechanism: We set Max Generation Attempts to 8\. This allows the agent to retry the generation process up to 8 times if the code fails to execute or produces invalid JSON output\.
- •Student Model Environment: The student model serves as the validator\. It is hosted locally using vLLM with Vllm Max Num Seqs set to 128 to optimize inference throughput during the validation phase\.

The exact configuration file is presented in Listing[C\.1](https://arxiv.org/html/2605.30407#A3.SS1)\.

Detailed Configuration for the One\-Shot Agent\.

```
# General Settings
common:
  DATASET_SIZE: 2000
  MAX_TIME_HOURS: 12
  EXECUTION_TIMEOUT_MIN: 180
  MAX_GENERATION_ATTEMPTS: 8

# Teacher Model Settings (Generator)
teacher:
  TEACHER_MODEL: Qwen3-30B-A3B
  TOTAL_API_LIMIT: 50000
  SESSION_API_LIMIT: 5000
  API_CONCURRENCY: 80

# Student Model Settings (Validator)
student:
  LOCAL_MODEL: Llama-3_1-8B-Instruct
  VLLM_PORT: 8099
  VLLM_MAX_NUM_SEQS: 128
  VLLM_CONCURRENCY: 64
  VLLM_MAX_TOKENS: 8192
```
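The interaction between the per-attempt and total API limits can be made concrete with a small budget tracker. This is a hedged sketch: the `ApiBudget` class is hypothetical and is not the platform's `api_count` tool, and it assumes that a "session" in the configuration corresponds to one generation attempt.

```python
class ApiBudget:
    """Track teacher API calls against a per-session and a total limit."""

    def __init__(self, total_limit: int, session_limit: int) -> None:
        self.total_limit = total_limit
        self.session_limit = session_limit
        self.total_used = 0
        self.session_used = 0

    def start_session(self) -> None:
        # A new generation attempt resets only the per-session counter.
        self.session_used = 0

    def try_consume(self, n: int = 1) -> bool:
        """Reserve n calls if both limits allow it; otherwise change nothing."""
        if self.session_used + n > self.session_limit:
            return False
        if self.total_used + n > self.total_limit:
            return False
        self.session_used += n
        self.total_used += n
        return True


# Values taken from the listing above.
budget = ApiBudget(total_limit=50_000, session_limit=5_000)
MAX_GENERATION_ATTEMPTS = 8

for _ in range(MAX_GENERATION_ATTEMPTS):
    budget.start_session()
    budget.try_consume(5_000)  # an attempt that uses its full allowance

print(budget.total_used)
```

Under this reading, eight attempts at 5,000 calls each use at most 40,000 calls, so the One-Shot Agent would hit its attempt limit before exhausting the 50,000-call total.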

### C\.2 Iterative Agent Configuration

The Iterative Agent introduces a feedback loop where the agent analyzes the performance of the trained model and refines the dataset accordingly\. This requires a more sophisticated configuration to handle the cycle of drafting, training, evaluating, and improving\.

Key distinctions in the configuration include:

- •Self\-Correction Loop: The agent runs for at most 30 iterations\. Within each iteration, the agent may encounter errors in its proposed code\. We define Max Debug Attempts as 3 for self\-correction within a single cycle\.
- •Resource Allocation: The system adheres to the global 12\-hour time limit and monitors the Total Api Limit of 50,000 calls to decide whether to continue refining or finalize the submission\. Each complete iteration cycle takes approximately 1–2 hours\.

The specific parameters are detailed in Listing[C\.2](https://arxiv.org/html/2605.30407#A3.SS2)\.

Detailed Configuration for the Iterative Agent\.

```
# General Settings
common:
  DATASET_SIZE: 2000
  MAX_TIME_HOURS: 12
  MAX_ITERATIONS: 30
  MAX_DEBUG_ATTEMPTS: 3

# Teacher Model (Data Generator)
teacher:
  TEACHER_MODEL: Qwen3-30B-A3B
  API_CONCURRENCY: 80

# Agent Core (Controller)
iterative-agent:
  AGENT_MODEL: your_agent_model
  env_vars:
    <<: [ *common_settings, *teacher_config, *student_config ]
```
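The control flow implied by this configuration can be outlined as a nested loop: an outer loop over iterations bounded by the time limit, and an inner loop of debug attempts. The sketch below is an assumption-laden outline rather than the released agent code. The callables `propose`, `execute`, and `train_and_eval` are hypothetical placeholders, the API-budget check is omitted for brevity, and `max_debug_attempts` is interpreted as the number of executions allowed per iteration.

```python
import time
from typing import Callable, List, Optional, Tuple


def run_iterative_agent(
    propose: Callable[[Optional[str]], object],
    execute: Callable[[object], List],
    train_and_eval: Callable[[List], float],
    max_iterations: int = 30,
    max_debug_attempts: int = 3,
    max_time_hours: float = 12.0,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[float], Optional[List]]:
    """Return the best (score, dataset) pair found across all iterations."""
    deadline = clock() + max_time_hours * 3600
    best_score: Optional[float] = None
    best_dataset: Optional[List] = None
    feedback: Optional[str] = None

    for _ in range(max_iterations):
        if clock() >= deadline:
            break
        code = propose(feedback)
        dataset = None
        for attempt in range(max_debug_attempts):
            try:
                dataset = execute(code)
                break
            except Exception as err:
                # Ask for a fix only if another attempt remains.
                if attempt + 1 < max_debug_attempts:
                    code = propose(f"execution error: {err}")
        if dataset is None:
            feedback = "all debug attempts failed"
            continue
        score = train_and_eval(dataset)
        if best_score is None or score > best_score:
            best_score, best_dataset = score, dataset
        feedback = f"score={score}"
    return best_score, best_dataset


# Toy stand-ins: odd-numbered "programs" crash, even ones yield a dataset.
counter = {"n": 0}


def toy_propose(feedback: Optional[str]) -> int:
    counter["n"] += 1
    return counter["n"]


def toy_execute(code: int) -> List[int]:
    if code % 2 == 1:
        raise RuntimeError("odd program crashed")
    return [code]


def toy_train_and_eval(dataset: List[int]) -> float:
    return float(dataset[0] % 10)


best = run_iterative_agent(toy_propose, toy_execute, toy_train_and_eval, max_iterations=5)
print(best)
```

Reporting the best submission across rounds, as the loop does, matches how the Iterative Agent results are described in the main text.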

### C\.3 Model Training Parameters

All synthetic datasets are evaluated by training a student model using Supervised Fine\-Tuning \(SFT\)\. We adopt a parameter\-efficient approach using Low\-Rank Adaptation \(LoRA\) on 2×H100 GPUs\.

The complete training configuration is provided in Listing[C\.3](https://arxiv.org/html/2605.30407#A3.SS3) and is standardized as follows:

- •Optimization Strategy: We use the AdamW optimizer with a learning rate of $1\.0\times 10^{-4}$ and a cosine learning rate scheduler\. A warmup ratio of 0\.1 is applied to stabilize the early training phase\.
- •LoRA Configuration: We target all linear layers \(Lora Target: all\) with a rank of 8 \(Lora Rank\) and an alpha of 16 \(Lora Alpha\)\. This configuration provides a good balance between adaptation capacity and parameter efficiency\.
- •Compute Efficiency: To accommodate hardware constraints, we set the per\-device batch size to 1 but employ Gradient Accumulation Steps of 8\. Training is performed in Bf16 precision to reduce memory usage\. The model is trained for 3 epochs with evaluations performed every 500 steps\.

Hyperparameters for Model Training \(SFT with LoRA\)\.

```
# Training Stage
stage: sft
finetuning_type: lora
lora_target: all
lora_rank: 8
lora_alpha: 16
lora_dropout: 0

# Hyperparameters
learning_rate: 1.0e-4
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true

# Batch Size & Gradient Accumulation
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
cutoff_len: 2048
```
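To see what these batch settings imply for the optimization schedule, the arithmetic can be worked through directly. The sketch assumes plain data parallelism across the two GPUs, a 2,000-example training set, and ceiling rounding for step counts. None of these is confirmed by the listing, and the training framework may round differently.

```python
import math
from typing import Tuple


def training_schedule(
    num_examples: int,
    per_device_batch: int,
    grad_accum: int,
    num_devices: int,
    epochs: int,
) -> Tuple[int, int, int]:
    """Return (effective batch size, optimizer steps per epoch, total steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


effective_batch, steps_per_epoch, total_steps = training_schedule(
    num_examples=2000,   # dataset size from the agent configuration
    per_device_batch=1,
    grad_accum=8,
    num_devices=2,       # assumption: both H100s are used data-parallel
    epochs=3,
)
warmup_steps = math.ceil(0.1 * total_steps)  # warmup_ratio = 0.1

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the effective batch size is 16, giving 125 optimizer steps per epoch and 375 in total, of which about 38 are warmup. That total is below the 500-step evaluation interval mentioned above, so in this reading a run would finish before an interval-based evaluation is triggered.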

### C\.4 Model Inference Parameters

The inference process involves two distinct phases: data generation \(Teacher\) and model evaluation \(Student\)\. The parameters for each phase are optimized for their specific objectives\.

#### C\.4\.1 Teacher Model Inference

The teacher model \(Qwen3\-30B\-A3B\) is deployed on separate 2×H100 GPUs via vLLM\. The inference parameters are managed dynamically by the agent code:

- •Concurrency: We utilize an asyncio\.Semaphore with a limit of 80 \(Api Concurrency\) to maximize generation throughput against the vLLM server\.
- •Sampling: The teacher model uses standard sampling parameters to ensure diversity in the generated synthetic data\.
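The concurrency cap described above follows a standard `asyncio` pattern. In the minimal sketch below, `call_teacher` is a hypothetical stand-in that sleeps instead of sending a request to the vLLM server, and the peak-concurrency counter exists only to show that the semaphore holds the limit.

```python
import asyncio
from typing import List, Tuple

API_CONCURRENCY = 80  # value from the configuration


async def call_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a request to the teacher's vLLM endpoint."""
    await asyncio.sleep(0.001)
    return f"response to: {prompt}"


async def generate_all(prompts: List[str], limit: int = API_CONCURRENCY) -> Tuple[List[str], int]:
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def bounded_call(prompt: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # at most `limit` calls run at the same time
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await call_teacher(prompt)
            finally:
                in_flight -= 1

    # gather preserves input order, so responses line up with prompts.
    results = await asyncio.gather(*(bounded_call(p) for p in prompts))
    return list(results), peak


responses, peak_concurrency = asyncio.run(generate_all([f"q{i}" for i in range(500)]))
print(len(responses), peak_concurrency)
```

All 500 tasks are created up front, but the semaphore ensures that no more than 80 are inside the request section at any moment, which keeps the server load bounded without serializing the work.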

#### C\.4\.2 Student Model Inference \(Evaluation\)

The student model is deployed locally using the vLLM engine\. The configuration focuses on memory stability and evaluation throughput:

- System Configuration: We explicitly set Gpu Memory Utilization to 0.85 to reserve GPU memory for activation overheads, preventing Out-Of-Memory (OOM) errors during long-sequence processing. Max Num Seqs is set to 128 to fully utilize the GPU's parallel processing capacity.
- Tensor Parallelism: The system automatically detects the number of available GPUs and scales the Tensor Parallel Size accordingly.
- Dynamic Sampling: During evaluation, sampling parameters (e.g., temperature) are not hardcoded but are generated dynamically based on the specific dataset requirements (defined in `evaluate.py`).

These settings are summarized in Listing [C.4.2](https://arxiv.org/html/2605.30407#A3.SS4.SSS2).

Inference Parameters for Teacher and Student Models.

```
# Teacher Model
teacher_inference:
  model: Qwen3-30B-A3B
  deployment: vLLM (2x H100)
  concurrency_limit: 80
  temperature: default

# Student Model
student_system:
  engine: vLLM
  gpu_memory_utilization: 0.85
  max_model_len: 8192
  max_num_seqs: 128
  tensor_parallel_size: auto

student_sampling:
  temperature: dynamic
  top_p: dynamic
```

### C.5 Human Involvement Analysis Setup

Across all settings, we fix the teacher model to Qwen3-30B-A3B. We cap the API budget at 5,000 calls for Iterative Agent and run DataFlow with its official configuration of 6,000 calls. Each final synthesized training set contains 2,000 examples. The student model and the training and inference parameters are kept identical across all settings.

## Appendix D Platform Design

Our benchmark is accompanied by an execution platform that enables agents to synthesize data and run end-to-end model training within a controlled environment (Figure [2](https://arxiv.org/html/2605.30407#S2.F2)(a)).

##### Agent Toolkit

We provide a set of programmatic tools that agents can invoke directly, including `check_submission` (format and schema validation), `train_model` (fine-tuning on the submitted data), `evaluate_dataset` (public-set evaluation and bad-case demonstration), and `api_count` (API-usage tracking). We additionally ship a lightweight library with pre-defined helper functions for common operations during code generation, such as batched API calls and robust answer parsing. These tools allow an LLM agent to implement its own data-synthesis logic while seamlessly integrating with the full pipeline.
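As an illustration of what format and schema validation might involve, the sketch below checks that a submission is a list of records with non-empty string fields `instruction`, `input`, and `output`, the three fields that appear in the agent-generated code in Appendix F. The function name `check_submission_sketch` and its return convention are hypothetical; the platform's actual `check_submission` tool may apply different or additional rules.

```python
from typing import Any, List, Tuple

REQUIRED_FIELDS = ("instruction", "input", "output")


def check_submission_sketch(records: Any) -> Tuple[bool, List[str]]:
    """Return (is_valid, error_messages) for a parsed submission (hypothetical rules)."""
    if not isinstance(records, list):
        return False, ["top-level JSON value must be a list"]
    errors: List[str] = []
    for idx, rec in enumerate(records):
        if not isinstance(rec, dict):
            errors.append(f"record {idx}: not an object")
            continue
        for field in REQUIRED_FIELDS:
            value = rec.get(field)
            # Each required field must be present and a non-blank string.
            if not isinstance(value, str) or not value.strip():
                errors.append(f"record {idx}: field '{field}' missing or empty")
    return (not errors), errors


if __name__ == "__main__":
    sample = [
        {"instruction": "Solve.", "input": "Question: 1+1?", "output": "2"},
        {"instruction": "Solve.", "input": "", "output": "2"},
    ]
    print(check_submission_sketch(sample))
```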

##### Execution & Evaluation Environment

We offer an isolated execution environment that separates both \(i\) the per\-run workspace and \(ii\) the runtime execution context at the thread level, enabling safe concurrent runs while preventing data contamination\. During execution, the platform continuously monitors API calls and remaining time, automatically terminating runs that exceed the prescribed budgets\.
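The budget enforcement described above can be sketched as a small tracker that is charged before every teacher call and raises once either limit is exhausted. The class and method names are hypothetical, the one-hour time limit is a placeholder (the prescribed time budget is not stated in this section), and the real platform enforces budgets from outside the agent's process rather than through an in-process object.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when a run exceeds its API-call or time budget."""


class RunBudget:
    """Tracks API calls and elapsed time for one run (illustrative sketch)."""

    def __init__(self, max_api_calls: int, max_seconds: float, clock=time.monotonic):
        self.max_api_calls = max_api_calls
        self.max_seconds = max_seconds
        self._clock = clock
        self._start = clock()
        self.api_calls = 0

    def remaining_calls(self) -> int:
        return max(0, self.max_api_calls - self.api_calls)

    def remaining_seconds(self) -> float:
        return max(0.0, self.max_seconds - (self._clock() - self._start))

    def charge(self, n_calls: int = 1) -> None:
        """Record n_calls, or raise if the run is already out of budget."""
        if self.remaining_seconds() <= 0:
            raise BudgetExceeded("time budget exhausted")
        if n_calls > self.remaining_calls():
            raise BudgetExceeded("API-call budget exhausted")
        self.api_calls += n_calls


if __name__ == "__main__":
    budget = RunBudget(max_api_calls=5000, max_seconds=3600.0)  # time limit is a placeholder
    budget.charge()
    print(budget.remaining_calls())
```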

After each run completes, we replay the evaluation on the final `submission.json` by re-training the model on the submitted dataset and re-evaluating it on the held-out private test set, ensuring that the reported gains are attributable to the data.

## Appendix E Generalization Across Different Teacher-Student Configurations

In our main experiments, we employ a fixed teacher-student model configuration. This serves as a controlled evaluation setup that provides a consistent scale for comparison, following common practice in prior empirical studies (e.g., the fixed user simulator in $\tau^2$-Bench; Barres et al., [2025](https://arxiv.org/html/2605.30407#bib.bib1)).

To demonstrate that our findings are robust and not overfit to a specific model pair, we conduct additional experiments with the GPT-5.2 agent under varying teacher and student configurations on the Science task. Results are summarized in Table [5](https://arxiv.org/html/2605.30407#A5.T5).

Table 5: Performance gain of the GPT-5.2 agent across different teacher and student model configurations. The iterative approach consistently outperforms the one-shot baseline regardless of the underlying models.

As shown in Table [5](https://arxiv.org/html/2605.30407#A5.T5), while the absolute value of the Gain (%) naturally varies depending on the base capabilities of the chosen models, the overall trends remain highly consistent. Specifically, the iterative agent consistently yields positive gains and maintains a significant performance gap over the one-shot baseline across all configurations. These results confirm that the fixed teacher-student setting in our main text serves its purpose of providing a consistent evaluation scale, and that the effectiveness of autonomous data engineering generalizes well to other model combinations.

## Appendix F Running Example of Iterative Agent

To demonstrate how the Iterative Agent optimizes its own data synthesis pipeline, we select an optimization example of GPT-5.2 on the Science task and walk through the details of the iterative improvement.

Here is the original plan drafted by the model:

Original Plan of GPT-5.2 in Science Task.

1. Define the instruction format:
   - Define a single fixed `"instruction"` string matching `sample_submission.json` style
2. Build topic templates spanning:
   - Mathematics: calculus (integrals, series, multivariable), differential equations, linear algebra, probability/statistics
   - Physics: mechanics, E&M, circuits, waves/optics, thermodynamics
   - Chemistry: gases (vdW/ideal), equilibrium, kinetics, thermodynamics, electrochemistry, colligative properties
3. Generate parameterized prompts (per template) requiring the teacher to:
   - Write a textbook-style problem with given numbers and a specified answer unit
   - Provide a step-by-step solution including computations
   - End with: `The answer is therefore \boxed{...}.` where `...` is a decimal with exactly three digits
4. Bulk generation:
   - Use `api_generate_batch` with batch size 80 to produce ~2200 candidates (buffer for filtering)
5. Post-process and filter:
   - Keep only items whose output contains at least one `\boxed{...}` and whose last `\boxed{}` matches a decimal number with 3 digits
   - Ensure output ends with the exact final sentence
   - Ensure fields `instruction`, `input`, `output` are non-empty strings
6. Regenerate if needed:
   - If < 2000 valid samples, regenerate only the deficit with stricter formatting reminders
7. Finalize dataset:
   - Shuffle, truncate to first 2000, and write to `../submission/submission.json`
8. Checkpointing:
   - Save checkpoints every 200 valid samples to avoid data loss

Based on self-reflection and environmental feedback, GPT-5.2 proposed the following improvements to the original plan:

Improvement Plan of GPT-5.2 in Science Task.

1. Improve solution quality by:
   - Generating more focused, step-by-step solutions without excessive verbosity
   - Ensuring all calculations are complete and accurate
   - Requiring clear final answers in proper boxed format
2. Enhance problem diversity by:
   - Creating problems across broader difficulty ranges
   - Including more applied/real-world scientific scenarios
   - Balancing theoretical and computational problems
3. Better prompt engineering:
   - More specific instructions for concise, accurate solutions
   - Explicit requirements for complete calculations
   - Template-based solution structure to ensure consistency
4. Quality control:
   - Filter out incomplete or malformed solutions
   - Validate that solutions have proper final answers
   - Ensure mathematical notation is correct

Guided by the improvement plan above, the model generated a more complex and robust code version that covers a broader scope and includes more challenging questions\. The improved code corresponding to this optimization is shown below\.

Listing 1: Corresponding Code to Improve Data Quality

```python
import os
import re
import json
import random
import hashlib
from typing import List, Dict, Tuple
from services.library import api_generate_batch, get_available_api_calls

OUT_PATH = "../submission/submission.json"
API_BATCH = 80
TARGET_N = 2000
MAX_API_CALLS_THIS_ITER = 5000

INSTRUCTION = (
    "Please provide a clear and step-by-step solution for a scientific problem in the categories of Chemistry, Physics, or Mathematics. "
    "The problem will specify the unit of measurement, which should not be included in the answer. "
    "Express the final answer as a decimal number with three digits after the decimal point. "
    "Conclude the answer by stating \"The answer is therefore \\\\boxed{[ANSWER]}.\""
)

BOX_RE = re.compile(r"\\boxed\{([^}]*)\}")
THREE_DEC_RE = re.compile(r"^-?\d+\.\d{3}$")
UNIT_SENT_RE = re.compile(r"The unit of the answer is\s+[^.]+\.", re.IGNORECASE)

def extract_last_boxed(text: str):
    boxes = BOX_RE.findall(text or "")
    return boxes[-1].strip() if boxes else None

def required_final_sentence(boxed_value: str) -> str:
    return f"The answer is therefore \\\\boxed{{{boxed_value}}}."

def ends_with_required_sentence(text: str) -> bool:
    last = extract_last_boxed(text)
    if last is None:
        return False
    return text.rstrip().endswith(required_final_sentence(last))

def is_valid_item(item: Dict[str, str]) -> bool:
    if not isinstance(item, dict):
        return False
    for k in ("instruction", "input", "output"):
        if k not in item or not isinstance(item[k], str) or not item[k].strip():
            return False

    out = item["output"]
    last = extract_last_boxed(out)
    if last is None or not THREE_DEC_RE.match(last):
        return False
    if not ends_with_required_sentence(out):
        return False

    inp = item["input"]
    if "Question:" not in inp:
        return False
    if not UNIT_SENT_RE.search(inp):
        return False

    return True

def normalize_text(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "").strip())

def fingerprint_input(inp: str) -> str:
    return hashlib.md5(normalize_text(inp).encode("utf-8")).hexdigest()

def save_json(data: List[Dict[str, str]], path: str):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

def load_existing(path: str) -> List[Dict[str, str]]:
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        obj = json.load(f)
    return obj if isinstance(obj, list) else []

TOPICS = [
    ("Mathematics", "Calculus", "definite integral (substitution/parts)"),
    ("Mathematics", "Calculus", "improper integral/convergence"),
    ("Mathematics", "Series", "Taylor/Maclaurin approximation"),
    ("Mathematics", "Multivariable", "double integral/change of variables"),
    ("Mathematics", "Optimization", "Lagrange multipliers"),
    ("Mathematics", "Differential Equations", "first-order linear IVP with application"),
    ("Mathematics", "Differential Equations", "second-order ODE (mass-spring/damping)"),
    ("Mathematics", "Linear Algebra", "eigenvalues/eigenvectors numeric"),
    ("Mathematics", "Probability", "expectation/variance continuous RV"),
    ("Mathematics", "Statistics", "MLE/confidence interval numeric"),

    ("Physics", "Mechanics", "work-energy with friction/incline/spring"),
    ("Physics", "Mechanics", "momentum/impulse collision"),
    ("Physics", "Mechanics", "circular motion/banked curve"),
    ("Physics", "Electricity & Magnetism", "electric field/potential superposition"),
    ("Physics", "Circuits", "RC transient time to reach a voltage"),
    ("Physics", "Circuits", "DC circuit (equivalent resistance/current)"),
    ("Physics", "Waves", "standing waves/beat frequency"),
    ("Physics", "Thermodynamics", "ideal gas process (W, Q, U)"),
    ("Physics", "Optics", "thin lens/mirror imaging"),

    ("Chemistry", "Thermodynamics", "G and equilibrium constant"),
    ("Chemistry", "Gases", "van der Waals/compression factor"),
    ("Chemistry", "Equilibrium", "buffer pH/K_a, K_b"),
    ("Chemistry", "Kinetics", "rate laws/half-life"),
    ("Chemistry", "Electrochemistry", "Nernst equation cell potential"),
    ("Chemistry", "Solutions", "colligative properties (T_f/T_b)"),
]

UNITS_BY_DOMAIN = {
    "Mathematics": ["unitless", "s", "m", "kg", "Pa", "J"],
    "Physics": ["m/s", "m/s^2", "N", "J", "W", "C", "V", "Hz", "K", "Pa"],
    "Chemistry": ["kJ/mol", "J/mol", "mol/L", "atm", "Pa", "K", "V", "s", "g/mol"],
}

def build_generation_prompt(seed: int) -> str:
    rnd = random.Random(seed)
    domain, area, style = rnd.choice(TOPICS)
    difficulty = rnd.choice(["introductory", "intermediate", "advanced"])
    scenario = rnd.choice([
        "applied real-world scenario",
        "textbook-style theoretical scenario",
        "lab measurement scenario with given uncertainties ignored",
        "engineering estimation scenario",
    ])
    unit = rnd.choice(UNITS_BY_DOMAIN[domain])

    return f"""
Create ONE SciBench-style example.

Hard requirements:
- Domain: {domain}. Area: {area}. Style: {style}. Difficulty: {difficulty}.
- Scenario flavor: {scenario}.
- Write the QUESTION first, starting with exactly: "Question:"
- The question must be self-contained and include all constants needed.
- The question must include exactly one sentence: "The unit of the answer is {unit}."
  (If {unit} is not appropriate for your question, replace it with an appropriate unit and keep the same sentence pattern.)
- Then write the SOLUTION with a clear structure:
  1) Given/Find
  2) Key equations
  3) Step-by-step calculation (show intermediate numeric values)
  4) Final line exactly: The answer is therefore \\boxed{{X.XXX}}.
- Output must be concise: aim for ~12-25 lines; no filler.
- Final boxed value must be a decimal with exactly three digits; no units in the box.
- Ensure the last \\boxed{{...}} in the output is the final answer.

Return JSON only:
{{"input": "...", "output": "..."}}
""".strip()

def build_repair_prompt(item: Dict[str, str]) -> str:
    inp = item.get("input", "")
    out = item.get("output", "")
    return f"""
You will REPAIR a dataset example for SciBench numeric answering.

Keep the QUESTION text exactly as-is.
Rewrite the SOLUTION to be correct, concise, and fully computed.

Hard constraints for the rewritten solution:
- Use a clear structure (Given/Find, equations, calculations).
- Do not be overly verbose (target ~12-25 lines).
- The final answer must be a decimal with exactly three digits after the decimal point.
- Do NOT include units in the boxed answer.
- The last line MUST be exactly: The answer is therefore \\boxed{{X.XXX}}.
- Ensure the last \\boxed{{...}} is the final numeric answer.

Return JSON only:
{{"input": "<same question>", "output": "<new solution>"}}

QUESTION:
{inp}

ORIGINAL SOLUTION (may be wrong/malformed):
{out}
""".strip()

def call_teacher_json(prompts: List[str]) -> List[Dict]:
    assert len(prompts) == API_BATCH, f"api_generate_batch must be called with batch size {API_BATCH}"
    sys_prompt = "You are a meticulous STEM textbook solution writer. Follow constraints exactly and output valid JSON only."
    return api_generate_batch(prompts, sys_prompt=sys_prompt, json=True)

def parse_generated(objs: List) -> List[Dict[str, str]]:
    items = []
    for o in objs:
        if isinstance(o, dict) and "input" in o and "output" in o:
            items.append({"instruction": INSTRUCTION, "input": o["input"], "output": o["output"]})
    return items

def dedup_keep_order(items: List[Dict[str, str]]) -> List[Dict[str, str]]:
    seen = set()
    out = []
    for it in items:
        fp = fingerprint_input(it.get("input", ""))
        if fp in seen:
            continue
        seen.add(fp)
        out.append(it)
    return out

def main():
    random.seed(20260208)

    existing = load_existing(OUT_PATH)
    print(f"Loaded {len(existing)} raw items from {OUT_PATH}")

    valid = []
    invalid = []
    for it in existing:
        if is_valid_item(it):
            valid.append(it)
        else:
            if isinstance(it, dict) and "input" in it and "output" in it:
                invalid.append({"instruction": INSTRUCTION, "input": it.get("input", ""), "output": it.get("output", "")})

    valid = dedup_keep_order(valid)
    print(f"Valid after strict filtering + dedup: {len(valid)}; candidates for repair: {len(invalid)}")

    save_json(valid[:TARGET_N], OUT_PATH)

    api_calls_used = 0

    def deficit():
        return max(0, TARGET_N - len(valid))

    repair_pool = invalid[:]
    repair_idx = 0
    while deficit() > 0 and repair_idx < len(repair_pool):
        if api_calls_used >= MAX_API_CALLS_THIS_ITER:
            break
        if get_available_api_calls() <= 0:
            break

        batch_items = repair_pool[repair_idx:repair_idx + API_BATCH]
        if len(batch_items) < API_BATCH:
            break
        repair_idx += API_BATCH

        prompts = [build_repair_prompt(it) for it in batch_items]
        objs = call_teacher_json(prompts)
        api_calls_used += 1

        repaired = parse_generated(objs)
        repaired_valid = [it for it in repaired if is_valid_item(it)]
        if repaired_valid:
            valid.extend(repaired_valid)
            valid = dedup_keep_order(valid)

        if len(valid) % 200 < len(repaired_valid):
            save_json(valid[:TARGET_N], OUT_PATH)
            print(f"Checkpoint after repair: {min(len(valid), TARGET_N)}")

        print(f"Repair call {api_calls_used}: got {len(repaired)} parsed, {len(repaired_valid)} valid; total valid {len(valid)}")

    gen_seed_base = 900000
    gen_round = 0
    while deficit() > 0:
        if api_calls_used >= MAX_API_CALLS_THIS_ITER:
            break
        if get_available_api_calls() <= 0:
            break

        prompts = [build_generation_prompt(gen_seed_base + gen_round * API_BATCH + i) for i in range(API_BATCH)]
        objs = call_teacher_json(prompts)
        api_calls_used += 1
        gen_round += 1

        items = parse_generated(objs)
        items_valid = [it for it in items if is_valid_item(it)]
        if items_valid:
            valid.extend(items_valid)
            valid = dedup_keep_order(valid)

        if len(valid) % 200 < len(items_valid) or len(valid) >= TARGET_N:
            save_json(valid[:TARGET_N], OUT_PATH)
            print(f"Checkpoint after gen: {min(len(valid), TARGET_N)}")

        print(f"Gen call {api_calls_used}: parsed {len(items)}, valid {len(items_valid)}; total valid {len(valid)}")

    random.shuffle(valid)
    final = valid[:TARGET_N]
    save_json(final, OUT_PATH)
    print(f"Final saved: {len(final)} items to {OUT_PATH}. API calls used this iteration: {api_calls_used}")

if __name__ == "__main__":
    main()
```

## Appendix G Code Snippets for Failure Analysis

In this section, we provide the specific code implementations generated by the agents that correspond to the failure cases discussed in Section [4.3](https://arxiv.org/html/2605.30407#S4.SS3).

### G.1 Science Task: Partial Distribution Shift

Listing [2](https://arxiv.org/html/2605.30407#LST2) presents the `code.py` generated during the from-scratch iterative Science task.

Listing 2: Agent Code That Induced Data Distribution Shift

```python
import os
import re
import json
import math
import random
from typing import List, Dict, Any, Optional
from services.library import api_generate_batch, get_available_api_calls

BEST_PATH = "../submission/submission_best.json"
OUT_PATH = "../submission/submission.json"

INSTRUCTION = (
    "Please provide a clear and step-by-step solution for a scientific problem in the categories "
    "of Chemistry, Physics, or Mathematics. The problem will specify the unit of measurement, "
    "which should not be included in the answer. Express the final answer as a decimal number "
    "with three digits after the decimal point. Conclude the answer by stating "
    "\"The answer is therefore \\\\boxed{[ANSWER]}.\""
)

BATCH_SIZE = 65

BOX_RE = re.compile(r"\\+boxed\{([^}]*)\}")
FINAL_LINE_RE = re.compile(r"The answer is therefore\s*\\+boxed\{([^}]*)\}\.\s*$")

NUM_RE = re.compile(
    r"^[\s]*"
    r"(?P<sign>[+-]?)"
    r"(?P<num>(?:\d+(?:\.\d*)?|\.\d+))"
    r"(?:[eE](?P<exp>[+-]?\d+))?"
    r"[\s]*$"
)

UNIT_CUE_RE = re.compile(
    r"(unit\s+of\s+the\s+answer\s+is|units?\s*:|answer\s+should\s+be\s+in|the\s+unit\s+is|\bin\s+\$?\\?[a-zA-Z]+)",
    re.IGNORECASE,
)

BAD_TOKEN_RE = re.compile(
    r"(\\nrac|[^\\]rac\{|boxed\{\s*\}|@@|\?\?|nan|inf)", re.IGNORECASE
)

def load_json_list(path: str) -> List[Dict[str, Any]]:
    if not os.path.exists(path):
        return []
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        return data if isinstance(data, list) else []
    except Exception:
        return []

def save(path: str, data: List[Dict[str, Any]]) -> None:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

def parse_number(s: str) -> Optional[float]:
    s = s.strip()
    if not NUM_RE.match(s):
        return None
    try:
        x = float(s)
        if math.isnan(x) or math.isinf(x):
            return None
        return x
    except Exception:
        return None

def normalize_final_line(output: str) -> Optional[str]:
    out = output.rstrip()
    boxes = BOX_RE.findall(out)
    if not boxes:
        return None
    last_box_raw = boxes[-1].strip()
    val = parse_number(last_box_raw)
    if val is None:
        return None
    val_3 = f"{val:.3f}"
    lines = out.splitlines()
    if not lines:
        return None
    lines[-1] = f"The answer is therefore \\boxed{{{val_3}}}."
    return "\n".join(lines)

def validate_entry(inp: Any, out: Any) -> bool:
    if not isinstance(inp, str) or not isinstance(out, str):
        return False
    inp = inp.strip()
    out = out.strip()

    if not inp.startswith("Question:"):
        return False
    if not UNIT_CUE_RE.search(inp):
        return False

    if BAD_TOKEN_RE.search(out):
        return False

    if not FINAL_LINE_RE.search(out):
        return False

    boxes = BOX_RE.findall(out)
    if not boxes:
        return False
    if parse_number(boxes[-1]) is None:
        return False

    if len(out.splitlines()) < 6:
        return False

    return True

def sanitize_item(inp: Any, out: Any) -> Optional[Dict[str, str]]:
    if not isinstance(inp, str) or not isinstance(out, str):
        return None
    norm = normalize_final_line(out)
    if norm is None:
        return None
    inp_s = inp.strip()
    out_s = norm.strip()
    if not validate_entry(inp_s, out_s):
        return None
    return {"instruction": INSTRUCTION, "input": inp_s, "output": out_s}

SYS_PROMPT_TARGETED = (
    "You are generating instruction-tuning data for solving college-level scientific problems.\n"
    "Return ONLY valid JSON (no markdown, no extra text).\n"
    "Schema: {\"input\": string, \"output\": string}\n"
    "Hard rules:\n"
    "- input MUST start with exactly 'Question:'\n"
    "- input MUST explicitly state the unit of the final answer in a sentence: 'The unit of the answer is ....'\n"
    "- output MUST be a correct step-by-step solution with unit conversions shown.\n"
    "- output MUST have at least 6 lines.\n"
    "- output MUST end with EXACT last line: The answer is therefore \\\\boxed{NUMBER}.\n"
    "- NUMBER must be numeric only (no units) and prefer decimal with three digits.\n"
    "Focus heavily on these topics (rotate among them):\n"
    "1) Two-level system/Boltzmann population ratios with energies in cm^-1; use hc/kB=1.4388 cm*K.\n"
    "2) Photoelectric/ionization energy: wavelength in nm to eV; subtract electron KE from v.\n"
    "3) Coriolis deflection for projectile at given latitude; use Earth rotation Omega=7.292e-5 s^-1.\n"
    "4) Manometer pressure conversions (cmH2O, mmHg) and computing R from PV=nRT; avoid factor-1000 errors.\n"
    "5) Orbital mechanics energy changes between circular orbits; use mu=3.986e14 m^3/s^2, Re=6.371e6 m.\n"
    "Use standard constants when needed: g=9.81, kB=1.381e-23 J/K, h=6.626e-34 Js, c=2.998e8 m/s, "
    "e=1.602e-19 C, 1 eV=1.602e-19 J.\n"
)

SYS_PROMPT_GENERAL = (
    "You are generating instruction-tuning data for solving college-level scientific problems.\n"
    "Return ONLY valid JSON (no markdown, no extra text).\n"
    "Schema: {\"input\": string, \"output\": string}\n"
    "Rules:\n"
    "- input MUST start with exactly 'Question:'\n"
    "- input MUST explicitly state the unit of the final answer in a sentence: 'The unit of the answer is ....'\n"
    "- output MUST be a correct step-by-step solution.\n"
    "- output MUST have at least 6 lines.\n"
    "- output MUST end with EXACT last line: The answer is therefore \\\\boxed{NUMBER}.\n"
    "- NUMBER must be numeric only and prefer decimal with three digits.\n"
    "Cover a broad mix of undergraduate topics across calculus, ODEs, probability/statistics, mechanics, E&M, circuits, "
    "thermodynamics, equilibrium, kinetics, electrochemistry, optics.\n"
    "Use standard constants when needed: g=9.81, R=8.314, k=8.988e9, h=6.626e-34, c=2.998e8.\n"
)

def build_prompt(seed: int, mode: str) -> str:
    random.seed(seed)
    if mode == "targeted":
        focus = random.choice([
            "two-level Boltzmann population with energy separation in cm^-1",
            "photoelectric/ionization energy from wavelength and electron speed",
            "Coriolis deflection for a projectile fired due north/south at given latitude",
            "manometer pressure conversion and computing gas constant R from measurements",
            "orbital mechanics: energy required to move between circular orbits including synchronous orbit"
        ])
        return (
            "Generate ONE original, solvable, college-level quantitative problem and its correct step-by-step solution.\n"
            f"Topic MUST be: {focus}.\n"
            "Hard constraints:\n"
            "1) Return exactly one JSON object: {\"input\": ..., \"output\": ...}.\n"
            "2) input starts with 'Question:' and explicitly states the unit in: 'The unit of the answer is ....'\n"
            "3) output shows all key equations and unit conversions and has at least 6 lines.\n"
            "4) output ends with EXACT last line: The answer is therefore \\\\boxed{NUMBER}.\n"
            "5) NUMBER is numeric only, no units, and prefer three decimals.\n"
            f"Seed tag: {seed}"
        )
    else:
        return (
            "Generate ONE original, solvable, college-level scientific problem (Math/Physics/Chemistry) "
            "and its correct step-by-step solution.\n"
            "Ensure multi-step reasoning and intermediate computations; avoid trivial one-liners.\n"
            "Hard constraints:\n"
            "1) Return exactly one JSON object: {\"input\": ..., \"output\": ...}.\n"
            "2) input starts with 'Question:' and explicitly states the unit in: 'The unit of the answer is ....'\n"
            "3) output has at least 6 lines.\n"
            "4) output ends with EXACT last line: The answer is therefore \\\\boxed{NUMBER}.\n"
            "5) NUMBER is numeric only, no units, and prefer three decimals.\n"
            f"Seed tag: {seed}"
        )

def main(target_total: int = 2000, max_calls_cap: int = 5000) -> None:
    best = load_json_list(BEST_PATH)
    data: List[Dict[str, str]] = []
    seen_inputs = set()

    for d in best:
        if not isinstance(d, dict):
            continue
        item = sanitize_item(d.get("input"), d.get("output"))
        if item is None:
            continue
        if item["input"] in seen_inputs:
            continue
        data.append(item)
        seen_inputs.add(item["input"])

    save(OUT_PATH, data)

    if len(data) >= target_total:
        save(OUT_PATH, data[:target_total])
        return

    available_calls = get_available_api_calls()
    calls_budget = min(max_calls_cap, available_calls)

    seed_base = 20260129
    call_idx = 0

    while len(data) < target_total and call_idx < calls_budget:
        mode = "targeted" if (call_idx % 2 == 0) else "general"
        sys_prompt = SYS_PROMPT_TARGETED if mode == "targeted" else SYS_PROMPT_GENERAL

        prompts = [build_prompt(seed_base + call_idx * 1000 + i, mode=mode) for i in range(BATCH_SIZE)]
        responses = api_generate_batch(prompts, sys_prompt=sys_prompt, json=True)

        for resp in responses:
            if not isinstance(resp, dict):
                continue
            item = sanitize_item(resp.get("input"), resp.get("output"))
            if item is None:
                continue
            if item["input"] in seen_inputs:
                continue
            data.append(item)
            seen_inputs.add(item["input"])
            if len(data) >= target_total:
                break

        save(OUT_PATH, data)
        call_idx += 1

    save(OUT_PATH, data[:target_total])

if __name__ == "__main__":
    main(target_total=2000, max_calls_cap=5000)
```
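To see how the generation loop in Listing 2 may skew the topic mix, the following minimal sketch replays only its scheduling logic: even-numbered calls use the targeted prompt (five narrow topics) and odd-numbered calls use the general prompt, with 65 prompts per call. The function name `requested_prompt_share` and the choice of 32 calls are illustrative assumptions, and the share of samples that survives validation in an actual run may differ.

```python
from collections import Counter

BATCH_SIZE = 65  # prompts per call, as in Listing 2


def requested_prompt_share(num_calls: int) -> dict:
    """Fraction of requested prompts per mode under the alternating schedule."""
    counts = Counter()
    for call_idx in range(num_calls):
        # Same alternation rule as the listing's main loop.
        mode = "targeted" if call_idx % 2 == 0 else "general"
        counts[mode] += BATCH_SIZE
    total = sum(counts.values())
    return {mode: n / total for mode, n in counts.items()}


if __name__ == "__main__":
    print(requested_prompt_share(32))
```

Under this schedule roughly half of all requested prompts concentrate on five narrow topics, which is one plausible source of the partial distribution shift named in the section title.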

### G.2 Code Task: Naive Rule-based Augmentation

Listing [3](https://arxiv.org/html/2605.30407#LST3) presents the `code.py` generated during the from-seed iterative Code task.

Listing 3: Agent Code That Employed Naive Rule-based Augmentation

```python
import os
import json
import re
import random
import uuid
import argparse
import logging
import math
from typing import List, Dict, Any, Optional
from datetime import datetime
from collections import Counter

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - [DATA_PIPELINE] - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)
logger = logging.getLogger(__name__)

DEFAULT_TARGET_SIZE = 5000
PERTURBATION_RANGE = (-5, 5)
MUTATION_PROBABILITY = 0.8
MIN_VALUE_CLAMP = 1

NUMBER_PATTERN = re.compile(r'(?<![a-zA-Z_])\d+(?![a-zA-Z_])')

class DataAugmenter:
    def __init__(self, seed_path: str, output_path: str, target_size: int):
        self.seed_path = seed_path
        self.output_path = output_path
        self.target_size = target_size
        self.stats = Counter()
        self.generated_ids = set()

    def load_seeds(self) -> List[Dict[str, Any]]:
        if not os.path.exists(self.seed_path):
            logger.error(f"Seed file not found at {self.seed_path}")
            return []

        try:
            with open(self.seed_path, 'r', encoding='utf-8') as f:
                data = json.load(f)

            valid_seeds = [
                item for item in data
                if all(k in item for k in ('instruction', 'input', 'output'))
            ]
            logger.info(f"Loaded {len(valid_seeds)} valid seeds from {len(data)} total entries.")
            return valid_seeds
        except json.JSONDecodeError as e:
            logger.critical(f"Failed to parse seed JSON: {e}")
            return []

    def _stochastic_perturbation(self, text: str) -> str:
        def replace_match(match):
            if random.random() > MUTATION_PROBABILITY:
                return match.group()

            try:
                original_val = int(match.group())
                shift = random.randint(*PERTURBATION_RANGE)
                new_val = max(MIN_VALUE_CLAMP, original_val + shift)
                return str(new_val)
            except ValueError:
                return match.group()

        return NUMBER_PATTERN.sub(replace_match, text)

    def generate_variant(self, seed_item: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        try:
            new_input = self._stochastic_perturbation(seed_item['input'])

            new_output = self._stochastic_perturbation(seed_item['output'])

            variant = {
                "instruction": seed_item['instruction'],
                "input": new_input,
                "output": new_output,
                "meta": {
                    "origin": "augmented",
                    "parent_id": seed_item.get("id", "unknown"),
                    "aug_method": "regex_perturbation_v2"
                }
            }

            if new_input == seed_item['input'] and new_output == seed_item['output']:
                self.stats['skipped_no_change'] += 1
                return None

            return variant

        except Exception as e:
            logger.warning(f"Failed to generate variant: {e}")
            self.stats['errors'] += 1
            return None

    def validate_dataset(self, dataset: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        unique_data = []
        seen_hashes = set()

        for item in dataset:
            content_hash = hash(f"{item['input']}|{item['output']}")

            if content_hash in seen_hashes:
                self.stats['duplicates_removed'] += 1
                continue

            seen_hashes.add(content_hash)
            unique_data.append(item)

        return unique_data

    def run(self):
        start_time = datetime.now()
        logger.info(f"Starting pipeline at {start_time}")

        seeds = self.load_seeds()
        if not seeds:
            return

        dataset = []
        dataset.extend(seeds)
        logger.info(f"Initialized with {len(seeds)} seed samples.")

        pbar_interval = self.target_size // 10
        while len(dataset) < self.target_size:
            base_seed = random.choice(seeds)
            variant = self.generate_variant(base_seed)

            if variant:
                dataset.append(variant)
                self.stats['generated'] += 1

            current_count = len(dataset)
            if current_count % pbar_interval == 0 and current_count > 0:
                logger.info(f"Progress: {current_count}/{self.target_size} samples generated...")

        logger.info("Running final validation and deduplication...")
        final_dataset = self.validate_dataset(dataset)

        final_dataset = final_dataset[:self.target_size]

        self.save_data(final_dataset)

        duration = datetime.now() - start_time
        logger.info(f"Pipeline completed in {duration}.")
        logger.info(f"Statistics: {dict(self.stats)}")

    def save_data(self, data: List[Dict[str, Any]]):
        os.makedirs(os.path.dirname(self.output_path), exist_ok=True)
        with open(self.output_path, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        logger.info(f"Successfully saved {len(data)} entries to {self.output_path}")

def parse_args():
    parser = argparse.ArgumentParser(description="Rule-based Data Augmentation Tool")
    parser.add_argument('--seed_path', type=str, default='../data/seed/seed.json', help='Path to seed data')
    parser.add_argument('--output_path', type=str, default='../submission/submission.json', help='Path to save submission')
    parser.add_argument('--target_size', type=int, default=5000, help='Target dataset size')
```

167returnparser\.parse\_args\(\)

168

169if\_\_name\_\_=="\_\_main\_\_":

170args=parse\_args\(\)

171

172augmenter=DataAugmenter\(

173seed\_path=args\.seed\_path,

174output\_path=args\.output\_path,

175target\_size=args\.target\_size

176\)

177

178try:

179augmenter\.run\(\)

180exceptKeyboardInterrupt:

181logger\.info\("Pipelineinterruptedbyuser\.Savingpartialprogress\.\.\."\)

182pass

183exceptExceptionase:

184logger\.exception\("Fatalpipelineerror"\)

185exit\(1\)

## Appendix H Seed Examples

In this section, we present representative seed examples used in our experiments for the Science, Code, and Finance domains. These examples are extracted directly from the `seed.json` files of the respective datasets.

### H.1 Science Domain

The seed data for Sci-Bench primarily consists of complex scientific problems involving physics and chemistry calculations.

```json
[
  {
    "question": "Three identical metal spheres have the same diameter. Spheres 1 and 2 carry equal like charges Q, with separation much greater than their diameter, and experience force F. Sphere 3 is uncharged with an insulating handle. If sphere 3 touches sphere 1, then touches sphere 2, and is removed, what is the new interaction force between spheres 1 and 2?"
  },
  {
    "question": "In an isolated town of 5000 inhabitants, the spread of an epidemic is such that the rate of spread is jointly proportional to the number of infected and uninfected people. If 160 people are infected at the start and 1200 are infected after one week, how long does it take for 80% of the population (4000 people) to become infected?"
  },
  ...
]
```

### H.2 Code Domain

The seed data for the Code task includes algorithmic problems with problem descriptions, examples, and test inputs.

```json
[
  {
    "question_content": "Given n, a and d as the number of terms, first term and common difference respectively of an Arthimetic Series. Find the sum of the series upto nth term.\n\nExample 1:\nInput: 5 1 3\nOutput: 35\nExplanation: Series upto 5th term is\n1 4 7 10 13, so sum will be 35.\nExample 2:\nInput: 3 1 2\nOutput: 9\nExample: Series upto 3rd term is\n1 3 5, so sum will be 9.\n\nYour Task:\nYou don't need to read or print anything. Your task is to complete the function sum_of_ap() which takes n, a and d as input parameter and returns the sum of the series.\n\nExpected Time Complexity: O(1)\nExpected Space Complexity: O(1)\n\nConstranits:\n1 <= n, a, d <= 100",
    "test_input": "5 1 3"
  },
  {
    "question_content": "Let $f_{x} = c^{2x-6} \\cdot f_{x-1} \\cdot f_{x-2} \\cdot f_{x-3}$ for $x \\ge 4$.\n\nYou have given integers $n$, $f_{1}$, $f_{2}$, $f_{3}$, and $c$. Find $f_{n} \\bmod (10^{9}+7)$.\n\n\n-----Input-----\n\nThe only line contains five integers $n$, $f_{1}$, $f_{2}$, $f_{3}$, and $c$ ($4 \\le n \\le 10^{18}$, $1 \\le f_{1}$, $f_{2}$, $f_{3}$, $c \\le 10^{9}$).\n\n\n-----Output-----\n\nPrint $f_{n} \\bmod (10^{9} + 7)$.\n\n\n-----Examples-----\nInput\n5 1 2 5 3\n\nOutput\n72900\n\nInput\n17 97 41 37 11\n\nOutput\n317451037\n\n\n\n-----Note-----\n\nIn the first example, $f_{4} = 90$, $f_{5} = 72900$.\n\nIn the second example, $f_{17} \\approx 2.28 \\times 10^{29587}$.",
    "test_input": "5 1 2 5 3"
  },
  ...
]
```
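
To make the record format concrete, here is a minimal sketch of how the first seed's `test_input` could be fed to a candidate solution. The whitespace-separated parsing of `test_input` and the helper name `run_on_test_input` are assumptions made for illustration; only the function name `sum_of_ap` and the expected outputs (35 and 9) come from the problem statement itself.

```python
def sum_of_ap(n: int, a: int, d: int) -> int:
    """Closed-form sum of the first n terms of an arithmetic series."""
    # n * (2a + (n - 1)d) is always even, so integer division is exact.
    return n * (2 * a + (n - 1) * d) // 2


def run_on_test_input(test_input: str) -> int:
    """Hypothetical harness: parse 'n a d' from a seed's test_input field."""
    n, a, d = map(int, test_input.split())
    return sum_of_ap(n, a, d)


print(run_on_test_input("5 1 3"))  # 35, matching Example 1 in the seed
print(run_on_test_input("3 1 2"))  # 9, matching Example 2 in the seed
```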

### H.3 Finance Domain

The Finance-Reasoning seed data comprises specific financial scenarios (context) and quantitative questions requiring reasoning over that context.

```json
[
  {
    "question": "What is Alice's new adjusted monthly mortgage payment after the fixed-rate period for the remaining 10 years? Answer in dollars, rounded to the nearest cent.",
    "context": "Alice took a 15-year fixed-rate mortgage with a principal amount of $250,000 at an annual interest rate of 4.5%. After the fixed-rate period ended, the remaining principal balance was $150,000. Her mortgage transitioned to an adjustable-rate with the current index rate at 2% and a bank margin of 1.5%. She wants to calculate her new monthly payment for the remaining 10 years of the mortgage under these new terms, assuming there are no rate caps."
  },
  {
    "question": "What is the difference in the high and low prices of the common stock in the fourth quarter of 2019? Answer to two decimal places.",
    "context": "{\"2019: -- Fourth Quarter\": {\"High\": 11.44, \"Low\": 9.47}, \"2019: -- Third Quarter\": {\"High\": 14.96, \"Low\": 10.26}, \"2019: -- Second Quarter\": {\"High\": 20.91, \"Low\": 12.61}, \"2019: -- First Quarter\": {\"High\": 18.19, \"Low\": 8.87}, \"2018: -- Fourth Quarter\": {\"High\": 12.16, \"Low\": 7.43}, \"2018: -- Third Quarter\": {\"High\": 20.6, \"Low\": 10.95}, \"2018: -- Second Quarter\": {\"High\": 18.3, \"Low\": 6.7}, \"2018: -- First Quarter\": {\"High\": 7.35, \"Low\": 6.0}}"
  },
  ...
]
```
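
As a rough illustration of the reasoning these seeds call for, the sketch below works through both examples. It assumes the adjusted rate is the index rate plus the bank margin (2% + 1.5% = 3.5% per year), compounded monthly over the remaining 120 payments, and applies the standard level-payment amortization formula. Neither the formula nor a reference answer appears in the seed file, and the function name is illustrative.

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Level monthly payment for a fully amortizing loan."""
    r = annual_rate / 12   # monthly interest rate
    n = years * 12         # number of monthly payments
    return principal * r / (1 - (1 + r) ** -n)


# Assumption: adjusted rate = index rate + bank margin.
adjusted_rate = 0.02 + 0.015
payment = monthly_payment(150_000, adjusted_rate, 10)
print(f"{payment:.2f}")

# Second seed: high minus low price for the fourth quarter of 2019.
spread = round(11.44 - 9.47, 2)
print(spread)
```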

## Appendix I Prompt Templates

In this section, we present the core prompt templates used in our framework. To ensure reproducibility, we provide the full content of the system instructions and the specific action prompts used by the agents. Note that placeholders such as `{dataset_size}` are filled in dynamically at runtime.
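
A minimal sketch of how such placeholders might be filled at runtime follows. The `fill_template` helper and the example values are hypothetical; the paper does not describe the substitution mechanism. A regular expression is used here instead of `str.format` so that other braces in a prompt (for example, JSON snippets) are left untouched.

```python
import re


def fill_template(template: str, values: dict) -> str:
    """Replace {name} placeholders that have an entry in `values`."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Unknown placeholders are left as-is rather than raising an error.
        return str(values[name]) if name in values else match.group(0)

    return re.sub(r"\{(\w+)\}", substitute, template)


# Illustrative values only; the real limits are set by the benchmark.
prompt = "The submission should be around {dataset_size} entries. API limit: {api_limit}."
filled = fill_template(prompt, {"dataset_size": 5000, "api_limit": 1000})
print(filled)
```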

### I.1 System Description

The following system prompt is injected into the context of all agents (both One-Shot and Iterative) to define the AutoDataBench task environment, constraints, and evaluation criteria.

**System Prompt: Task Instruction**

You are participating data-driven model specialization, a new challenge designed to test your ability to act as an autonomous AI data synthesis scientist.

===== TASK =====

Your mission is to craft a high-quality, synthetic instruction-tuning dataset for a given task. Your formance is evaluated by how much a standard model's performance improves after being fine-tuned on the dataset you create. This task evaluates your core capabilities in understanding, synthesizing and distilling valuable training data from raw resources. Please read the instructions below, which are specific to this environment. These instructions can be found again in `../agent/utils/instructions.py`.

===== RESOURCES =====

All necessary resources for your data synthesis process are located in `../data/`. This includes:

- `../data/description.md`: Describes the target task, including the goal and example data format. [Description Content]
- `../data/seed/seed.json`: Necessary raw material from which you create instruction-tuning pairs, refine and expand this as needed. Provided only if required by the task. [Source Content]
- `../data/seed/few_shot.json`: A small set of high-quality examples to follow for style, complexity, and format. [Few-shot Content]
- `../data/sample_submission.json`: A template demonstrating the required JSON structure for your final submission instruction-tuning file. [Sample Submission]
- `../data/train_config.yaml`: Contains the model architecture and default hyperparameters for fine-tuning. [Train Config]

===== SUBMISSION =====

Your final output is a synthetic training dataset. You MUST produce this dataset at EXACTLY `../submission/submission.json`, strictly adhere to the format specified in `sample_submission.json`. The submission should be around {dataset_size} entries, regardless of the amount of seed data provided (if any). Too little synthetic data severely degrades model training performance.

===== ENVIRONMENT ======

- Teacher Models: API Access of {teacher_model}.
- Target Models: Standard models to be fine-tuned and then evaluated: {student_model}.
- GPU Avaliable: 160GB.
- Synthetic Data Constraint: We will filter your final submission to retain only entries matching the `sample_submission.csv` format with non-empty `output` fields, and use the first {dataset_size} entries for training.
- API Limit: You are limited to a total of {api_limit} calls to the API-based teacher models.
- Runtime Limit: Your entire process must complete within {max_hours} hours.

===== IMPORTANT NOTES ======

- You must save your final dataset at exactly the specified path: `../submission/submission.json`. Save regularly to prevent data loss.
- Your only task is to generate instruction-tuning dataset. Do not include any code for model training or evaluation.
- You are only allowed to generate instruction-tuning data by calling the provided tearcher models. Do not directly enumerate synthetic data in the code.
- You should synthesize dataset that is diverse, complex, and task-aligned. The given few-shot examples are for reference only in terms of format and quality.
- You should be mindful to stay within your allocated API call quota. Avoid using loops that only check the number of generated entries while ignoring the API calls limit.
- You must ALWAYS prioritize calling the helper functions (if provided) directly to perform any relevant task. You are strictly prohibited from re-implementing their logic or creating any similar functions.
- You are participating in this competition independently. Ensure that the code you generate is DIRECTLY executable, no dummy implementations or placeholders of any kind.
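
The synthetic-data constraint in this prompt (keep only well-formed entries with a non-empty `output`, then take the first `{dataset_size}` of them) could be sketched as follows. The required keys are an assumption based on the `instruction`/`input`/`output` fields used by the augmentation script listed earlier; the benchmark's actual filter is not shown in the paper.

```python
from typing import Any

REQUIRED_KEYS = ("instruction", "input", "output")  # assumed schema


def filter_submission(entries: list[dict[str, Any]], dataset_size: int) -> list[dict[str, Any]]:
    """Keep well-formed entries with a non-empty output, truncated to dataset_size."""
    kept = [
        e for e in entries
        if isinstance(e, dict)
        and all(k in e for k in REQUIRED_KEYS)
        and isinstance(e["output"], str)
        and e["output"].strip()
    ]
    return kept[:dataset_size]


raw = [
    {"instruction": "Add.", "input": "1 2", "output": "3"},
    {"instruction": "Add.", "input": "2 2", "output": ""},   # empty output: dropped
    {"instruction": "Add.", "input": "3 2"},                 # missing key: dropped
    {"instruction": "Add.", "input": "4 2", "output": "6"},
    {"instruction": "Add.", "input": "5 2", "output": "7"},
]
print(len(filter_submission(raw, dataset_size=2)))
```

Because truncation happens after filtering, invalid entries near the top of a submission do not use up the training budget, but a submission with too few valid entries ends up smaller than `{dataset_size}`.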

### I.2 One-Shot Agent

The One-Shot Agent receives a single comprehensive prompt asking for a plan and the execution code. It does not receive feedback from the execution environment unless a retry is triggered by a crash.

**One-Shot Agent: Generation Prompt**

Your response should include a brief plan for the data synthesis, followed by a single markdown code block that implements this plan and generates the final synthetic data. Conduct a concise analysis of the given information, and then wrap the plan and code separately in Markdown Code Blocks.

Example Response:

[... Necessary Analysis ...]

Here is the plan: [... Brief Plan ...]

Here is the code: [... Implemented Code ...]
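
One way a harness might split such a response into its plan and code is sketched below. This is an assumption about the parsing step, not the authors' implementation: it treats the first triple-backtick block as the plan and the last one as the code.

```python
import re

TICKS = "`" * 3  # built at runtime so this listing contains no literal fence
FENCE = re.compile(TICKS + r"[\w+-]*\n(.*?)" + TICKS, re.DOTALL)


def split_plan_and_code(response: str) -> tuple[str, str]:
    """Return (plan, code) from a response containing two or more fenced blocks."""
    blocks = [b.strip() for b in FENCE.findall(response)]
    if len(blocks) < 2:
        raise ValueError("expected separate fenced blocks for plan and code")
    return blocks[0], blocks[-1]


# Hypothetical agent response used only to exercise the parser.
response = (
    "The seeds are short numeric problems.\n"
    "Here is the plan:\n"
    f"{TICKS}\n1. Read seeds\n2. Query the teacher\n{TICKS}\n"
    "Here is the code:\n"
    f"{TICKS}python\nprint('hello')\n{TICKS}\n"
)
plan, code = split_plan_and_code(response)
print(plan)
print(code)
```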

### I.3 Iterative Agent

The Iterative Agent operates in a loop. Depending on the state of the previous iteration (Success, Execution Error, Submission Error, or Improvement Opportunity), it receives different prompts.

**Iterative Agent: Draft (Initial Generation)**

===== CURRENT STATUS =====

- Remaining Time of All Iterations: {remaining_hours} hours
- Remaining API Calls of All Iterations: {remaining_calls}
- API Calls Limit For This Iteration: {SESSION_API_LIMIT}

Keep these constraints in mind when planning your next action.

Propose a brief plan and implementation code for synthesizing a high-quality dataset. Conduct a concise analysis of the given information, and then wrap the plan and code separately in ``` blocks.
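
The status block exposes a per-iteration API budget alongside the global one. A generation loop that respects it, rather than looping only on dataset size (which the system prompt warns against), might look like the following sketch. `call_teacher` is a stand-in for the real teacher-model API, and all names and numbers are hypothetical.

```python
from typing import Callable, Optional


def generate_with_budget(
    call_teacher: Callable[[str], Optional[dict]],
    prompts: list[str],
    target_size: int,
    session_api_limit: int,
) -> tuple[list[dict], int]:
    """Collect up to target_size samples without exceeding the call budget."""
    dataset: list[dict] = []
    calls_used = 0
    for prompt in prompts:
        # Stop on either condition: enough data OR budget exhausted.
        if len(dataset) >= target_size or calls_used >= session_api_limit:
            break
        calls_used += 1  # count every call, including ones that fail
        sample = call_teacher(prompt)
        if sample and sample.get("output"):
            dataset.append(sample)
    return dataset, calls_used


# Usage with a fake teacher that returns nothing on every third prompt.
def fake_teacher(prompt: str) -> Optional[dict]:
    n = int(prompt.split("#")[1])
    if n % 3 == 0:
        return None
    return {"instruction": prompt, "input": "", "output": f"answer {n}"}


data, used = generate_with_budget(
    fake_teacher, [f"task #{i}" for i in range(100)], target_size=50, session_api_limit=10
)
print(len(data), used)
```

Counting failed calls against the budget is the conservative choice: a loop that only counts successful samples can exhaust the quota without ever reaching the target size.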

**Iterative Agent: Debug (Execution Failure)**

The data generation process failed during execution. This is debug attempt {debug_attempts}.

Failed Code: {current_code}

Error Message: {last_error}

Conduct a concise analysis of the given information, and then wrap the correction plan and the code in separate ``` blocks.

**Iterative Agent: Repair (Invalid Submission Format)**

Your previous attempt resulted in an invalid submission file located at `../submission/submission.json`. Your task is to resolve the issue.

Here is your original plan: {current_plan}

Here is your original code: {current_code}

Here is the submission error details: {last_error}

You can either repair the existing file or regenerate the data. Conduct a concise analysis… [Instructions on format]

**Iterative Agent: Improve (Optimization on Success)**

Your current best solution achieved a metric of {best_metric}. Your task is to improve the dataset. Make full use of the remaining API calls. [Optional: Performance of the base model on this test set: {score}]

Here is your original plan: {best_plan}

Here is your original code: {best_code}

Here are the submission details: [Sample of submission file]

The model trained on that data failed on these cases: {bad_case_sample}

Analyze these failures to identify model weaknesses and generate more targeted data. You can either improve the existing data quality or regenerate the data.
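
Taken together, the Draft, Debug, Repair, and Improve prompts amount to a small state machine. A hedged sketch of the dispatch logic follows; the returned labels mirror the prompt titles, but the `IterationResult` fields and the `next_prompt_kind` function are hypothetical and not taken from the released code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IterationResult:
    executed: bool                  # did the generated script run without crashing?
    submission_valid: bool          # did submission.json pass the format checks?
    metric: Optional[float] = None  # post-training score, if training ran


def next_prompt_kind(prev: Optional[IterationResult]) -> str:
    """Choose which prompt template to send based on the previous iteration."""
    if prev is None:
        return "draft"      # no history yet: initial generation
    if not prev.executed:
        return "debug"      # execution failure
    if not prev.submission_valid:
        return "repair"     # invalid submission format
    return "improve"        # valid run: optimize using failure cases


print(next_prompt_kind(None))
print(next_prompt_kind(IterationResult(executed=False, submission_valid=False)))
print(next_prompt_kind(IterationResult(executed=True, submission_valid=False)))
print(next_prompt_kind(IterationResult(executed=True, submission_valid=True, metric=0.42)))
```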
