@neural_avb: https://x.com/neural_avb/status/2072294078805684613

X AI KOLs Timeline 07/01/26, 12:19 PM Papers

synthetic-data data-generation agentic-loop reinforcement-learning grpo reasoning small-language-models

Summary

This paper introduces Autodata, a method that uses an agentic 'data scientist' AI to automate the creation of high-quality synthetic datasets through iterative generation, verification, and refinement, specifically optimized for reinforcement learning (GRPO) to improve reasoning in language models.

https://t.co/ExTshiJEae

Cached at: 07/01/26, 08:13 PM

AutoData: Synthetic data generation explained

This new paper, Autodata, tries to automate the **creation of high-quality synthetic datasets** using an agentic “data scientist” rather than a single prompt or one-shot generator.

Data quality is the SINGLE MOST important thing when training small language models (and often big ones too). We need data that is useful, correct, diverse, and aligned with the model’s downstream goal.

I have been working on a local-first synthetic data-gen library inspired by this Hugging Face article, so I was super excited to see such work being done at scale.

This article was co-authored by me (AVB) and Claude Sonnet 5 inside the Paper Breakdown harness.

“Synthetic data” is cheap

but “good synthetic data” is not

Now generating synthetic data is actually pretty easy. You can just prompt an LLM to “write 10,000 examples” about a topic list.

As you can imagine, this usually sucks.

Examples can be incorrect, too easy, repetitive, or exploitable by shortcuts.
Quality often requires iteration: proposing candidates, checking them, revising prompts, and selecting only the good items, i.e., the kind of workflow humans follow.

Autodata defines a general method where an LLM agent is responsible for producing a dataset that optimizes downstream performance:

Input: a target task / domain, plus a budget of inference-time compute (how many tokens you can afford).
Process: an agentic loop that generates candidate items, challenges or verifies them, and selects/refines until the dataset is high quality. (discussed in detail later)
Output: a synthetic dataset (training and/or evaluation) that leads to good model performance.

What they want

The paper’s ultimate goal is to make LLMs **better reasoners**: for example, better at CS research questions, legal analysis, and math/scientific reasoning. The chosen training method for turning “good data” into “a better model” is reinforcement learning, specifically GRPO (Group Relative Policy Optimization), not supervised fine-tuning (SFT).

This matters because SFT and RL have very different appetites for data:

SFT just needs (input, correct output) pairs — it doesn’t care about difficulty calibration, because it’s directly imitating the target.
RL (GRPO) doesn’t work that way. The model being trained generates multiple rollouts (attempts) per question. Each rollout gets a reward (via the rubric or verifier). The training signal (the advantage) comes from comparing rollouts against each other within the same group, so a question is only useful if its rollouts don’t all earn the same reward. That is why difficulty calibration matters for RL data.
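To make the group comparison concrete, here is a minimal sketch of the advantage computation as GRPO is commonly formulated: subtract the group mean reward and divide by the group standard deviation. The function name, the `eps` term, and the use of the population standard deviation are my assumptions for illustration; the article does not say which exact normalization the paper uses.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each rollout relative to the other rollouts for the same question.

    Assumption: mean/std normalization within the group; `eps` only guards
    against division by zero when every reward is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A question where rollouts disagree produces non-zero advantages...
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# ...while a question every rollout fails produces none at all.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In `mixed`, the successful rollouts get positive advantages and the failed ones negative. In `all_fail`, every advantage is exactly zero, so that question contributes nothing to the update.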

The Algorithm

The paper presents the algorithm at two levels: a general template (Autodata) and a specific practical instantiation they actually run experiments on (Agentic Self-Instruct). Let’s walk through both, since the specific algorithm is really just a concrete filling-in of the general loop.

1. The General Autodata Loop

At the highest level, Autodata is an iterative loop with three ingredients:

Data Creation: An agent grounds itself on source material (documents, code, legal texts, math objects, etc.), and uses tools + inference-time compute to generate candidate training or evaluation examples.
Data Analysis: The agent inspects what it just created: Is each example correct? Challenging enough? High quality? At the dataset level: Are examples diverse? Would they actually improve a model if trained on?
Iterate / Stop: Learnings from analysis feed back into the next round of creation. The agent repeats this create → analyze → refine cycle until a stopping criterion is met (e.g., quality threshold reached), at which point it emits a final dataset. Guardrails are built into the outer loop specifically to prevent the agent from “hacking” its own quality checks.
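The create → analyze → refine cycle can be sketched as a small budgeted loop. Everything here is hypothetical scaffolding (the `create` and `analyze` callables, the token accounting, the toy functions); it only illustrates the control flow, not the paper’s implementation.

```python
from typing import Callable, List, Tuple


def autodata_loop(
    create: Callable[[List[str]], Tuple[str, int]],
    analyze: Callable[[str, List[str]], Tuple[bool, str]],
    token_budget: int,
    target_size: int,
) -> List[str]:
    """Generate examples until the dataset is full or the budget is spent."""
    dataset: List[str] = []
    learnings: List[str] = []  # feedback carried into the next creation round
    spent = 0
    while len(dataset) < target_size and spent < token_budget:
        example, cost = create(learnings)      # data creation
        spent += cost
        ok, note = analyze(example, dataset)   # example- and dataset-level checks
        if ok:
            dataset.append(example)
        else:
            learnings.append(note)             # refine on the next iteration
    return dataset


# Toy stand-ins (assumptions): the creator changes its output as feedback grows.
def toy_create(learnings: List[str]) -> Tuple[str, int]:
    return f"question-v{len(learnings)}", 10


def toy_analyze(example: str, dataset: List[str]) -> Tuple[bool, str]:
    if example.endswith("v0"):
        return False, "too easy"
    if example in dataset:
        return False, "duplicate of an accepted example"
    return True, ""


result = autodata_loop(toy_create, toy_analyze, token_budget=100, target_size=2)
```

The toy run rejects the first attempt as too easy, accepts the second, rejects a duplicate, and then accepts a new variant, which mirrors the example-level and dataset-level checks described above.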

There’s also an outer meta-optimization loop (which I’ll cover separately) that can tune the agent itself to become a better data scientist over time.

2. The Concrete Algorithm: Agentic Self-Instruct

This is the specific implementation used in their experiments. Instead of one monolithic agent, the main orchestrator agent delegates to four LLM subagents: Challenger, Weak solver, Strong solver, and Judge.

In their experiments, they use Kimi-K2.6 as both the Challenger and the Judge, Qwen3.5-4B as the weak solver, and Qwen3.5-397B-A17B as the strong solver. The generated data is then used to improve the weak solver (Qwen3.5-4B) with GRPO fine-tuning.

Here is the step-by-step algorithm:

Grab a **source corpus**: raw, unstructured data grounded in the target domain, such as research papers, legal documents, or medical articles.
Propose: The main agent sends its current prompt (with grounding context) to the Challenger, which produces a candidate example. For example, if the input documents are CS papers, the Challenger reads the entire paper and generates a list of tasks from it.
Stress-test: The main agent sends this example to both the weak solver and the strong solver.
Judge: The Judge evaluates the solver outputs as well as the example’s own quality (Is the question well-posed? Is the reference answer/rubric correct?) and assigns a reward/verdict.
Acceptance check: this is where the algorithm branches based on task type.

**Verifiable tasks**: Accept the example if the majority vote of the strong solver is correct while the majority vote of the weak solver is wrong. This establishes a genuine difficulty gap and is strong evidence that the reference answer is correct.
**Non-verifiable tasks**: Accept if the judge-measured quality gap shows the task is neither too easy nor too hard for the weak solver, while the strong solver’s success helps confirm overall correctness (via rubrics generated by the Challenger).

Refine or Accept: If the criterion is not met, the main agent doesn’t discard everything. Instead, it modifies the prompt it sends to the Challenger using the learnings from the Judge’s report and loops back to the Propose step. If the criterion is met, the example is accepted into the dataset.

This closed loop lets the system “learn how to create challenging and high-quality examples” specifically targeted at training the weak solver, essentially manufacturing a curriculum of examples sitting right at the boundary of the weak model’s capability.
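Here is a rough sketch of the propose → stress-test → accept/refine steps for the verifiable case. The subagents are plain Python callables standing in for LLM calls, and the function names, `n_samples`, `max_rounds`, and the feedback strings are all my assumptions. The Judge is reduced to an exact-match comparison against the reference answer, which only makes sense for verifiable tasks.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple


def majority_vote(answers: List[str]) -> str:
    """Most frequent answer across repeated samples."""
    return Counter(answers).most_common(1)[0][0]


def accept_verifiable(strong: List[str], weak: List[str], reference: str) -> bool:
    """Accept only if the strong solver is right and the weak solver is wrong."""
    return majority_vote(strong) == reference and majority_vote(weak) != reference


def agentic_self_instruct(
    challenger: Callable[[str, str], Tuple[str, str]],
    weak_solver: Callable[[str], str],
    strong_solver: Callable[[str], str],
    document: str,
    n_samples: int = 5,   # assumption: odd, to avoid tied votes
    max_rounds: int = 4,  # assumption: retry cap per document
) -> Optional[Dict[str, str]]:
    feedback = ""
    for _ in range(max_rounds):
        question, reference = challenger(document, feedback)        # propose
        weak = [weak_solver(question) for _ in range(n_samples)]    # stress-test
        strong = [strong_solver(question) for _ in range(n_samples)]
        if accept_verifiable(strong, weak, reference):              # acceptance check
            return {"question": question, "reference": reference}
        # Refine: turn the failure into feedback for the next proposal.
        if majority_vote(strong) != reference:
            feedback = "strong solver failed: question may be ill-posed or too hard"
        else:
            feedback = "weak solver succeeded: question is too easy"
    return None


# Toy subagents (assumptions) to show the control flow end to end.
def toy_challenger(document: str, feedback: str) -> Tuple[str, str]:
    if "too easy" in feedback:
        return "hard question", "42"
    return "easy question", "4"


def toy_weak(question: str) -> str:
    return "4" if question == "easy question" else "7"


def toy_strong(question: str) -> str:
    return "4" if question == "easy question" else "42"


item = agentic_self_instruct(toy_challenger, toy_weak, toy_strong, "paper text")
```

In the toy run, the first proposal is rejected because both solvers get it right, and the feedback pushes the challenger to a harder question that only the strong solver answers correctly.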

A subtle but important detail: the weak and strong solver can be the exact same underlying LLM, just operated in different “modes” - e.g., the strong version gets more inference-time compute, extra scaffolding/aggregation, or access to privileged information the weak version doesn’t see.

How quality is enforced

This is the task of the judge module.

Correctness / leakage check: The Judge/Verifier checks whether the context+question pair leaks the answer - i.e., could someone construct the answer just by paraphrasing the context, without real reasoning? If so, it’s rejected.
Reasoning vs. recall check: The verifier explicitly flags questions that only test recall (“what,” “which,” “how many”) instead of reasoning (“why,” “what-if,” “predict,” “decide”).
Rubric quality check: For non-verifiable domains (like legal reasoning or CS research questions), rubrics must have a strict structure - e.g., 10-15 total criteria, at least 4 positive/3 negative, each criterion must require reasoning “beyond the context,” not vague style complaints.
The weak/strong gap criterion (the core mechanism): the example is only accepted if the weak solver’s average score is capped (≤ 65%, with no individual runs scoring high), the strong solver clears a floor without saturating (≥ 60% but < 95%), and the gap between them is ≥ 20%.
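The gap criterion above maps to a small predicate over per-run scores. This is a sketch using the thresholds quoted in this article. The one number the article does not give is what counts as an individual weak run “scoring high”, so `weak_run_max` below is a placeholder I made up, not a value from the paper.

```python
from statistics import mean
from typing import Sequence


def passes_gap_criterion(
    weak_scores: Sequence[float],
    strong_scores: Sequence[float],
    weak_cap: float = 0.65,        # weak average must be <= 65%
    strong_floor: float = 0.60,    # strong average must be >= 60%...
    strong_ceiling: float = 0.95,  # ...but < 95% (not saturated)
    min_gap: float = 0.20,         # strong - weak must be >= 20%
    weak_run_max: float = 0.80,    # ASSUMPTION: placeholder for "no run scoring high"
) -> bool:
    """Return True if the example sits in the useful difficulty band."""
    weak_avg = mean(weak_scores)
    strong_avg = mean(strong_scores)
    return (
        weak_avg <= weak_cap
        and max(weak_scores) < weak_run_max
        and strong_floor <= strong_avg < strong_ceiling
        and strong_avg - weak_avg >= min_gap
    )


in_band = passes_gap_criterion([0.30, 0.40, 0.35], [0.70, 0.80, 0.75])
saturated = passes_gap_criterion([0.30, 0.30], [0.97, 0.99])
too_close = passes_gap_criterion([0.60, 0.60], [0.70, 0.70])
```

Only the first example is accepted. The second fails because the strong solver is saturated, and the third because the gap is under 20%.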

**Tying this back to RL and GRPO:**

Recall that GRPO relies on diverse advantages within a group to actually learn. If all generations in a group get zero reward (too difficult) or full reward (too easy), the model simply won’t learn because there is no discriminatory signal (i.e., all advantages within the group are 0). The weak-solver/strong-solver gap check happens at data-generation time, as a proxy for “will this example actually produce a useful gradient signal once we run GRPO on it?”

It’s essentially a cheap simulation of RL training dynamics. Instead of waiting to run full RL and discovering after the fact that a batch of questions was degenerate (all-zero or all-hundred rewards), Autodata pre-screens each question by running the weak and strong solvers on it and checking the variance/gap before committing it to the training set.
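A quick back-of-the-envelope calculation shows why pre-screening on pass rate pays off. Under two simplifying assumptions of mine (rewards are binary, and rollouts are independent with a fixed pass rate p), a group of G rollouts is degenerate with probability p^G + (1 - p)^G. The group size of 8 used below is also just an illustrative choice.

```python
def degenerate_group_probability(pass_rate: float, group_size: int) -> float:
    """Chance that every rollout in a group gets the same binary reward.

    Assumes independent rollouts with a fixed per-rollout pass rate.
    """
    all_pass = pass_rate ** group_size
    all_fail = (1.0 - pass_rate) ** group_size
    return all_pass + all_fail


balanced = degenerate_group_probability(0.5, 8)    # question at the capability boundary
too_easy = degenerate_group_probability(0.98, 8)   # question the model almost always solves
```

With a 50% pass rate, only 2/256 (under 1%) of groups are wasted. With a 98% pass rate, roughly 85% of groups carry no learning signal at all.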

What about mode collapse / diversity?

This is a real risk with any self-instruct-style pipeline (an LLM generating its own training data can converge to repetitive templates). The paper addresses this in a few ways, though it’s worth being honest that it’s not treated as heavily as correctness:

Explicit “new angle” instruction: On every refinement round, the Challenger is told to generate an “ENTIRELY NEW question from a DIFFERENT angle that requires deeper reasoning” rather than a rephrasing of the rejected one.
Grounding diversity via large, varied source corpora: Rather than relying on the LLM to invent diverse content from nothing, Autodata grounds each generation in a distinct real-world document: a different CS paper, a different legal document, etc. This externally injects diversity rather than depending on the LLM’s own creativity.

Every single example is anchored to a real source document: a CS paper, a legal document, etc. The Challenger explicitly reads the source file (e.g., “read the paper from ./paper.txt directly”) before generating anything.

Meta Optimization and Auto-Research

Up to now, everything we discussed (Challenger → solvers → Judge → feedback → retry) is the inner loop: fixed prompts, fixed strategy, generating one data point at a time. Meta-optimization adds an outer loop that treats the agent’s own prompts and strategy as the thing being optimized, using the same success criteria that judge individual data points (weak/strong solver gap, etc.) as the fitness signal for improving the harness itself.

They use an evolutionary optimization framework that treats the agent’s scaffold (its system prompts, decision logic) as code to be iteratively mutated and selected.

Basically they added an extra layer of auto-research on top of the data generation task.

The meta-optimizer maintains a population of candidate prompts, each represented as a code diff relative to a baseline repository. Each iteration does the following:

Select a parent from the population via Boltzmann sampling: candidate c is chosen with probability proportional to the exponential of its quality score (a softmax over scores). This strongly favors high-scoring candidates while still allowing exploration of weaker ones (so the search doesn’t get stuck at a local optimum).
Evaluate the parent’s prompt by running it on a minibatch of training papers. This means actually running the full inner Agentic Self-Instruct loop and collecting the agent trajectories plus weak/strong solver scores!
Diagnose failures: an LLM agent (the “analyzer”) reads the full solver exchanges from those trajectories and writes a root-cause analysis of systematic failure patterns. E.g., “why do these questions keep failing to discriminate?”
Mutate the prompt: a code-editing agent (the “implementer”) reads that analysis, the iteration history, and the current prompt, and produces an **improved diff**, i.e., a new candidate prompt.
Re-evaluate both the parent and the new mutant on held-out validation papers (never seen during that iteration’s training minibatch).
Selection: the mutant is only added to the population if its validation score strictly exceeds its parent’s score; otherwise it’s discarded.
Log the outcome into a history log that future analyzer calls can read, so learnings accumulate across iterations.

Multiple such iterations run concurrently with independent parent selections, effectively parallelizing the search across the population.
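The two selection mechanics in this loop, Boltzmann parent sampling and strict-improvement acceptance, are easy to sketch. The temperature value, the function names, and the dict-of-scores population are my assumptions; the evaluate, diagnose, and mutate steps are LLM calls and are left out.

```python
import math
import random
from typing import Dict, List, Optional


def boltzmann_probs(scores: List[float], temperature: float = 0.1) -> List[float]:
    """Softmax over scores; lower temperature means greedier selection."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def select_parent(
    population: Dict[str, float],
    temperature: float = 0.1,  # assumption: the article gives no value
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one candidate name with Boltzmann probabilities."""
    rng = rng or random.Random()
    names = list(population)
    probs = boltzmann_probs([population[n] for n in names], temperature)
    return rng.choices(names, weights=probs, k=1)[0]


def maybe_add_mutant(
    population: Dict[str, float], name: str, mutant_score: float, parent_score: float
) -> bool:
    """Keep the mutant only if it strictly beats its parent on validation."""
    if mutant_score > parent_score:
        population[name] = mutant_score
        return True
    return False


population = {"baseline": 0.30, "diff-a": 0.50, "diff-b": 0.60}
probs = boltzmann_probs(list(population.values()))
parent = select_parent(population, rng=random.Random(0))
kept = maybe_add_mutant(population, "diff-c", mutant_score=0.60, parent_score=0.60)
```

Here `kept` is `False` because a tie does not count as an improvement, and `probs` puts most of the mass on `diff-b` while leaving the other candidates a non-zero chance.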

Evaluating synthetic data quality

While the gap between weak solver and strong solver is a good proxy, the main evidence for “does better data actually produce a better model” is to actually train the model. The general recipe:

Take 1.3k-2.8k generated examples (depending on domain) from each data source (CoT Self-Instruct vs. Agentic Self-Instruct).
Train Qwen3.5-4B with GRPO (batch size 16, learning rate 1e-6) on each dataset separately.
Evaluate the resulting models on held-out test sets.

For example, on Computer Science tasks,

Take over 10,000 CS papers from the S2ORC corpus (2022+)
Pick your models:
Main orchestrator + Challenger: Kimi-K2.6
Strong solver: Qwen3.5-397B-A17B
Weak solver: Qwen3.5-4B (this is the model that will eventually be RL-trained on the accepted data)

The Challenger’s job is to read the paper, then produce (1) a question type label, (2) 2-3 reasoning-skill tags, (3) a context that situates the solver without leaking the answer, (4) a question testing deep reasoning rather than recall, (5) a reference answer, and (6) a weighted rubric with 10-15 criteria.
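One way to picture the Challenger’s six-part output is as a small record with structural checks. The class and field names below are hypothetical, and treating “positive” and “negative” criteria as the sign of the weight is my interpretation. The 4 positive / 3 negative minimum is the one quoted in the rubric quality check earlier in this article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricCriterion:
    description: str
    weight: float  # assumption: > 0 rewards a behavior, < 0 penalizes one


@dataclass
class ChallengerExample:
    question_type: str
    skill_tags: List[str]
    context: str
    question: str
    reference_answer: str
    rubric: List[RubricCriterion]

    def validate(self) -> List[str]:
        """Return a list of structural problems (empty means well-formed)."""
        problems = []
        if not 2 <= len(self.skill_tags) <= 3:
            problems.append("expected 2-3 reasoning-skill tags")
        if not 10 <= len(self.rubric) <= 15:
            problems.append("expected 10-15 rubric criteria")
        positives = sum(1 for c in self.rubric if c.weight > 0)
        negatives = sum(1 for c in self.rubric if c.weight < 0)
        if positives < 4 or negatives < 3:
            problems.append("expected at least 4 positive and 3 negative criteria")
        return problems


# Made-up example: 6 positive and 4 negative criteria, 2 skill tags.
criteria = [RubricCriterion(f"criterion {i}", 1.0 if i < 6 else -1.0) for i in range(10)]
example = ChallengerExample(
    question_type="why",
    skill_tags=["causal reasoning", "counterfactual"],
    context="Background that situates the solver without giving away the answer.",
    question="Why would the method fail if the verifier were removed?",
    reference_answer="A reference answer written by the Challenger.",
    rubric=criteria,
)
issues = example.validate()
```

These checks only cover the mechanical part. Whether the context leaks the answer, or whether the question tests reasoning rather than recall, still needs the Judge.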

Dataset is generated! They only accept training examples where the strong solver average is ≥ 0.65 and the weak solver average is < 0.5.
After the dataset generation finishes, they train Qwen3.5-4B with GRPO (batch size 16, learning rate 1e-6) on 1.3k examples, holding out 100 examples per source as a test set, and evaluate the trained models on both held-out test sets.
For comparison, they generate a separate dataset with CoT-style prompting (the baseline datagen method, which lacks the reflection loop proposed in Autodata). The model trained on the Autodata (agentic) dataset wins!

That’s it! They also tested on other domains (other than CS), and all their prompts are present in the appendix section of their paper.

Read the paper here for more details: http://arxiv.org/abs/2606.25996

Or read it on Paper Breakdown: https://paperbreakdown.com/abs/2606.25996
