EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

arXiv cs.AI 06/03/26, 04:00 AM Papers
Summary
EvoTrainer is an autonomous training framework that co-evolves LLM policies and training harnesses through empirical feedback, matching or exceeding human-engineered RL baselines on mathematical reasoning, code generation, and long-horizon software engineering tasks.
arXiv:2606.03108v1 Announce Type: new
Cached at: 06/03/26, 09:43 AM
# EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning
Source: [https://arxiv.org/html/2606.03108](https://arxiv.org/html/2606.03108)
Guhong Chen¹, Yingcheng Shi², Yongbin Li²,†, Binhua Li², Xander Xu³, Hu Wei³, Shiwen Ni⁴, Min Yang¹,⁴,†, Jieping Ye²

¹Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; ²Tongyi Lab, Alibaba Group; ³Alibaba Group; ⁴SUAT

###### Abstract

Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering, EvoTrainer matches or exceeds the human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, evolving diagnostics prevent invalid high-scoring branches from being promoted, and reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.


†Corresponding authors: shuide.lyb@alibaba-inc.com, min.yang@siat.ac.cn

## 1 Introduction

AI systems are beginning to participate in model development by editing code, launching experiments, inspecting outcomes, and proposing new training versions\(Karpathy,[2026](https://arxiv.org/html/2606.03108#bib.bib15); Lu et al\.,[2024](https://arxiv.org/html/2606.03108#bib.bib21); Yamada et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib31); Ning et al\.,[2026](https://arxiv.org/html/2606.03108#bib.bib22); Jeddi et al\.,[2026](https://arxiv.org/html/2606.03108#bib.bib14)\)\. These systems suggest that future model improvement may depend not only on human\-designed training recipes, but also on agents that can iteratively revise them through empirical feedback\.

![Refer to caption](https://arxiv.org/html/2606.03108v1/x1.png)
Figure 1: Overview of EvoTrainer: an autonomous training framework that co-evolves LLM policies and training-side diagnostic harnesses, exceeding the human-engineered RL baseline on SWE-9B by +4.39 BC%.

However, most autonomous experimentation systems still keep the decision infrastructure around training largely fixed: they search over candidate recipes yet rely on the same diagnostic views, memory, and intervention logic to interpret each new result. This is limiting in complex reinforcement learning, where the dominant bottleneck may shift from reward sparsity to behavior collapse, from evaluation artifacts to low-information rollout groups, or from recipe selection to the need for reusable diagnostic tools. Scalar validation scores are one visible failure mode; the broader problem is that the evidence and procedures needed to guide training may themselves need to evolve.

This challenge is especially pronounced in agentic RL\. In such settings, a model may search files, invoke tools, edit code, execute tests, inspect error messages, and submit a final solution only after many turns\. The resulting training process is difficult to steer with a fixed diagnostic template: successful scores may hide reward leakage or unhealthy behaviors, failed branches may reveal valuable negative evidence, and later versions may require analyses that were unnecessary at earlier stages\. The diagnostics needed to interpret training outcomes therefore often change across versions and are difficult to specify fully in advance\.

This paper studies the training system itself as an object of improvement\. We use the term trainer to denote the decision\-making system that observes completed versions, analyzes rollout evidence, proposes interventions, updates diagnostic infrastructure, and determines what should be tested next\. The policy improves within a training run; the trainer improves across training runs by accumulating evidence, revising its harness, and reusing operational skills\.

We propose EvoTrainer (Figure [1](https://arxiv.org/html/2606.03108#S1.F1)), an autonomous training framework that co-evolves LLM policies and training-side diagnostic harnesses through two coupled processes: policy self-evolution, where runnable training versions are generated, compared, pruned, promoted, and merged through controlled interventions; and trainer self-reflection, where the training-side harness evolves when existing metrics, analyzers, backtests, or search procedures are insufficient. A persistent memory and reusable skill library let later iterations retrieve failed-branch lessons, diagnostic scripts, and previously validated mechanisms. The trainer agent autonomously runs this loop by constructing versions, diagnosing outcomes, revising the harness, and proposing interventions, while humans bootstrap the workspace and approve costly or consequential execution (Section [3.5](https://arxiv.org/html/2606.03108#S3.SS5)).

Table [1](https://arxiv.org/html/2606.03108#S1.T1) situates EvoTrainer among representative autonomous experimentation systems. AutoResearch and Bilevel Autoresearch target training-recipe optimization on GPT pretraining benchmarks (Karpathy, [2026](https://arxiv.org/html/2606.03108#bib.bib15); Qu and Lu, [2026](https://arxiv.org/html/2606.03108#bib.bib24)). GEAR introduces population-style search over agentic code (Jeddi et al., [2026](https://arxiv.org/html/2606.03108#bib.bib14)); Meta-Harness and AHE optimize inference-side harnesses for LLM applications (Lee et al., [2026](https://arxiv.org/html/2606.03108#bib.bib16); Lin et al., [2026](https://arxiv.org/html/2606.03108#bib.bib18)). To our knowledge, EvoTrainer is the first autonomous training framework in agentic LLM RL to treat the training-side diagnostic harness itself as an evolving object.

Table 1: Capability comparison with representative autonomous experimentation systems. Inference-side harness evolution optimizes scaffolding around the model at inference time (context, tools, memory); training-side harness evolution revises the diagnostic infrastructure that interprets training-time outcomes. Symbols: ✓ = capability supported; ✗ = not present per the cited paper; – = unclear from the cited paper.

We evaluate EvoTrainer on mathematical reasoning, competitive-programming code generation, and repo-level software engineering. EvoTrainer consistently improves over the no-RL base in every domain and matches or exceeds the human-engineered RL references developed under the same data, codebase, and evaluation protocol, with the largest gain on SWE-9B: 38.16 BC% versus 30.19 for no-RL and 33.77 for the human-engineered RL baseline. Component-level analyses further show that retained strategies diverge across domains, the evolving harness rejects invalid high-scoring branches, and retained skills alter later search, providing process-level evidence beyond score-driven iteration.

Our contributions are threefold: \(i\) we formulate autonomous model training as cross\-version trainer improvement, where adaptation targets both the model recipe and the decision infrastructure that interprets outcomes; \(ii\) we introduce EvoTrainer, a dual\-evolution framework that jointly develops policy versions and a training\-side diagnostic harness through signal diagnosis, harness revision, persistent memory, and reusable skills; and \(iii\) we evaluate EvoTrainer across Math, Coding, and SWE, showing that it matches or exceeds human\-engineered RL references and providing process\-level evidence beyond score\-dominant iteration\.

## 2 Related Work

### 2.1 Autonomous Research and Self-Evolving Experimentation

Recent work automates increasing portions of scientific discovery and model development\. AutoResearch demonstrates a propose–train–evaluate loop on GPT pretraining benchmarks, while Bilevel Autoresearch meta\-optimizes the inner research loop\(Karpathy,[2026](https://arxiv.org/html/2606.03108#bib.bib15); Qu and Lu,[2026](https://arxiv.org/html/2606.03108#bib.bib24)\)\. Specialist\-agent frameworks cast training\-recipe optimization as auditable trajectories with failure\-aware feedback\(Ning et al\.,[2026](https://arxiv.org/html/2606.03108#bib.bib22)\)\. The AI Scientist line extends this toward end\-to\-end scientific discovery\(Lu et al\.,[2024](https://arxiv.org/html/2606.03108#bib.bib21); Yamada et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib31)\)\. GEAR introduces population\-based search over agentic code agents\(Jeddi et al\.,[2026](https://arxiv.org/html/2606.03108#bib.bib14)\)\. A related line of self\-improving systems evolves code, algorithms, or training curricula through empirical feedback\(Zhang et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib35); Novikov et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib23); Huang et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib11); Yu et al\.,[2026](https://arxiv.org/html/2606.03108#bib.bib33); Tao et al\.,[2024](https://arxiv.org/html/2606.03108#bib.bib27)\)\. EvoTrainer extends this line to agentic LLM RL training, where the trainer must additionally evolve a training\-side diagnostic harness to interpret rollout\-level evidence and steer interventions across versions\.

### 2.2 Harness and Infrastructure Optimization for LLM Systems

System performance depends on infrastructure surrounding the model, not only on model weights\. Meta\-Harness searches over harness code and shows automatically discovered task\-side harnesses can outperform hand\-engineered designs\(Lee et al\.,[2026](https://arxiv.org/html/2606.03108#bib.bib16)\)\. AHE evolves coding\-agent harnesses through observability\-driven trajectory analysis\(Lin et al\.,[2026](https://arxiv.org/html/2606.03108#bib.bib18)\); TDScaling uses diversity\-sensitive diagnostics to steer trajectory synthesis for code agents\(Chen et al\.,[2026](https://arxiv.org/html/2606.03108#bib.bib5)\)\. These works target inference\-time scaffolding around an LLM rather than the training process itself\. EvoTrainer operates on a different layer: a training\-side diagnostic harness that accumulates metrics, analyzers, backtests, retrieved evidence, and reusable skills to interpret policy\-version outcomes during RL training\. The signal substrate it targets—reward distributions, group variance, behavioral trajectories, dead\-group ratios, and cross\-version intervention evidence—is not primarily addressed by inference\-time harness work\. Adapting those systems would require reconstructing their search and evaluation logic around training\-time RL artifacts\.

### 2.3 Task-Adaptive RL Design for Verifiable and Agentic Training

Recent RL methods for language models introduce specialized mechanisms: group\-relative updates in GRPO\(Shao et al\.,[2024](https://arxiv.org/html/2606.03108#bib.bib26)\), Clip\-Higher and dynamic sampling in DAPO\(Yu et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib32)\), and sequence\-level optimization in GSPO\(Zheng et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib36)\)\. Complementary work in verifiable\-reward training documents optimization bias, diversity\-aware reward design, adaptive guidance, and verifiable\-environment construction\(Liu et al\.,[2025a](https://arxiv.org/html/2606.03108#bib.bib19); Chen et al\.,[2025b](https://arxiv.org/html/2606.03108#bib.bib7); Liu et al\.,[2025b](https://arxiv.org/html/2606.03108#bib.bib20); Zeng et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib34); Huang et al\.,[2026](https://arxiv.org/html/2606.03108#bib.bib12)\), indicating RL recipes are highly sensitive to task structure, reward granularity, data regime, and model scale\.

This sensitivity sharpens in agentic RL: RAGEN identifies Echo Trap and motivates trajectory\-level stabilization\(Wang et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib29)\); RAGEN\-2 proposes SNR\-aware variance filtering\(Wang et al\.,[2026](https://arxiv.org/html/2606.03108#bib.bib28)\)\. Long\-horizon and tool\-using agent studies document strong sensitivity to reward shaping and environment stability\(Chen et al\.,[2025a](https://arxiv.org/html/2606.03108#bib.bib6); Wu et al\.,[2026](https://arxiv.org/html/2606.03108#bib.bib30)\)\. EvoTrainer addresses this adaptation layer by letting the trainer diagnose version\-specific failures, retrieve or revise candidate mechanisms, and retain only interventions supported by cross\-version evidence\.

## 3 EvoTrainer: Co-Evolving LLM Policies and Training Harnesses

![Refer to caption](https://arxiv.org/html/2606.03108v1/x2.png)
Figure 2: Overview of EvoTrainer. The upper loop evolves policy versions through controlled exploration, training, evidence collection, and intervention planning; the middle loop evolves the training-side diagnostic harness; and the bottom layer stores persistent memory and reusable skills. The training-core panel illustrates the SWE instantiation.

### 3.1 Versioned Autonomous Training

An autonomous trainer must do more than execute training jobs: given a completed version, it must determine what the outcome means, which intervention was responsible, what failure mode emerged, and which direction should be tested next\. We therefore formulate autonomous training as a sequence of evidence\-conditioned version transitions in which both policy versions and the training\-side diagnostic harness evolve jointly\.

Let $v_0, v_1, \ldots, v_n$ denote the evolving policy versions. Each version $v_i$ produces artifacts $\mathcal{A}_i = \{\mathrm{metrics}_i, \mathrm{rollouts}_i, \mathrm{configs}_i, \mathrm{logs}_i, \mathrm{diffs}_i\}$, which are interpreted by the current training harness $h_i$. We summarize a completed training state as $\mathcal{T}_i = (v_i, h_i, \mathcal{A}_i, d_i, \Delta_i, \omega_i)$, where $d_i$ is the diagnosis of the current version, $\Delta_i$ is the proposed intervention, and $\omega_i$ is the observed outcome. The outcome may indicate improvement, regression, mixed evidence, or insufficient evidence.
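To make the state tuple concrete, the following minimal Python sketch encodes one completed training state as a record. The class names, field types, and example values are illustrative assumptions; the paper does not specify how these objects are represented in the implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class Outcome(Enum):
    """The four outcome labels named in the text (omega_i)."""
    IMPROVEMENT = "improvement"
    REGRESSION = "regression"
    MIXED = "mixed"
    INSUFFICIENT = "insufficient"


@dataclass
class Artifacts:
    """A_i: what a finished version leaves behind (field types are assumed)."""
    metrics: dict[str, float] = field(default_factory=dict)
    rollouts: list[Any] = field(default_factory=list)
    configs: dict[str, Any] = field(default_factory=dict)
    logs: list[str] = field(default_factory=list)
    diffs: list[str] = field(default_factory=list)


@dataclass
class TrainingState:
    """T_i = (v_i, h_i, A_i, d_i, Delta_i, omega_i)."""
    version: str        # v_i
    harness: str        # h_i, here just an identifier of the harness revision
    artifacts: Artifacts
    diagnosis: str      # d_i
    intervention: str   # Delta_i
    outcome: Outcome    # omega_i


# Hypothetical example record; none of these values come from the paper.
state = TrainingState(
    version="v3",
    harness="h3",
    artifacts=Artifacts(metrics={"val_score": 0.0}),
    diagnosis="hypothetical: many rollout groups carry no reward variance",
    intervention="hypothetical: drop zero-variance groups before the update",
    outcome=Outcome.MIXED,
)
```

Keeping the outcome as a four-valued label, instead of a boolean, mirrors the point that a run should not be collapsed into a keep-or-reject event.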

This formulation makes the version transition the unit of autonomous improvement: a score\-improving version may still expose a fragile reward design, a regressed branch may reveal a harmful intervention, and a mixed outcome may show one bottleneck resolved while another becomes visible\. EvoTrainer preserves these distinctions instead of collapsing each run into a binary keep\-or\-reject event\. The two layers are then developed separately: policy self\-evolution \(Section[3\.2](https://arxiv.org/html/2606.03108#S3.SS2)\) and training\-side harness evolution \(Section[3\.3](https://arxiv.org/html/2606.03108#S3.SS3)\)\.

### 3.2 Policy Self-Evolution through Version-Controlled Exploration

The trainer agent first constructs a runnable training version with executable launch scripts, reward wiring, configuration files, and evaluation hooks; subsequent versions evolve through controlled exploration\.

At version $v_i$, EvoTrainer creates a candidate set $\mathcal{B}_i = \{b_i^{(1)}, b_i^{(2)}, \ldots, b_i^{(K)}\}$, where each branch applies an intervention $\Delta_i^{(k)}$ to the current baseline:

$$b_i^{(k)} = v_i \oplus \Delta_i^{(k)}.$$

Here, $\oplus$ denotes a versioned modification to the current training recipe. A candidate change may target rewards, filtering, data selection, rollout settings, optimization choices, or tool-use behavior.

EvoTrainer defaults to single\-factor interventions for clean attribution; bundled changes are admitted only when each component has prior independent support or when the interaction itself is being tested, with the rationale recorded in the version ledger and ambiguous outcomes revisited through targeted ablations or backtests\.
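One way to picture the branch construction and the single-factor default is to treat a training recipe as a nested configuration and an intervention as a sparse override. The sketch below is an assumption made for illustration: the function names, the dictionary representation, and the example keys are hypothetical, and the paper materializes branches as isolated worktrees, not in-memory dictionaries.

```python
from copy import deepcopy


def apply_intervention(baseline: dict, delta: dict) -> dict:
    """Return a new recipe (baseline with delta applied); the baseline is not modified."""
    branch = deepcopy(baseline)
    for key, value in delta.items():
        if isinstance(value, dict) and isinstance(branch.get(key), dict):
            # Recurse so that a nested override only touches the named leaf.
            branch[key] = apply_intervention(branch[key], value)
        else:
            branch[key] = value
    return branch


def count_changed_leaves(delta: dict) -> int:
    """Number of leaf settings that the override touches."""
    return sum(
        count_changed_leaves(v) if isinstance(v, dict) else 1
        for v in delta.values()
    )


def is_single_factor(delta: dict) -> bool:
    """True when exactly one leaf setting changes, which keeps attribution clean."""
    return count_changed_leaves(delta) == 1


# Hypothetical recipe and interventions.
baseline = {"reward": {"w_sbe": 0.1, "w_ett": 0.1}, "filter": {"drop_dead_groups": False}}
single = {"reward": {"w_ett": 0.15}}
bundled = {"reward": {"w_ett": 0.15}, "filter": {"drop_dead_groups": True}}

branch = apply_intervention(baseline, single)
```

Under this representation, a bundled override such as `bundled` would be flagged by `is_single_factor` and would need a recorded rationale before being admitted.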

Branches are materialized as isolated worktrees and can be trained in parallel when resources permit. After training, the trainer compares the resulting evidence and recommends whether to keep, prune, revert, or merge each branch. A promoted branch becomes the next baseline $v_{i+1}$, while failed or ambiguous branches remain useful as negative evidence. This version-control discipline makes autonomous experimentation auditable through an explicit lineage of explored interventions and their consequences.

### 3.3 Trainer Reflection and Harness Evolution

A fixed diagnostic harness is often insufficient for long\-horizon autonomous training\. Early versions may be analyzable through coarse validation trends, whereas later versions may require new behavioral metrics, reward audits, group\-variance statistics, failure taxonomies, retrieval procedures, or backtesting scripts\. EvoTrainer therefore treats the training harness itself as an evolving object\.

We organize the harness around four diagnostic layers: *score* (validation metrics, benchmark-level improvements), *signal* (reward variance, dead-group ratio, component-level reward adoption), *behavior* (tool-use patterns, search/edit/test structure, trajectory length, degeneration modes, environment-side anomalies), and *version* (cross-branch promotion, rejection, and retention decisions). These layers are complementary: reward statistics from a rollout provide signal evidence, while action patterns provide behavior evidence.

A diagnostic gap appears when existing evidence cannot support one of three functions: explaining a version outcome, distinguishing competing hypotheses, or selecting a defensible next intervention\. EvoTrainer upgrades the harness along four axes: metric expansion \(e\.g\., dead\-group ratio, rollout diversity, first\-edit turn\); analyzer specialization \(per\-domain analyzer routines\); procedure revision \(e\.g\., shifting from score\-first to behavior\-first analysis\); and external evidence retrieval \(querying papers, repositories, and prior traces to synthesize candidate interventions or harness updates\)\.
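As an illustration of metric expansion, the sketch below computes two of the metrics named above: dead-group ratio (a signal-layer metric) and first-edit turn (a behavior-layer metric). The input formats, a list of per-group reward lists and a list of action labels per trajectory, are assumptions made for this example and are not taken from the paper.

```python
def dead_group_ratio(groups):
    """Fraction of rollout groups whose rewards are all identical (zero variance)."""
    if not groups:
        return 0.0
    dead = sum(1 for rewards in groups if max(rewards) == min(rewards))
    return dead / len(groups)


def first_edit_turn(actions):
    """1-based turn index of the first 'edit' action, or None if the agent never edits."""
    for turn, action in enumerate(actions, start=1):
        if action == "edit":
            return turn
    return None


# Hypothetical rollout evidence.
reward_groups = [[0, 0, 0], [1, 0, 1], [1, 1, 1], [0, 1, 0]]
trajectory = ["search", "read", "edit", "test"]

ratio = dead_group_ratio(reward_groups)
turn = first_edit_turn(trajectory)
```

A harness that only tracked validation score would see neither quantity; adding them is the kind of upgrade the text calls metric expansion.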

Training-signal revisions surface from harness reflection but are executed via the policy-evolution mechanism of Section [3.2](https://arxiv.org/html/2606.03108#S3.SS2); Section [4.4](https://arxiv.org/html/2606.03108#S4.SS4) examines their empirical effect.

### 3.4 Persistent Memory and Reusable Skill Library

EvoTrainer maintains a persistent memory layer $\mathcal{M} = \{\mathcal{L}_{\mathrm{ver}}, \mathcal{C}_{\mathrm{case}}, \mathcal{S}_{\mathrm{skill}}, \mathcal{S}_{\mathrm{search}}\}$. $\mathcal{L}_{\mathrm{ver}}$ stores model lineage, configuration diffs, Git diffs, and keep/prune/merge decisions. $\mathcal{C}_{\mathrm{case}}$ stores recurring patterns: score-and-behavior divergence, low-information rollout groups, or interventions that repeatedly fail under a particular domain structure. $\mathcal{S}_{\mathrm{skill}}$ stores reusable analyzer skills, repair strategies, and procedure templates that, once validated, later versions can call, extend, or adapt. $\mathcal{S}_{\mathrm{search}}$ stores retrieval queries, external sources, distilled insights, and adopted patches or procedures.

Memory thus turns EvoTrainer from isolated experiments into a cumulative process: a SWE filtering utility is later retrieved in Math and Coding when zero\-variance groups recur, showing operational rather than archival reuse \(Section[4\.4](https://arxiv.org/html/2606.03108#S4.SS4)\)\.
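The four memory stores and the cross-domain retrieval described here can be sketched as a small data structure. Everything below is a hypothetical illustration: the class layout, the tag-based lookup, and the tag strings are assumptions, and only the store roles and the skill name StdGroupFilter (mentioned in Section 4.3) come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    origin_domain: str
    tags: set
    validated: bool = False


@dataclass
class Memory:
    version_ledger: list = field(default_factory=list)   # L_ver: lineage, diffs, decisions
    case_patterns: list = field(default_factory=list)    # C_case: recurring patterns
    skills: list = field(default_factory=list)           # S_skill: reusable skills
    search_records: list = field(default_factory=list)   # S_search: queries and insights

    def retrieve_skills(self, tag: str) -> list:
        """Return validated skills carrying the given symptom tag (assumed lookup rule)."""
        return [s for s in self.skills if s.validated and tag in s.tags]


memory = Memory()
memory.skills.append(
    Skill("StdGroupFilter", "SWE", {"zero-variance-groups"}, validated=True)
)
memory.skills.append(
    Skill("hypothetical_unvalidated_skill", "SWE", {"zero-variance-groups"})
)

# A later Math or Coding run that observes zero-variance groups looks up matching skills.
hits = [s.name for s in memory.retrieve_skills("zero-variance-groups")]
```

Restricting retrieval to validated skills reflects the statement that skills become callable by later versions once validated.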

### 3.5 Trainer-Agent Implementation and Autonomy Scope

In our experiments, the trainer is instantiated with Claude Sonnet 4.6 (Anthropic, [2026](https://arxiv.org/html/2606.03108#bib.bib3)); the workflow is replaceable at the trainer-model interface, provided the trainer has access to repository files, shell execution, experiment artifacts, and retrieval utilities. After humans bootstrap the task, benchmark, data, compute environment, and initial playbooks, the trainer agent autonomously runs the diagnostic loop and recommends interventions, while costly or consequential actions remain human-gated (Table [2](https://arxiv.org/html/2606.03108#S3.SS5)). This separation makes the training cognition loop autonomous while keeping costly or irreversible execution decisions under human control.

Table 2: Autonomy scope in EvoTrainer: humans set up the workspace and gate costly or consequential execution, while the trainer agent performs the core iterative diagnostic loop.

### 3.6 SWE Training-Core Instantiation

The final SWE trajectory uses a GRPO-style training core (Shao et al., [2024](https://arxiv.org/html/2606.03108#bib.bib26)) with asymmetric Clip-Higher (Yu et al., [2025](https://arxiv.org/html/2606.03108#bib.bib32)) bounds and weak KL regularization to a fixed reference policy (full objective in Appendix [A](https://arxiv.org/html/2606.03108#A1)). For each prompt $q$, the old policy samples $G$ trajectories $\{o_1, \ldots, o_G\}$, each receiving a scalar reward $r_i$. The group-relative advantage is

$$A_i = \frac{r_i - \mu_g}{\sigma_g + \epsilon}, \qquad \mu_g = \frac{1}{G} \sum_{j=1}^{G} r_j, \quad \sigma_g = \mathrm{std}(r_1, \ldots, r_G).$$

We broadcast $A_i$ to all tokens of trajectory $o_i$, avoiding a learned value model but making group variance critical: when $\sigma_g = 0$, the group yields no useful relative learning signal.
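A minimal Python sketch of the group-relative advantage follows. Two details are assumptions not fixed by the text: the standard deviation is taken as the population standard deviation, and $\epsilon$ is set to `1e-6`.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mu_g) / (sigma_g + eps) for one group of G trajectory rewards."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the std convention is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields positive and negative advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A zero-variance group yields all-zero advantages, so it carries no learning signal.
dead = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The second call shows why group variance matters: with identical rewards every numerator is zero, so the group contributes nothing to the policy update regardless of $\epsilon$.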

The final SWE reward design combines task correctness and behavior\-sensitive components:

$$r_i = 1.0 \cdot \mathrm{CR}_i + 0.1 \cdot \mathrm{IF}_i + 0.1 \cdot \mathrm{SBE}_i + 0.15 \cdot \mathrm{ETT}_i.$$

Here, $\mathrm{CR}$ is hidden-test correctness, $\mathrm{IF}$ is instruction-following reward from a frozen judge, $\mathrm{SBE}$ rewards search-before-edit behavior, and $\mathrm{ETT}$ rewards edit-then-test behavior. These components are not introduced as a fixed handcrafted recipe; they emerge through the versioned evolution process analyzed in Section [4.4](https://arxiv.org/html/2606.03108#S4.SS4) and Appendix [D](https://arxiv.org/html/2606.03108#A4). The asymmetric SBE/ETT weighting reflects a trainer-side decision: an earlier symmetric variant ($0.1 \cdot \mathrm{SBE} + 0.1 \cdot \mathrm{ETT}$) produced reward ties across a substantial fraction of trajectories, collapsing within-group variance and neutralizing the procedural reward signal.
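The tie problem can be illustrated with a small Python sketch of the composite reward. The component values are assumed to lie in $[0, 1]$, and the two example trajectories are hypothetical; the paper does not describe which specific trajectories tied.

```python
def swe_reward(cr, if_score, sbe, ett, w_sbe=0.1, w_ett=0.15):
    """r = 1.0*CR + 0.1*IF + w_sbe*SBE + w_ett*ETT (defaults are the final weights)."""
    return 1.0 * cr + 0.1 * if_score + w_sbe * sbe + w_ett * ett


# Two hypothetical failing trajectories with the same judge score:
# one searched before editing but never tested, the other edited then tested.
searcher = dict(cr=0, if_score=0.5, sbe=1, ett=0)
tester = dict(cr=0, if_score=0.5, sbe=0, ett=1)

# Earlier symmetric variant: both trajectories receive the same reward.
sym_a = swe_reward(**searcher, w_sbe=0.1, w_ett=0.1)
sym_b = swe_reward(**tester, w_sbe=0.1, w_ett=0.1)

# Final asymmetric weights: the rewards differ, so the group keeps some variance.
asym_a = swe_reward(**searcher)
asym_b = swe_reward(**tester)
```

With symmetric weights, trajectories that trade one procedural behavior for the other become indistinguishable, which is one plausible route to the reward ties described above.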

Before advantage normalization, EvoTrainer applies variance\-aware group filtering\. Groups with zero reward variance are removed as dead groups, and an adaptive filtering rule suppresses low\-information groups before policy updates\. This mechanism is especially important in SWE, where sparse hidden\-test outcomes can otherwise produce large fractions of non\-informative groups\. Math and Coding use the same GRPO core but with domain\-specific reward structures that emerge through the evolution process described in Section[4\.3](https://arxiv.org/html/2606.03108#S4.SS3)\.
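The filtering step can be sketched as follows. Dead-group removal follows the text directly; the adaptive rule for low-information groups is not specified in this section, so the fixed `min_std` threshold below is a hypothetical stand-in, not the paper's rule.

```python
from statistics import pstdev


def filter_groups(groups, min_std=0.0):
    """Drop dead groups, then drop groups whose reward std is at or below min_std."""
    kept = []
    for rewards in groups:
        if max(rewards) == min(rewards):
            continue  # dead group: zero reward variance
        if pstdev(rewards) <= min_std:
            continue  # low-information group under the assumed threshold
        kept.append(rewards)
    return kept


# Hypothetical binary hidden-test outcomes for four groups of four rollouts.
groups = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]]

only_dead_removed = filter_groups(groups)
stricter = filter_groups(groups, min_std=0.45)
```

In this example half of the groups are dead, which is the situation the text describes for sparse hidden-test rewards in SWE.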

## 4 Experiments

We evaluate EvoTrainer on three domains with distinct training and diagnostic demands: mathematical reasoning, competitive\-programming code generation, and repo\-level software engineering\. Math and Coding are single\-turn generation settings, whereas SWE requires long\-horizon agent–tool interaction in an executable environment\. Together, these domains test whether EvoTrainer remains useful across training regimes with increasingly complex decision requirements, from single\-turn verifiable tasks to long\-horizon agentic interaction\.

### 4.1 Experimental Setup

#### Tasks and Data\.

For Math, we train on 6,429 problems from BigMath\-Hard\(Albalak et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib2)\)\(one example overlapping AIME 2024 P5 removed\) and evaluate on 78 competition problems: 30 AIME 2024, 30 AIME 2025, 18 CNMO 2024\. For Coding, we train on 11,897 verified problems from TACO\-verified\(Li et al\.,[2023](https://arxiv.org/html/2606.03108#bib.bib17)\)and evaluate on 175 held\-out problems from a recent LiveCodeBench\-v6 subset of AtCoder Beginner Contest tasks, with earlier\-released problems excluded to avoid contamination\(Jain et al\.,[2024](https://arxiv.org/html/2606.03108#bib.bib13)\); training and validation are disjoint, yielding an out\-of\-distribution protocol\. For SWE, we train on 8,622 instances from the swe\-rebench\-v6 train split and evaluate on 77 held\-out Python instances from the corresponding test split\(Badertdinov et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib4)\); each instance runs in a Docker environment judged by hidden fail\-to\-pass tests, with the tool scaffold, interaction protocol, and evaluation harness frozen across compared methods\.

#### Evaluation Protocol\.

All domains use Avg@8 (mean over 8 independent rollouts per item) with a single random seed (seed 42). Math correctness is determined by a frozen Qwen3.5-4B judge (Qwen Team, [2026](https://arxiv.org/html/2606.03108#bib.bib25)); Coding correctness by stdin/stdout execution; SWE by BC%, the proportion of valid executions passing hidden fail-to-pass tests (excluding infrastructure-level failures that do not reflect model behavior).
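The two metrics can be sketched in a few lines of Python. The input encodings (0/1 correctness lists and `"pass"` / `"fail"` / `"infra_error"` labels) are assumptions for illustration, and how BC% is combined with the 8 rollouts per item is not detailed in this section.

```python
def avg_at_k(per_item_results):
    """Mean over items of the per-item mean correctness across k rollouts."""
    per_item = [sum(results) / len(results) for results in per_item_results]
    return sum(per_item) / len(per_item)


def bc_percent(executions):
    """Percentage of valid executions that pass, ignoring infrastructure failures."""
    valid = [e for e in executions if e != "infra_error"]
    if not valid:
        return float("nan")
    return 100.0 * sum(e == "pass" for e in valid) / len(valid)


# Three hypothetical items with 8 rollouts each.
score = avg_at_k([[1] * 8, [0] * 8, [1, 1, 1, 1, 0, 0, 0, 0]])

# Four hypothetical executions, one of which failed for infrastructure reasons.
bc = bc_percent(["pass", "fail", "infra_error", "pass"])
```

Excluding infrastructure failures from the denominator keeps environment flakiness from being counted as model error.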

#### Baselines\.

We compare EvoTrainer against three groups of references\. The first contains the no\-RL base model, the strongest human\-engineered RL configuration \(developed by the same research team under the same codebase, model family, data, and evaluation protocol\), and AutoResearch\(Karpathy,[2026](https://arxiv.org/html/2606.03108#bib.bib15)\)as a score\-only autonomous iteration reference\. The second contains representative algorithmic baselines instantiated within the same training stack: GRPO and GSPO variants\(Shao et al\.,[2024](https://arxiv.org/html/2606.03108#bib.bib26); Zheng et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib36)\), Clip\-Higher stabilization from DAPO\(Yu et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib32)\), RAGEN\-style filtering variants\(Wang et al\.,[2025](https://arxiv.org/html/2606.03108#bib.bib29),[2026](https://arxiv.org/html/2606.03108#bib.bib28)\), and a group\-variance filtering baseline\. The third reports the retained EvoTrainer version\. Appendix[E](https://arxiv.org/html/2606.03108#A5)reports search\-budget, training\-compute, and trainer\-agent inference accounting for transparency\. The other autonomous experimentation frameworks listed in Table[1](https://arxiv.org/html/2606.03108#S1.T1)target search spaces and evaluators that differ from LLM RL training and are not directly comparable\.

### 4.2 Does EvoTrainer Improve Training Across Domains?

Table 3: Main results across Math, Coding, and SWE. Values in parentheses denote absolute improvements of EvoTrainer over the corresponding no-RL base model. The AutoResearch (Karpathy, [2026](https://arxiv.org/html/2606.03108#bib.bib15)) row reflects score-only autonomous iteration; the SWE-9B value matches the documented score-dominant subpath ceiling (Section [4.4](https://arxiv.org/html/2606.03108#S4.SS4)).

#### EvoTrainer matches or exceeds prior baselines across all settings.

Table [3](https://arxiv.org/html/2606.03108#S4.T3) shows that EvoTrainer achieves the strongest score in every reported column. Against the no-RL base model, gains are statistically significant in every domain (paired Wilcoxon $p < 0.001$ throughout; Appendix [F](https://arxiv.org/html/2606.03108#A6)). Against the human-engineered RL reference, EvoTrainer delivers statistically significant improvements on SWE-9B ($\Delta = +4.39$, $95\%$ CI $[+2.61, +6.34]$, $p < 0.001$) and Math ($\Delta = +2.88$, $p < 0.001$), while matching the human reference within the bootstrap CI on SWE-4B and Coding ($p > 0.1$ in both); full statistics are reported in Appendix [F](https://arxiv.org/html/2606.03108#A6). EvoTrainer also exceeds AutoResearch's score-only iteration in every domain, including Coding, where AutoResearch falls below the no-RL base. We therefore treat the SWE-9B and Math results as the principal quantitative claims, while the SWE-4B and Coding results show that EvoTrainer attains human-engineered performance under autonomous control rather than under expert manual tuning.
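For readers unfamiliar with paired bootstrap intervals, the sketch below shows one common construction: a percentile bootstrap over per-item score differences. This is an assumption about the general technique only; the exact procedure and the Wilcoxon test are specified in Appendix F, which is not reproduced here, and the example scores are hypothetical.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-item difference (a minus b)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Hypothetical per-item Avg@8 scores for two systems on the same four items.
system_a = [0.5, 0.6, 0.7, 0.8]
system_b = [0.4, 0.5, 0.6, 0.7]
delta, ci_low, ci_high = paired_bootstrap_ci(system_a, system_b)
```

Resampling whole items preserves the pairing between systems, which is why the interval concerns the difference and not each system's score separately.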

#### Retained recipes diverge across domains\.

RAGEN v2 SNR Filtering is the strongest algorithmic baseline in every column, yet EvoTrainer improves on it throughout\. More importantly, the retained EvoTrainer recipes differ substantially across domains—Math toward computation\-aware tool augmentation, Coding toward execution\-aligned reward shaping with variance\-aware filtering, SWE toward a richer behavior\-sensitive training pathway—indicating that EvoTrainer adapts interventions to domain\-specific bottlenecks rather than selecting one universal template\.

### 4\.3How Does EvoTrainer Adapt Across Math and Coding?

![Refer to caption](https://arxiv.org/html/2606.03108v1/x3.png)

Figure 3: Per-version score trajectories on the promoted path for each training condition. Stars mark the final retained version. Dashed lines indicate $v_{0}$ (no-RL base). (a) SWE-9B BC%; (b) SWE-4B BC%; (c) Math aggregate Avg@8 over AIME 2024 / AIME 2025 / CNMO 2024; (d) Coding Avg@8.

We first develop EvoTrainer in the SWE setting, where long-horizon agentic interaction demands the richest diagnostic harness. Math and Coding are then explored with access to the skills and case memory accumulated during SWE. The full SWE version trajectory is reported in Appendix [D](https://arxiv.org/html/2606.03108#A4); here we focus on how later domains diverge from one another once that reusable infrastructure is available.

Table 4: Component-level counterfactual evidence drawn from the EvoTrainer trajectory. Each row isolates one EvoTrainer component by comparing the trainer's actual decision with a natural counterfactual already present in the experiment record.

In Math, the early diagnosis identifies a response-length bottleneck affecting difficult problems. After length-budget correction and reward-side refinement, residual errors concentrate on computation-heavy cases. A transferred variance-aware filter improves group-level signal, but the dominant gap remains computational; the final intervention therefore integrates a Code Interpreter (per-benchmark scores in Table [3](https://arxiv.org/html/2606.03108#S4.T3)).

Coding follows a different trajectory\. The early diagnosis reveals a measurement\-side artifact: many zero\-reward outputs are format\-gate / truncation artifacts rather than semantic coding failures, prompting an output\-protocol repair before reward redesign\. The trainer then replaces binary correctness with shaped continuous CR based on passed\-test ratio, recovering partial execution progress lost under binary scoring\. Residual zero\-variance groups motivate cross\-domain reuse: the trainer retrieves StdGroupFilter from SWE and adapts it, yielding the final shaped CR plus group\-variance filtering configuration\. Detailed version traces and the transferred skill are in Appendices[B](https://arxiv.org/html/2606.03108#A2)and[C](https://arxiv.org/html/2606.03108#A3)\.

### 4\.4What Happens Without Each Component of EvoTrainer?

The main results establish performance but not why trainer\-level machinery matters\. We identify three natural counterfactuals that each isolate one EvoTrainer component without a separate sweep \(Table[4](https://arxiv.org/html/2606.03108#S4.T4)\)\. EvoTrainer retained 33 versions on promoted paths plus 20 negative\-evidence candidates; full SWE trajectory and statistical details are in Appendices[D](https://arxiv.org/html/2606.03108#A4)and[F](https://arxiv.org/html/2606.03108#A6)\.

#### Richer diagnostics break the v3 saturation\.

In SWE-9B, an early score-dominant subpath shows only incremental gains ($31.04\rightarrow 32.89\rightarrow 33.33$ BC% across v1–v3) through scalar validation comparison and published-style recipe adaptation before saturating. Once richer diagnostics, backtesting, and harness-guided intervention planning are engaged, the trajectory advances to v4 at 36.30 BC% and v8 at 38.16 BC%, a $+4.83$ gain over v3 that score inspection alone did not reach (Appendix [D](https://arxiv.org/html/2606.03108#A4)).

#### Harness audit blocks a Git\-leak false promotion\.

Under an uncleaned SWE-9B repository state, v1 reaches 48.80 BC%, a number a score-only loop would promote as a breakthrough. Harness-level inspection detects that the model achieves this by accessing reference patches through Git history commands (`git show`, `git log`); after sanitization, the legitimate v1 score is 31.04 BC%. Without the audit, this invalid branch would have been promoted on misleading scalar evidence (Appendix [D](https://arxiv.org/html/2606.03108#A4)).
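The paper does not publish the audit code, so the following is only a minimal sketch of the kind of behavior-level check described: scanning the shell commands issued in each rollout for Git history access. The regular expression, the function names, and the rollout log format are all assumptions made for illustration; only the two subcommands `git show` and `git log` come from the text.

```python
import re

# Assumed pattern: the two Git history subcommands named in the text.
GIT_HISTORY_RE = re.compile(r"\bgit\s+(show|log)\b")


def audit_rollout(commands):
    """Return the shell commands in one rollout that touch Git history."""
    return [c for c in commands if GIT_HISTORY_RE.search(c)]


def flag_suspicious(rollouts):
    """Map rollout id -> offending commands, for rollouts with any hit."""
    flagged = {}
    for rollout_id, commands in rollouts.items():
        hits = audit_rollout(commands)
        if hits:
            flagged[rollout_id] = hits
    return flagged


# Hypothetical rollout logs: task id -> list of shell commands the agent ran.
rollouts = {
    "task-001": ["ls", "pytest tests/test_io.py", "sed -n '1,40p' src/io.py"],
    "task-002": ["git log --all --oneline", "git show HEAD~1", "pytest -q"],
}
print(flag_suspicious(rollouts))
```

A pattern match like this only surfaces candidates for inspection. It would miss forms such as `git -C repo log`, so a real audit would likely pair it with environment sanitization, as the paper does.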

#### Skill reuse changes the Coding v9 candidate set\.

Coding v9 illustrates how a retained skill alters the trainer's candidate set. Shaped continuous CR at v8 leaves roughly 31% of rollout groups with near-zero reward variance, matching the SWE v3 $\to$ v4 regime where StdGroupFilter was validated; the reused filter is therefore a natural candidate. The trainer also evaluates stronger entropy regularization and a lower KL coefficient but rejects both on mechanism rather than score grounds: the former would compound v8's length drift, while the latter would not create within-group variance under a structurally degenerate reward signal. The retained filter drives v9 to $50.21$ Avg@8 ($+1.17$) and seeds the Dual-Level Filter that carries v10 to $51.29$ ($+1.08$); the same skill also improves Math by $+0.96$. Without the skill library, this mechanism-matched candidate would be absent, leaving only rejected local alternatives (Appendix [C](https://arxiv.org/html/2606.03108#A3)).

## 5Conclusion

We introduce EvoTrainer, an autonomous trainer for evidence\-conditioned training evolution\. Rather than treating autonomous experimentation as repeated recipe edits with scalar score comparison, EvoTrainer treats policy improvement as a cross\-version process in which runnable training versions, diagnostic harnesses, and reusable skills evolve together; the trainer agent autonomously runs the diagnostic reasoning loop while humans bootstrap the workspace and gate costly or consequential actions\.

Across Math, Coding, and SWE, EvoTrainer consistently improves over no-RL baselines and matches or exceeds the human-engineered RL references, with the largest gain on SWE-9B ($38.16$ vs. $33.77$ BC%). The retained strategies diverge across domains, and component-level counterfactual evidence shows that the gains are not reducible to scalar-score iteration alone. The next step in autonomous model training is to build trainers that learn how to interpret, revise, and improve the training process itself.

## Limitations

The current realization of EvoTrainer is constrained primarily by compute economics. A complete run consumes approximately $4.0\times 10^{8}$ trainer-agent tokens on top of RL training (Appendix [E](https://arxiv.org/html/2606.03108#A5)); total SWE GPU-hours nonetheless remain below the human-engineered RL reference, suggesting that trainer reasoning substitutes for rather than adds to training-side search. The same constraint motivates a single training seed per version, with stochasticity instead reported through per-task paired bootstrap (Appendix [F](https://arxiv.org/html/2606.03108#A6)), following common large-scale LLM-RL practice (Guo et al., [2025](https://arxiv.org/html/2606.03108#bib.bib9); Yu et al., [2025](https://arxiv.org/html/2606.03108#bib.bib32)). Autonomous training with harness evolution further depends on trainer models with strong long-context reasoning and literature-grounded retrieval; our experiments use Claude Sonnet 4.6 (Anthropic, [2026](https://arxiv.org/html/2606.03108#bib.bib3)). Retained trajectories also span only 7 to 10 versions per domain, leaving behavior over hundreds of versions (where case memory and skill libraries may require active pruning or hierarchical organization) as future work.

## Acknowledgments

We gratefully acknowledge the support from Tongyi Lab, Alibaba Group, including computational resources, engineering assistance, and helpful discussions during the development of this work\. We also thank colleagues and collaborators for their feedback and encouragement throughout the project\.

## References

- Agarwal et al\. \(2021\)Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G\. Bellemare\. 2021\.Deep reinforcement learning at the edge of the statistical precipice\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*\.Outstanding Paper Award\.
- Albalak et al\. \(2025\)Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, and Nick Haber\. 2025\.Big\-math: A large\-scale, high\-quality math dataset for reinforcement learning in language models\.*arXiv preprint arXiv:2502\.17387*\.
- Anthropic \(2026\)Anthropic\. 2026\.[Introducing claude sonnet 4\.6](https://www.anthropic.com/news/claude-sonnet-4-6)\.Anthropic announcement\.Accessed: 2026\-05\-21\.
- Badertdinov et al\. \(2025\)Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel\. 2025\.Swe\-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents\.*arXiv preprint arXiv:2505\.20411*\.
- Chen et al\. \(2026\)Guhong Chen, Chenghao Sun, Cheng Fu, Qiyao Wang, Zhihong Huang, Chaopeng Wei, Guangxu Chen, Feiteng Fang, Ahmadreza Argha, Bing Zhao, Xander Xu, Qi Han, Hamid Alinejad\-Rokny, Qiang Qu, Binhua Li, Shiwen Ni, Min Yang, Hu Wei, and Yongbin Li\. 2026\.Beyond quantity: Trajectory diversity scaling for code agents\.*arXiv preprint arXiv:2602\.03219*\.
- Chen et al\. \(2025a\)Kevin Chen, Marco Cusumano\-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, and Philipp Krähenbühl\. 2025a\.Reinforcement learning for long\-horizon interactive llm agents\.*arXiv preprint arXiv:2502\.01600*\.
- Chen et al\. \(2025b\)Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, and Abolfazl Razi\. 2025b\.Dra\-grpo: Your grpo needs to know diverse reasoning paths for mathematical reasoning\.*arXiv preprint arXiv:2505\.09655*\.
- Colas et al\. \(2018\)Cédric Colas, Olivier Sigaud, and Pierre\-Yves Oudeyer\. 2018\.How many random seeds? Statistical power analysis in deep reinforcement learning experiments\.*arXiv preprint arXiv:1806\.08295*\.
- Guo et al\. \(2025\)Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z\. F\. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, and 1 others\. 2025\.[DeepSeek\-R1 incentivizes reasoning in LLMs through reinforcement learning](https://doi.org/10.1038/s41586-025-09422-z)\.*Nature*, 645\(8081\):633–638\.
- Henderson et al\. \(2018\)Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger\. 2018\.Deep reinforcement learning that matters\.In*Proceedings of the Thirty\-Second AAAI Conference on Artificial Intelligence \(AAAI\)*\.
- Huang et al\. \(2025\)Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu\. 2025\.R\-zero: Self\-evolving reasoning llm from zero data\.*arXiv preprint arXiv:2508\.05004*\.
- Huang et al\. \(2026\)Yu Huang, Zixin Wen, Yuejie Chi, Yuting Wei, Aarti Singh, Yingbin Liang, and Yuxin Chen\. 2026\.The implicit curriculum: Learning dynamics in rl with verifiable rewards\.*arXiv preprint arXiv:2602\.14872*\.
- Jain et al\. \(2024\)Naman Jain, King Han, Alex Gu, Wen\-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar\-Lezama, Koushik Sen, and Ion Stoica\. 2024\.Livecodebench: Holistic and contamination free evaluation of large language models for code\.*arXiv preprint arXiv:2403\.07974*\.
- Jeddi et al\. \(2026\)Ahmadreza Jeddi, Minh Ngoc Le, Hakki C\. Karaimer, Konstantinos G\. Derpanis, and Babak Taati\. 2026\.Gear: Genetic autoresearch for agentic code evolution\.*arXiv preprint arXiv:2605\.13874*\.
- Karpathy \(2026\)Andrej Karpathy\. 2026\.[autoresearch: Ai agents running research experiments](https://github.com/karpathy/autoresearch)\.GitHub repository\.Accessed: 2026\-05\-24\.
- Lee et al\. \(2026\)Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn\. 2026\.Meta\-harness: End\-to\-end optimization of model harnesses\.*arXiv preprint arXiv:2603\.28052*\.
- Li et al\. \(2023\)Rongao Li, Jie Fu, Bo\-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li\. 2023\.Taco: Topics in algorithmic code generation dataset\.*arXiv preprint arXiv:2312\.14852*\.
- Lin et al\. \(2026\)Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, and Yu\-Gang Jiang\. 2026\.[Agentic harness engineering: Observability\-driven automatic evolution of coding\-agent harnesses](https://arxiv.org/abs/2604.25850)\.*Preprint*, arXiv:2604\.25850\.
- Liu et al\. \(2025a\)Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin\. 2025a\.Understanding r1\-zero\-like training: A critical perspective\.*arXiv preprint arXiv:2503\.20783*\.
- Liu et al\. \(2025b\)Ziru Liu, Cheng Gong, Xinyu Fu, Yaofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, and Dandan Tu\. 2025b\.Ghpo: Adaptive guidance for stable and efficient llm reinforcement learning\.*arXiv preprint arXiv:2507\.10628*\.
- Lu et al\. \(2024\)Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha\. 2024\.The ai scientist: Towards fully automated open\-ended scientific discovery\.*arXiv preprint arXiv:2408\.06292*\.
- Ning et al\. \(2026\)Jingjie Ning, Xiaochuan Li, Ji Zeng, Hao Kang, and Chenyan Xiong\. 2026\.Auto research with specialist agents develops effective and non\-trivial training recipes\.*arXiv preprint arXiv:2605\.05724*\.
- Novikov et al\. \(2025\)Alexander Novikov, Ngan Vu, Marvin Eisenberger, Emilien Dupont, Po\-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J\. R\. Ruiz, Abbas Mehrabian, M\. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog\. 2025\.Alphaevolve: A coding agent for scientific and algorithmic discovery\.*arXiv preprint arXiv:2506\.13131*\.
- Qu and Lu \(2026\)Yaonan Qu and Meng Lu\. 2026\.Bilevel autoresearch: Meta\-autoresearching itself\.*arXiv preprint arXiv:2603\.23420*\.
- Qwen Team \(2026\)Qwen Team\. 2026\.[Qwen3\.5\-4b](https://huggingface.co/Qwen/Qwen3.5-4B)\.Hugging Face model card\.
- Shao et al\. \(2024\)Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y\. K\. Li, Y\. Wu, and Daya Guo\. 2024\.Deepseekmath: Pushing the limits of mathematical reasoning in open language models\.*arXiv preprint arXiv:2402\.03300*\.
- Tao et al\. \(2024\)Zhengwei Tao, Ting\-En Lin, Xiancai Chen, Hangyu Li, Yuchuan Wu, Yongbin Li, Zhi Jin, Fei Huang, Dacheng Tao, and Jingren Zhou\. 2024\.A survey on self\-evolution of large language models\.*arXiv preprint arXiv:2404\.14387*\.
- Wang et al\. \(2026\)Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei\-Fei, Lijuan Wang, Yejin Choi, and Manling Li\. 2026\.Ragen\-2: Reasoning collapse in agentic rl\.*arXiv preprint arXiv:2604\.06268*\.
- Wang et al\. \(2025\)Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei\-Fei, Lijuan Wang, Yejin Choi, and Manling Li\. 2025\.RAGEN: Understanding self\-evolution in LLM agents via multi\-turn reinforcement learning\.*arXiv preprint arXiv:2504\.20073*\.
- Wu et al\. \(2026\)Xixi Wu, Qianguo Sun, Ruiyang Zhang, Chao Song, Junlong Wu, Yiyan Qi, and Hong Cheng\. 2026\.Demystifying reinforcement learning for long\-horizon tool\-using agents: A comprehensive recipe\.*arXiv preprint arXiv:2603\.21972*\.
- Yamada et al\. \(2025\)Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha\. 2025\.The ai scientist\-v2: Workshop\-level automated scientific discovery via agentic tree search\.*arXiv preprint arXiv:2504\.08066*\.
- Yu et al\. \(2025\)Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, and 17 others\. 2025\.DAPO: An open\-source LLM reinforcement learning system at scale\.*arXiv preprint arXiv:2503\.14476*\.
- Yu et al\. \(2026\)Zhiyin Yu, Bo Zhang, Qibin Hou, Zhonghai Wu, Xiao Luo, and Lei Bai\. 2026\.Easy samples are all you need: Self\-evolving llms via data\-efficient reinforcement learning\.*arXiv preprint arXiv:2604\.18639*\.
- Zeng et al\. \(2025\)Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, Chenyang Zhao, Yulia Tsvetkov, Simon Shaolei Du, Natasha Jaques, Hao Peng, Pang Wei Koh, and Hannaneh Hajishirzi\. 2025\.Rlve: Scaling up reinforcement learning for language models with adaptive verifiable environments\.*arXiv preprint arXiv:2511\.07317*\.
- Zhang et al\. \(2025\)Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune\. 2025\.Darwin gödel machine: Open\-ended evolution of self\-improving agents\.*arXiv preprint arXiv:2505\.22954*\.
- Zheng et al\. \(2025\)Chujie Zheng, Shixuan Liu, Mingze Li, Xiong\-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin\. 2025\.Group sequence policy optimization\.*arXiv preprint arXiv:2507\.18071*\.

## Appendix ATraining Objective Details

This appendix expands the GRPO\-style training core summarized in Section[3\.6](https://arxiv.org/html/2606.03108#S3.SS6)\. The token\-level policy ratio is

$$\rho_{i,t}(\theta)=\frac{\pi_{\theta}(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})}.$$

We optimize a clipped policy objective with asymmetric Clip-Higher (Yu et al., [2025](https://arxiv.org/html/2606.03108#bib.bib32)) bounds ($\epsilon_{\ell}=0.20$, $\epsilon_{u}=0.27$) and a weak KL term to a fixed reference policy:

$$\begin{aligned}\ell_{i,t}(\theta)&=\min\!\left(\rho_{i,t}A_{i},\,\mathrm{clip}(\rho_{i,t},1-\epsilon_{\ell},1+\epsilon_{u})A_{i}\right)\\&\quad-\beta D_{\mathrm{KL}}\!\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right)_{i,t}.\end{aligned}$$

The full objective averages per-token loss within each trajectory and per-trajectory within each group:

$$\mathcal{J}(\theta)=\mathbb{E}_{q,\{o_{i}\}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_{i}|}\sum_{t=1}^{|o_{i}|}\ell_{i,t}(\theta)\right].$$
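To make the averaging order and the asymmetric clipping concrete, here is a small pure-Python sketch of the per-token term and the group-level average for a single prompt. It is not the training code: the input layout (log-probabilities and a precomputed per-token KL estimate) and the default `beta=0.001` are assumptions chosen for illustration, since the paper only describes the KL term as weak. The clip bounds follow the values stated above.

```python
import math


def token_objective(logp_new, logp_old, advantage, kl_to_ref,
                    eps_low=0.20, eps_high=0.27, beta=0.001):
    """Clipped surrogate for one token minus a KL penalty to the reference.

    beta is a placeholder value (assumption); eps_low / eps_high follow the
    Clip-Higher bounds given in the text.
    """
    ratio = math.exp(logp_new - logp_old)  # rho_{i,t}
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    surrogate = min(ratio * advantage, clipped * advantage)
    return surrogate - beta * kl_to_ref


def group_objective(group):
    """Average per token within each trajectory, then per trajectory in the group.

    `group` is a list of trajectories; each is a dict with a scalar
    "advantage" (A_i) and "tokens", a list of
    (logp_new, logp_old, kl_to_ref) tuples (assumed layout).
    """
    per_traj = []
    for traj in group:
        vals = [token_objective(ln, lo, traj["advantage"], kl)
                for ln, lo, kl in traj["tokens"]]
        per_traj.append(sum(vals) / len(vals))
    return sum(per_traj) / len(per_traj)


# Toy group of two trajectories with opposite advantages.
group = [
    {"advantage": 1.0, "tokens": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]},
    {"advantage": -1.0, "tokens": [(-1.0, 0.0, 0.0)]},
]
print(group_objective(group))
```

Because each trajectory is averaged over its own length before the group average, a long trajectory does not outweigh a short one, which matches the $1/|o_i|$ factor in the objective.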

## Appendix BAdditional Domain\-Evolution Details

This appendix provides additional evidence for the Math and Coding evolution traces summarized in Section[4\.3](https://arxiv.org/html/2606.03108#S4.SS3)\. The main paper reports only the condensed evolution logic needed for the central argument\. Here we provide representative version scores, diagnostic indicators, and supporting observations that justify the retained interventions\. Intermediate versions that were explored but not retained on the promoted path are omitted for compactness\.

### B\.1Math Evolution Trace

Table[5](https://arxiv.org/html/2606.03108#A2.T5)summarizes the retained Math trajectory\. The evolution follows four stages: identifying a response\-length bottleneck, improving reward\-side training signal quality, testing a transferred variance\-aware filter, and finally introducing a Code Interpreter to address computation\-heavy residual errors\.

Table 5: Math evolution summary. The aggregate score is the unweighted mean over AIME 2024, AIME 2025, and CNMO 2024. DGR denotes Dead Group Ratio and TIR denotes Tool Invocation Rate. Intermediate versions explored but not retained on the promoted path are omitted.

The initial Math diagnosis identifies an upstream truncation bottleneck. Approximately $18\%$ of validation responses touch the maximum generation length, and difficult competition problems are disproportionately represented among these clipped outputs. This pattern suggests that a subset of failures arises not from an incorrect reasoning strategy, but from reasoning chains that terminate before completion.

After length\-budget correction and reward\-side refinement, the retained trajectory reaches v5, where the dead\-group ratio decreases substantially relative to the earlier regime\. However, subsequent failure clustering shows that many remaining errors concentrate on computation\-heavy cases, including enumeration\-heavy problems, large\-number arithmetic, combinatorial expansion, and numerical verification\. The failure clustering motivates a shift from reward\-only intervention toward external computation support\.

The transferred StdGroupFilter in v7 further lowers DGR from approximately $0.21$ to $0.13$, indicating cleaner group-level signal. The score gain over v5 remains limited, however, suggesting that the dominant residual bottleneck is not solely low-variance optimization signal. Version v8 therefore integrates a Code Interpreter. In the retained evaluation trace, approximately $27\%$ of validation samples invoke the tool, and the largest gains concentrate on computation-intensive residual cases. This concentration justifies retaining the tool augmentation as the final intervention: while the aggregate gain over v7 is modest, no reward-side intervention had been able to close the same residual computational gap. The final v8 configuration corresponds to the main-table Math results (Table [3](https://arxiv.org/html/2606.03108#S4.T3)). The full transfer logic for StdGroupFilter is discussed in Appendix [C](https://arxiv.org/html/2606.03108#A3).

### B\.2Coding Evolution Trace

Table[6](https://arxiv.org/html/2606.03108#A2.T6)summarizes the retained Coding trajectory\. The evolution first repairs a measurement\-side format artifact, then changes the reward shape to preserve partial\-pass information, and finally reuses a SWE\-derived filtering skill to suppress residual low\-information groups\. Intermediate versions that were explored but not retained on the promoted path are omitted for compactness\.

Table 6: Coding evolution summary. Fmt0 denotes format-zero rate, MidBand denotes the fraction of samples with $0<\mathrm{pass\ ratio}<1$, DGR denotes Dead Group Ratio, and ADiv denotes algorithm-selection diversity. Intermediate versions explored but not retained on the promoted path are omitted.

For compactness, the retained trace table omits versions v1 through v7; their main findings are summarized in prose as two intermediate milestones: v3 closes the dominant format-gate false-negative channels, while v7 consolidates budget and regularization calibration before stable shaped-CR training begins at v8.

The initial Coding diagnosis reveals that part of the apparent model failure is actually a measurement\-side artifact\. Under the early format gate, many generations are assigned zero reward because output truncation prevents the required reasoning/code structure from being closed properly\. This motivates an output\-protocol repair before reward redesign\.

After protocol cleanup, the dominant bottleneck shifts toward sparse binary feedback. A substantial fraction of failing programs passes some, but not all, test cases. Binary correctness collapses these partial-pass trajectories into the same reward value as completely failing programs. Version v8 therefore replaces binary CR with shaped continuous CR based on the passed-test ratio. The resulting MidBand mass rises to approximately $0.34$, and DGR decreases to approximately $0.31$, indicating that more rollout groups now contain useful within-group reward variation.
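A toy example shows why the change of reward shape can revive a dead group. The sketch below assumes the simplest possible shaping, reward equal to the passed-test ratio; the paper says only that shaped CR is based on that ratio, so the exact functional form, the function names, and the test counts here are illustrative assumptions.

```python
from statistics import pstdev


def binary_cr(passed, total):
    """1.0 only when every test passes, else 0.0."""
    return 1.0 if total > 0 and passed == total else 0.0


def shaped_cr(passed, total):
    """Assumed shaping: reward equals the fraction of tests passed."""
    return passed / total if total > 0 else 0.0


def is_dead_group(rewards, tol=1e-6):
    """A group is 'dead' when its rewards carry no within-group variation."""
    return pstdev(rewards) <= tol


# One rollout group for a single problem: (tests passed, tests total).
group = [(0, 10), (3, 10), (7, 10), (0, 10)]
binary = [binary_cr(p, t) for p, t in group]
shaped = [shaped_cr(p, t) for p, t in group]
print(binary, is_dead_group(binary))
print(shaped, is_dead_group(shaped))
```

Under binary scoring all four rollouts receive the same reward, so group-relative advantages vanish; under the shaped reward the partial passes separate the rollouts and the group contributes gradient signal again.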

Residual zero-variance groups remain even after reward shaping. This motivates v9, which transfers the StdGroupFilter from SWE into Coding. DGR further decreases to approximately $0.21$. The final v10 configuration extends StdGroupFilter into a Dual-Level Filter, reaching 51.29 Avg@8 with DGR $\approx 0.18$ and ADiv $\approx 0.63$. The Dual-Level Filter (adopted only in Coding, where multi-test rollouts produce nontrivial rates of anomalous trajectories) combines two complementary filtering stages within a single training step. (i) A *trajectory-level* filter marks individual rollouts with anomalous status (truncation, max-length exceedance, or judge error) and excludes them from advantage computation; trajectory-level stabilization in agentic RL has been studied under StarPO-S (Wang et al., [2025](https://arxiv.org/html/2606.03108#bib.bib29)). (ii) The *group-level* StdGroupFilter (Appendix [C](https://arxiv.org/html/2606.03108#A3)) discards low-variance rollout groups via an EMA-tracked top-$p$ threshold over within-group reward standard deviation, building on the dynamic-sampling principle of DAPO (Yu et al., [2025](https://arxiv.org/html/2606.03108#bib.bib32)) and the SNR-aware variance filtering of RAGEN-2 (Wang et al., [2026](https://arxiv.org/html/2606.03108#bib.bib28)). The retained trace therefore supports the interpretation that Coding improvement comes from a sequence of measurement repair, signal revision, and reusable skill transfer rather than from a single static recipe. The full transfer analysis is provided in Appendix [C](https://arxiv.org/html/2606.03108#A3).
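The two stages compose in a fixed order: anomalous rollouts are removed first, and group variance is judged on what remains. The following sketch illustrates that ordering only. The status labels, the rollout record layout, and the function name are hypothetical, and a fixed standard-deviation floor stands in for the EMA-tracked top-$p$ threshold used by the actual group-level stage.

```python
from statistics import pstdev

# Assumed status labels for anomalous rollouts (hypothetical names).
ANOMALOUS = {"truncated", "max_length", "judge_error"}


def dual_level_filter(groups, min_std=0.01):
    """Apply a trajectory-level then a group-level filter.

    groups: list of rollout groups; each rollout is a dict with
    "reward" (float) and "status" (str). Returns the surviving groups,
    each containing only its non-anomalous rollouts.
    """
    kept = []
    for group in groups:
        # Stage (i): exclude anomalous trajectories from advantage computation.
        clean = [r for r in group if r["status"] not in ANOMALOUS]
        if len(clean) < 2:
            continue  # no within-group comparison is possible
        # Stage (ii): drop groups whose remaining rewards barely vary.
        # A fixed floor is used here as a simplification (assumption).
        if pstdev([r["reward"] for r in clean]) < min_std:
            continue
        kept.append(clean)
    return kept


groups = [
    [{"reward": 0.0, "status": "ok"}, {"reward": 1.0, "status": "ok"},
     {"reward": 1.0, "status": "truncated"}],
    [{"reward": 0.0, "status": "ok"}, {"reward": 1.0, "status": "judge_error"},
     {"reward": 0.0, "status": "ok"}],
]
print(len(dual_level_filter(groups)))
```

The second toy group shows why the order matters: its only reward variation comes from a judge error, so once that rollout is excluded the group is correctly treated as uninformative.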

### B\.3Format\-Gate Repair in Coding

The Coding format-gate repair is a concrete example of harness evolution uncovering a measurement artifact before model-side intervention. In the affected evaluation trace, validation checkpoints at steps 0, 25, and 375 together contain 700 sampled responses. The response-token distribution is heavily concentrated at the former generation limit: the p50, p90, and p99 token lengths all reach 20,480 tokens. Moreover, $62.1\%$ of samples contain at least 18,000 tokens, and $56.1\%$ fail to close the reasoning block with `</think>`.

Under a stricter format-fail criterion that requires both missing `</think>` and missing closed code block, 352 samples qualify; all of them satisfy $n_{\mathrm{tok}}\geq 18{,}000$. This concentration shows that the zero-reward format failures are not randomly distributed semantic mistakes; they are tightly coupled to truncation under the previous output budget. The subsequent repair increases `max_new_tokens` from 20,480 to at least 32,768, allowing the format gate to operate on more complete outputs. This cleanup precedes reward redesign and prevents later training decisions from being driven by an avoidable measurement artifact.
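The strict criterion and its coupling to length can be checked mechanically. The sketch below is a plausible reading of that diagnostic rather than the harness code: it assumes responses are plain strings using Markdown-style triple-backtick fences for code, and the helper names and the `(text, n_tokens)` sample layout are invented for the example. The 18,000-token floor comes from the text.

```python
FENCE = "`" * 3  # Markdown code fence delimiter (assumed output protocol)


def has_closed_think(text):
    """True when the reasoning block is terminated."""
    return "</think>" in text


def has_closed_code_block(text):
    """True when fences pair up: an even, nonzero number of delimiters."""
    n = text.count(FENCE)
    return n >= 2 and n % 2 == 0


def strict_format_fail(text):
    """Strict criterion: reasoning block AND code block are both unclosed."""
    return not has_closed_think(text) and not has_closed_code_block(text)


def truncation_coupling(samples, token_floor=18_000):
    """samples: list of (text, n_tokens).

    Returns (number of strict format fails,
             share of those fails with n_tokens >= token_floor).
    """
    fails = [(t, n) for t, n in samples if strict_format_fail(t)]
    if not fails:
        return 0, 0.0
    long_fails = sum(1 for _, n in fails if n >= token_floor)
    return len(fails), long_fails / len(fails)


samples = [
    ("<think>still reasoning when the budget ran out", 20480),
    ("<think>done</think>" + FENCE + "python\nprint(1)\n" + FENCE, 900),
]
print(truncation_coupling(samples))
```

A share close to 1.0, as reported above where all 352 strict failures exceed the floor, indicates that the failures track the output budget rather than the content of the solutions.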

### B\.4Diagnostic Metric Glossary

Table[7](https://arxiv.org/html/2606.03108#A2.T7)summarizes the diagnostic indicators that appear in the Math and Coding auxiliary analyses\.

Table 7: Glossary of diagnostic indicators used in the Math and Coding auxiliary analyses.

## Appendix CCross\-Domain Reuse of StdGroupFilter

The StdGroupFilter provides a concrete example of reusable skill accumulation in EvoTrainer, as introduced in Section[3\.4](https://arxiv.org/html/2606.03108#S3.SS4)and applied across domains in Section[4\.3](https://arxiv.org/html/2606.03108#S4.SS3)\. Rather than storing only natural\-language observations, the trainer preserves validated operational mechanisms that can be retrieved and adapted in later domains\. StdGroupFilter originates in SWE, where the trainer diagnoses low\-information rollout groups with insufficient reward variance\. Because the mechanism operates on group\-level statistics rather than SWE\-specific reward semantics, it later becomes reusable in both Math and Coding\.

### C\.1Origin in SWE

In the SWE analysis pipeline, the trainer identifies a recurring failure mode: rollout groups with insufficient reward variance contribute little useful relative-learning signal while still consuming training compute. Group-level filtering by reward variance has been studied under Dynamic Sampling in DAPO (Yu et al., [2025](https://arxiv.org/html/2606.03108#bib.bib32)) and SNR-Aware Filtering in RAGEN-2 (Wang et al., [2026](https://arxiv.org/html/2606.03108#bib.bib28)); building on these ideas, the trainer instantiates a filtering utility that tracks the reward-standard-deviation distribution via an EMA-tracked top-$p$ threshold, retaining informative groups more flexibly than a rigid fixed-threshold rule.

The retained configuration uses:

- `group_filter_mode: top_p`
- `group_keep_ratio: 0.9`
- `group_min_keep_ratio: 0.5`
- `group_ema_decay: 0.1`
- `group_min_std_threshold: 0.01`
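The paper lists these settings without spelling out how they interact, so the sketch below is one plausible interpretation and should not be read as the retained implementation. It assumes that `group_keep_ratio` sets a per-batch top-$p$ cut on within-group reward standard deviation, that `group_ema_decay` is the weight given to the newest batch when smoothing that cut, that `group_min_std_threshold` is a hard floor, and that `group_min_keep_ratio` guarantees a minimum share of groups survives. The class name is hypothetical.

```python
import math
from statistics import pstdev


class StdGroupFilterSketch:
    """Illustrative EMA-tracked top-p filter over group reward std (assumed semantics)."""

    def __init__(self, keep_ratio=0.9, min_keep_ratio=0.5,
                 ema_decay=0.1, min_std_threshold=0.01):
        self.keep_ratio = keep_ratio
        self.min_keep_ratio = min_keep_ratio
        self.ema_decay = ema_decay
        self.min_std_threshold = min_std_threshold
        self.ema_threshold = None  # initialized from the first batch

    def select(self, groups):
        """groups: list of per-group reward lists. Returns kept group indices."""
        stds = [pstdev(g) for g in groups]
        order = sorted(range(len(groups)), key=lambda i: stds[i], reverse=True)
        # Batch-level cut: std of the last group inside the top keep_ratio share.
        n_top = max(1, math.floor(self.keep_ratio * len(groups)))
        batch_thr = stds[order[n_top - 1]]
        # Assumed EMA form: new = (1 - decay) * old + decay * batch value.
        if self.ema_threshold is None:
            self.ema_threshold = batch_thr
        else:
            self.ema_threshold = ((1 - self.ema_decay) * self.ema_threshold
                                  + self.ema_decay * batch_thr)
        thr = max(self.ema_threshold, self.min_std_threshold)
        kept = [i for i in order if stds[i] >= thr]
        # Assumed safeguard: never keep fewer than min_keep_ratio of the batch.
        n_min = math.ceil(self.min_keep_ratio * len(groups))
        if len(kept) < n_min:
            kept = order[:n_min]
        return sorted(kept)


flt = StdGroupFilterSketch()
batch = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 1, 1, 1], [0, 0.5, 1, 0.25]]
print(flt.select(batch))
```

Because the filter sees only lists of rewards, nothing in it depends on how those rewards were produced, which is the property the cross-domain transfer relies on.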

The filter operates on group\-level reward statistics rather than on SWE\-specific components such as CR, SBE, ETT, or IF\. This abstraction is central to its later cross\-domain transfer\. Table[8](https://arxiv.org/html/2606.03108#A3.T8)summarizes these cross\-domain transfers\.

Table 8: Cross-domain reuse of StdGroupFilter. The same group-level filtering skill is developed in SWE, later retrieved in Math and Coding, and ultimately retained as part of the final Coding configuration.

### C.2 Transfer to Math

In Math, the trainer recognizes that residual zero\-variance groups after reward\-side improvements are structurally similar to the earlier SWE diagnosis\. This transfer is appropriate because the filter depends on within\-group reward dispersion rather than on task\-specific reward semantics\. The corresponding version\-level changes are reported in Table[5](https://arxiv.org/html/2606.03108#A2.T5)\.

### C\.3Transfer to Coding

The Coding trace later exhibits a related optimization pathology\. Even after shaped continuous CR recovers partial\-pass signal, many sampled groups remain effectively low\-information: some groups fail uniformly, some pass the same fixed subset of easy tests, and others are already uniformly solved\. These cases motivate a group\-level filter that can complement, rather than replace, component\-level reward shaping\. The trainer retrieves the previously validated StdGroupFilter from the reusable skill library and adapts it to the Coding pipeline\. The corresponding version\-level context is reported in Table[6](https://arxiv.org/html/2606.03108#A2.T6)\.

## Appendix DAdditional SWE Framework and Trajectory Diagnostics

This appendix provides supplementary evidence for Section[4\.4](https://arxiv.org/html/2606.03108#S4.SS4)\. We first expand the score\-dominant SWE\-9B path discussed in the main text, then report the Git\-leak audit, the IF\-Judge dead\-group backtest, and additional behavior\-level degeneration cases that motivate richer harness diagnostics\.

### D\.1Score\-Dominant SWE\-9B Path versus Full EvoTrainer

Table[9](https://arxiv.org/html/2606.03108#A4.T9)expands the score\-dominant SWE\-9B path summarized in the main paper\. The early path relies on scalar BC% comparison and published\-style recipe adaptation\. It improves over the no\-RL baseline but saturates before reaching the stronger EvoTrainer regimes\.

Table 9: Detailed SWE-9B path underlying the framework-level contrast in the main text. The score-dominant early path saturates at 33.33 BC%, whereas the full EvoTrainer trajectory reaches 38.16 BC%.

The training core was adjusted in two early steps: GSPO at v1 (flat correctness reward, +0.85 BC% over no-RL) was switched to GRPO at v2, with Clip-Higher (Yu et al., [2025](https://arxiv.org/html/2606.03108#bib.bib32)) added from v3. We do not present the v1 $\rightarrow$ v2 step as a controlled algorithm comparison since RAGEN-style filtering was added in the same iteration. The substantive intervention occurs at v3 $\rightarrow$ v4: by v3, three reward and filtering refinements had saturated at 33.33 BC%, prompting a coupled reward redesign that combines procedural rewards (SBE, ETT), binary correctness, and EMA-based group filtering on top of GRPO. After v4 reached 36.30 BC%, the trainer explored more aggressive variants on top of this branch at v5–v7, each of which regressed below v4 on validation; the trainer therefore reverted to the v4 branch as the baseline for further iteration. v8 then extends v4 with an instruction-following judge, after dead-group analysis and reward backtesting indicate that the existing reward structure leaves substantial training variance unused.

### D.2 Harness Audit Prevents False Promotion under Git Leakage

This audit case illustrates why scalar validation scores alone are insufficient for autonomous training decisions. In an early SWE-9B v1 environment, the model could access Git history commands such as `git show` and `git log`. This unintentionally exposed reference-patch information and produced an apparent validation score of 48.80 BC%. A score-only training loop would have treated this branch as a decisive improvement and promoted it as the best version.

The evolving SWE harness identifies the anomaly through behavior-level inspection of tool usage and repository interactions. In particular, the audit detects anomalous Git-command usage together with a late jump in response length. After the environment is sanitized by removing access to the leaked Git history, the legitimate v1 performance is 31.04 BC%, which is the value used throughout the main experiments and the score reported in the retained SWE evolution path. Table [10](https://arxiv.org/html/2606.03108#A4.T10) reports the contaminated and sanitized evaluation results.

Table 10: Git-leak audit for SWE-9B v1. The contaminated score would falsely dominate the retained trajectory under scalar-score-only selection, while harness inspection prevents invalid promotion.

This case is central to our distinction between score-driven iteration and trainer-level diagnosis. The harness changes the branch-selection decision itself by rejecting an apparently high-scoring but invalid trajectory.
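The tool-usage side of such an audit can be illustrated with a minimal sketch. It assumes each trajectory is available as a list of shell command strings; the function names, the trajectory format, and the regular expression are hypothetical, and only `git show` and `git log` are taken from the text above. It does not reproduce the paper's actual harness, which also inspects response length.

```python
import re
from typing import Iterable

# Assumption: only the two history commands named in the text are matched.
HISTORY_COMMANDS = re.compile(r"\bgit\s+(show|log)\b")


def count_history_commands(commands: Iterable[str]) -> int:
    """Count shell commands in one trajectory that read Git history."""
    return sum(1 for cmd in commands if HISTORY_COMMANDS.search(cmd))


def flag_suspicious(trajectories: dict[str, list[str]]) -> list[str]:
    """Return the ids of trajectories that issued at least one history command."""
    return sorted(
        tid for tid, cmds in trajectories.items()
        if count_history_commands(cmds) > 0
    )


# Toy trajectories (illustrative only, not data from the paper).
trajectories = {
    "traj-a": ["ls", "git log --oneline -5", "python -m pytest"],
    "traj-b": ["grep -rn 'def parse' src/", "python -m pytest"],
    "traj-c": ["git status", "git show HEAD~1"],
}
flagged = flag_suspicious(trajectories)
print(flagged)
```

A check of this kind only surfaces candidates for inspection; deciding that a branch is contaminated still requires the kind of behavior-level review described above.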

### D.3 Dead-Group Rescue with the IF LLM Judge

Version v4 already reaches 36\.30 BC%, but trainer\-side diagnostics show that the dead\-group ratio remains high\. Approximately half of the rollout groups still produce no useful within\-group reward variation, which limits the effectiveness of group\-relative optimization\.

To test whether an additional reward dimension can recover useful variance, the trainer introduces an instruction-following signal and performs retroactive reward backtesting on historical rollouts from the promoted SWE-9B branch. These rollouts are re-scored after augmenting the original correctness signal with a 0.1-weighted instruction-following term, and within-group reward variance is then recomputed. Among groups that are dead under the original correctness signal, 45% regain non-zero variance after adding the instruction-following term. Table [11](https://arxiv.org/html/2606.03108#A4.T11) summarizes the resulting dead-group reduction.

Table 11: Dead-group reduction across selected SWE-9B versions. The IF LLM Judge is introduced after backtesting indicates that it restores useful variance in otherwise low-information groups.

This diagnostic chain clarifies why v8 is more than a generic additive reward term. Its introduction follows a mechanism-level analysis of residual dead groups and is validated through offline reward backtesting before becoming part of the retained training recipe.
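The backtest described above can be sketched in a few lines. This is a minimal illustration, assuming each historical rollout is stored as a (correctness, instruction-following score) pair and that a group counts as dead when its reward variance is zero; the data layout and function names are hypothetical, and only the 0.1 weight comes from the text.

```python
from statistics import pvariance

IF_WEIGHT = 0.1  # weight of the instruction-following term, as stated in the text


def is_dead(rewards, tol=1e-12):
    """A group is dead when its rewards show no within-group variation."""
    return pvariance(rewards) <= tol


def backtest_rescue_rate(groups):
    """Fraction of originally dead groups that regain variance after re-scoring.

    groups: list of rollout groups; each group is a list of
            (correctness, if_score) pairs (assumed layout).
    """
    dead = [g for g in groups if is_dead([c for c, _ in g])]
    if not dead:
        return 0.0
    rescued = sum(
        1 for g in dead
        if not is_dead([c + IF_WEIGHT * s for c, s in g])
    )
    return rescued / len(dead)


# Toy groups (illustrative only, not data from the paper).
groups = [
    [(0, 0.2), (0, 0.8), (0, 0.5), (0, 0.5)],  # dead, rescued by the IF term
    [(0, 0.5), (0, 0.5), (0, 0.5), (0, 0.5)],  # dead, stays dead
    [(1, 0.9), (0, 0.1), (1, 0.7), (0, 0.3)],  # already informative
]
rate = backtest_rescue_rate(groups)
print(rate)
```

Because the re-scoring runs on stored rollouts, a candidate reward term can be assessed this way without launching a new training run.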

### D.4 Echo Trap in SWE-4B

The preceding subsections focus on SWE-9B. We next report complementary SWE-4B analyses that expose additional behavior-level failure modes observed during training. The Echo Trap refers to a trajectory-degeneration pattern in which turns become excessively long and repetitive while validation score may still appear temporarily competitive. Table [12](https://arxiv.org/html/2606.03108#A4.T12) summarizes a representative SWE-4B branch.

Table 12: Representative SWE-4B Echo Trap degeneration. VLong>70 denotes the fraction of trajectories exceeding 70 turns. The validation score rises between steps 75 and 125, but behavior-level diagnostics show severe trajectory elongation and a sharp increase in filtered degenerate rollouts. Step-75 behavior-level breakdowns were not yet standardized in the original trace; BC% and average turns are available at that checkpoint, while the remaining indicators were retained only in aggregate form.

Table [12](https://arxiv.org/html/2606.03108#A4.T12) makes the contrast explicit: BC% rises from 29.71 to 33.28, while average turn count more than doubles (36–37 to 76.3) and the filtered degenerate-trajectory count rises eightfold (23 to 194). Headline validation improvement alone is insufficient to certify a branch as healthy.

The associated behavior\-level filter distinguishes three degeneration modes:

1. Abandonment: trajectories with at most 3 turns and no edit action;
2. Idle failure: trajectories with no edit action and zero correctness reward;
3. Echo Trap: trajectories exceeding 60 turns while still receiving zero correctness reward.

The filtered count grows from 23 in the early regime to 194 in the collapse regime, consistent with a major shift toward degenerate behavior rather than ordinary score variance. After rejecting this echo-trapped branch through behavior-level audit, the retained SWE-4B configuration reported in Table [3](https://arxiv.org/html/2606.03108#S4.T3) comes from an alternative non-collapsing branch.
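The three degeneration modes listed above can be sketched as a simple classifier. The thresholds (3 turns, 60 turns, zero correctness reward) come from the list; the `Trajectory` record, its field names, and the precedence used when a trajectory matches more than one mode are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trajectory:
    # Hypothetical record; field names are not taken from the paper.
    num_turns: int
    has_edit: bool
    correctness: float


def classify(t: Trajectory) -> Optional[str]:
    """Return the degeneration mode of a trajectory, or None if it is kept.

    Assumption: modes are checked in list order, so a trajectory that fits
    several definitions is reported under the first one that matches.
    """
    if t.num_turns <= 3 and not t.has_edit:
        return "abandonment"
    if not t.has_edit and t.correctness == 0:
        return "idle_failure"
    if t.num_turns > 60 and t.correctness == 0:
        return "echo_trap"
    return None


def filtered_count(trajectories) -> int:
    """Number of trajectories removed by the behavior-level filter."""
    return sum(1 for t in trajectories if classify(t) is not None)


# Toy batch (illustrative only).
batch = [
    Trajectory(num_turns=2, has_edit=False, correctness=0.0),
    Trajectory(num_turns=20, has_edit=False, correctness=0.0),
    Trajectory(num_turns=75, has_edit=True, correctness=0.0),
    Trajectory(num_turns=30, has_edit=True, correctness=1.0),
]
labels = [classify(t) for t in batch]
print(labels, filtered_count(batch))
```

Tracking `filtered_count` per checkpoint gives the kind of behavior-level signal that rose from 23 to 194 in the branch above while BC% was still improving.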

### D.5 Multiplicative Efficiency Factor Collapse (SWE-4B Branches)

All version labels \(v9, v10, v11\) in this subsection refer to SWE\-4B branches investigated after the Echo Trap diagnosis\. Following that diagnosis, a three\-way branch test investigates how the system should respond after long\-trajectory degeneration is detected\. The multiplicative efficiency factor takes the form

$$\mathrm{efficiency} = 1.0 - 0.3 \times \frac{\mathrm{num\_turns}}{\mathrm{max\_turns}},$$

with $\mathrm{efficiency} \in [0.7, 1.0]$. In v9, this factor multiplies a staircase correctness reward; in v11, it multiplies continuous CR. Version v10 serves as a control branch with continuous CR but without the multiplicative efficiency factor.
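As a worked illustration of the factor defined above, the sketch below computes it for a few turn counts and applies it multiplicatively to a correctness value. The function names and the `max_turns = 100` budget are assumptions chosen for the example; the formula and its $[0.7, 1.0]$ range are the ones stated in the text.

```python
def efficiency(num_turns: int, max_turns: int) -> float:
    """Multiplicative efficiency factor: 1.0 - 0.3 * num_turns / max_turns."""
    if max_turns <= 0:
        raise ValueError("max_turns must be positive")
    # Clamp the ratio so the factor stays inside [0.7, 1.0].
    ratio = min(max(num_turns / max_turns, 0.0), 1.0)
    return 1.0 - 0.3 * ratio


def shaped_reward(correctness: float, num_turns: int, max_turns: int) -> float:
    """Correctness reward scaled by the efficiency factor."""
    return correctness * efficiency(num_turns, max_turns)


MAX_TURNS = 100  # hypothetical turn budget for the example
for turns in (1, 50, 100):
    print(turns, round(efficiency(turns, MAX_TURNS), 3))

# For a fixed non-zero correctness, fewer turns always scores higher.
short = shaped_reward(1.0, 1, MAX_TURNS)
long = shaped_reward(1.0, 80, MAX_TURNS)
print(short > long)
```

The factor never changes a zero correctness reward, so it can only separate trajectories that already score above zero, and among those it always favors the shorter one.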

#### v9: Efficiency Only\.

Table [13](https://arxiv.org/html/2606.03108#A4.T13) shows that the branch initially reduces average turn count, but then collapses into near-trivial one-turn behavior. BC% falls to approximately zero, dead groups approach 100%, and diversity disappears.

Table 13: Collapse under multiplicative efficiency shaping in SWE v9.

#### v11: Continuous CR plus Efficiency.

Table [14](https://arxiv.org/html/2606.03108#A4.T14) shows the same failure pattern even when continuous CR replaces the staircase reward. The branch again converges toward one-turn behavior and near-total dead-group saturation.

Table 14: Collapse under continuous CR combined with multiplicative efficiency shaping in SWE v11.

#### v10: Continuous CR without Efficiency.

Table [15](https://arxiv.org/html/2606.03108#A4.T15) provides the counterfactual branch. Continuous CR without the multiplicative efficiency term remains viable over hundreds of steps and does not collapse into one-turn behavior.

Table 15: Control branch with continuous CR but without the multiplicative efficiency factor.

The v9/v11 failures, together with the v10 control, show that a seemingly modest reward multiplier can produce catastrophic behavioral incentives in agentic RL. EvoTrainer retains these branches as reusable negative evidence rather than treating them as disposable failed runs.

## Appendix E Search-Budget and Compute Accounting for Human Baselines

This appendix documents the fairness conditions and search-budget accounting for the human-engineered RL references used in the main results. While the comparison is not perfectly compute-matched, it is transparent enough to show that the EvoTrainer gains are not explained by privileged access to data, infrastructure, or weaker human baselines.

### E.1 Shared Experimental Conditions

The human and EvoTrainer conditions share the following properties:

- Both begin from the same project codebase and official training scripts.
- Both use the same task assets, data splits, model families, hardware environment, and evaluation protocol.
- All reported numbers use the same validation setup and the same seed convention.
- The paper reports the strongest human-engineered RL configuration observed under this shared stack, rather than an average or first attempt.

Under these conditions, the main difference is the decision\-making process used to select subsequent interventions: human RL engineering versus the trainer\-agent workflow in EvoTrainer\.

### E.2 SWE Search Budget

Table [16](https://arxiv.org/html/2606.03108#A5.T16) summarizes the approximate search-budget comparison for SWE. The human condition retains more total versions, runs more total training steps, and uses more total GPU-hours, while EvoTrainer achieves stronger final scores on both SWE-4B and SWE-9B.

Table 16: Approximate SWE search-budget comparison between EvoTrainer and the human-engineered RL line. EvoTrainer retains SWE-9B and SWE-4B versions on independent trajectories, while the human-engineered RL line tracks one combined version-count estimate across both model sizes.

The comparison indicates that EvoTrainer does not obtain its SWE gains by simply consuming a larger raw training budget. The human line uses more total steps and more total GPU-hours, yet EvoTrainer achieves stronger retained models, particularly in SWE-9B.

### E.3 Math and Coding Budget Alignment

Math and Coding are compared under aligned version-level budgets. In Math, EvoTrainer and the human-engineered RL line each retain 8 major versions on the promoted path. In Coding, each retains 10 major versions. Because the retained version counts and training setup are aligned within each domain, total training time is also broadly comparable. Table [17](https://arxiv.org/html/2606.03108#A5.T17) summarizes the version-level alignment.

Table 17: Retained version-level budget alignment for Math and Coding.

Under these aligned iteration budgets, EvoTrainer reaches 84.17 / 73.33 / 81.94 on AIME 2024 / AIME 2025 / CNMO 2024 (vs. 80.83 / 71.67 / 77.78 for the human reference) and 51.29 Coding Avg@8 (vs. 50.71).

These comparisons support the interpretation that the advantage of EvoTrainer is not restricted to a single unusually favorable search budget, but appears under both lower\-budget SWE search and aligned\-budget single\-turn domains\.

### E.4 Trainer-Agent Inference Usage

In addition to RL training compute, the full EvoTrainer development workflow consumed approximately $4.0 \times 10^{8}$ trainer-agent tokens for diagnosis, evidence retrieval, backtesting, and intervention planning. We report this separately from GPU-hour accounting because it reflects LLM-mediated analysis and planning rather than policy-training compute.

## Appendix F Head-to-Head Significance Analysis

This appendix expands the significance numbers cited in Section [4.2](https://arxiv.org/html/2606.03108#S4.SS2). We report three families of paired statistical comparisons: head-to-head against the human-engineered RL best (the headline comparison), head-to-head against the no-RL base model (to confirm that improvements over the seed are not domain-specific artefacts), and within-trajectory comparisons grounding the component-level counterfactual evidence in Section [4.4](https://arxiv.org/html/2606.03108#S4.SS4).

#### Protocol\.

Each per-task observation is the per-prompt Avg@8 score over 8 independent rollouts; observations therefore live on the discrete grid $\{0/8,\,1/8,\,\ldots,\,8/8\}$. For two methods $A$ and $B$ we compute the per-task paired difference $d_i = s^{A}_{i} - s^{B}_{i}$ and report (i) the mean difference $\Delta = \tfrac{1}{n}\sum_{i} d_i$, (ii) a 95% confidence interval from a paired bootstrap with $B = 10{,}000$ resamples drawn at the task level (Agarwal et al., [2021](https://arxiv.org/html/2606.03108#bib.bib1); Colas et al., [2018](https://arxiv.org/html/2606.03108#bib.bib8)), and (iii) a two-sided Wilcoxon signed-rank $p$-value (Henderson et al., [2018](https://arxiv.org/html/2606.03108#bib.bib10)). For Math we use a single stratified bootstrap that draws within each of the three sub-benchmarks (AIME 2024, AIME 2025, CNMO 2024) and aggregates the per-task differences across all $n = 78$ problems. The Math aggregate scores reported in Table [18](https://arxiv.org/html/2606.03108#A6.T18) are therefore problem-weighted means over the 78 paired observations and may differ by a small amount from the unweighted sub-benchmark means reported in Appendix [B](https://arxiv.org/html/2606.03108#A2). This protocol substitutes evaluation-time stochasticity for the training-seed stochasticity that is infeasible to repeat at our compute scale, following the dominant LLM-RL reporting practice (Guo et al., [2025](https://arxiv.org/html/2606.03108#bib.bib9); Yu et al., [2025](https://arxiv.org/html/2606.03108#bib.bib32)).
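Steps (i) and (ii) of the protocol can be sketched with the standard library alone. The sketch assumes a percentile interval, which the text does not specify, and it omits the stratified variant used for Math; the function name, the seed, and the toy scores are illustrative and are not results from the paper.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean paired difference and a percentile bootstrap CI over tasks.

    Assumption: a percentile interval; the paper does not state the
    interval type it uses.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired per task")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample tasks with replacement and record the mean difference each time.
    means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Toy per-task Avg@8 scores on the k/8 grid (illustrative only).
method_a = [5 / 8, 8 / 8, 3 / 8, 6 / 8, 2 / 8, 7 / 8]
method_b = [4 / 8, 7 / 8, 3 / 8, 4 / 8, 2 / 8, 6 / 8]
delta, ci_low, ci_high = paired_bootstrap_ci(method_a, method_b)
print(delta, ci_low, ci_high)
```

For step (iii), one option is `scipy.stats.wilcoxon` applied to the same paired scores, whose default alternative is two-sided; whether the authors used that implementation is not stated.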

Table 18: EvoTrainer vs. Human-engineered RL. Per-task paired bootstrap ($B = 10{,}000$) and two-sided Wilcoxon signed-rank tests on per-task Avg@8. The Math row aggregates AIME 2024 ($n = 30$), AIME 2025 ($n = 30$) and CNMO 2024 ($n = 18$) under a stratified bootstrap.

Table 19: EvoTrainer vs. no-RL base model. Same protocol as Table [18](https://arxiv.org/html/2606.03108#A6.T18).

#### Findings.

The headline gain on SWE-9B and the aggregate Math improvement (Table [18](https://arxiv.org/html/2606.03108#A6.T18)) are both statistically significant at the $p < 0.001$ level. The smaller margins on SWE-4B and Coding sit within the bootstrap confidence interval and are not statistically distinguishable from the human-engineered RL reference; we therefore frame these settings as EvoTrainer matching the human reference rather than exceeding it. The EvoTrainer-vs-Base comparison (Table [19](https://arxiv.org/html/2606.03108#A6.T19)) shows that EvoTrainer delivers significant improvements over the no-RL starting point in every domain, including the two settings where the gap to the human-engineered RL reference is statistically inconclusive. This rules out the alternative reading that the autonomous training loop fails to add value in those domains: in SWE-4B and Coding, EvoTrainer reproduces the gains achieved by hand-engineered RL without expert manual tuning, rather than failing to improve over the base.

#### Within\-Trajectory Paired Comparisons\.

For the component-level counterfactual evidence in Section [4.4](https://arxiv.org/html/2606.03108#S4.SS4), Table [20](https://arxiv.org/html/2606.03108#A6.T20) reports within-trajectory paired statistics under the same per-task paired bootstrap and Wilcoxon protocol as Table [18](https://arxiv.org/html/2606.03108#A6.T18).

Table 20: Within-trajectory paired comparisons referenced in Section [4.4](https://arxiv.org/html/2606.03108#S4.SS4).