Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier
Summary
Introduces PROPEL, a solver-amortized framework that trains a lightweight activation probe to predict solver pass rates, enabling efficient training of task generators for RL without costly solver rollouts. The method improves generation at the learnable frontier across math, code, and software-engineering tasks.
# Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier
Source: [https://arxiv.org/html/2606.18284](https://arxiv.org/html/2606.18284)
1 Vmax, 2 Goodfire AI
Connor Watts, Roger Creus Castanyer, Geoffrey Bradway, Maxwill Lin, Augustine N. Mavor-Parker, Matthew Daborn-Sargent
Correspondence: {lorenz, augustine, matthew}@vmax.ai
\(June 10, 2026\)
###### Abstract
The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. As reasoning and agentic models improve, fixed task distributions saturate, while naive synthetic generation yields tasks that are trivial, impossible, or ill-posed. Training a task generator with RL to optimize validity and learnability can address this bottleneck, but direct optimization requires repeated solver rollouts per candidate. For software-engineering (SWE) tasks, a single rollout can take tens of minutes; solver-in-the-loop generator training is intractable. We introduce PROPEL, a solver-amortized framework for training task generators at the targeted solve rate. PROPEL trains a lightweight activation probe on a one-time labeled corpus of generated tasks and solver outcomes. The probe predicts target-solver pass rate from a frozen generator reference model and serves as a proxy for solve rate during generator optimization, reducing generator evaluation to a single forward pass. Across math, code, and software-engineering at multiple model scales, PROPEL shifts generation toward the targeted solve rate: for coding, tasks generated at the learnable frontier increase from $10.1\% \rightarrow 20.0\%$ for a Qwen2.5-3B-Instruct solver and from $5.3\% \rightarrow 12.6\%$ for a Qwen2.5-7B-Instruct solver. For SWE, PROPEL increases the share of generations at the targeted solve rate from $9.8\%$ to $19.6\%$ for Qwen3.5-27B on repositories not seen during training of probe and generator.
## 1 Introduction

Figure 1: Pipeline overview: (1) A base generator produces a one-time pool of tasks that are labeled with an expensive solver. (2) A probe is trained to predict those difficulty labels from the generator's hidden states. (3) During RL, the generator proposes tasks; a frozen reference model produces activations and the trained probe converts them into reward, so the solver is never invoked inside the inner loop. (4) The trained generator is finally evaluated against a held-out solver to confirm that probe-driven shaping translates into true difficulty gains.

Reinforcement learning on verifiable rewards (RLVR) has become the dominant recipe for eliciting reasoning and agentic behavior from language models (Guo et al., [2025a](https://arxiv.org/html/2606.18284#bib.bib183); Lambert, [2025](https://arxiv.org/html/2606.18284#bib.bib130); Liu et al., [2025b](https://arxiv.org/html/2606.18284#bib.bib128)). Progress under this recipe is gated by the supply of training tasks. As policies improve, fixed task distributions saturate, and further gains require harder tasks that remain discriminative at the current capability frontier. Hand-curated benchmarks cannot keep pace, and naive synthetic generation tends to produce tasks that are either trivially solvable or ill-posed. A natural alternative is to train a *generator* model with RL, rewarding it for producing tasks that are well-formed and appropriately difficult for a target solver (Zhao et al., [2025](https://arxiv.org/html/2606.18284#bib.bib63); Wei et al., [2025](https://arxiv.org/html/2606.18284#bib.bib30)). The implied objective is *discriminability*, tasks on which the solver is challenged but does not completely fail (Wei et al., [2025](https://arxiv.org/html/2606.18284#bib.bib30)).
However, evaluating a candidate task requires running the solver, and in agentic settings this can be prohibitively expensive. On SWE-bench-style tasks (Jimenez et al., [2024](https://arxiv.org/html/2606.18284#bib.bib50); Yang et al., [2026](https://arxiv.org/html/2606.18284#bib.bib34)) a single rollout can take tens of minutes since it involves repository navigation, tool calls, and test execution. A reliable difficulty signal needs many such rollouts per candidate to estimate solve rate. Embedding this loop inside generator RL makes training on meaningful distributions infeasible. The same bottleneck appears, more mildly, in competitive math and code generation, where solver trials are cheaper but still costly and variance remains high. Standard RLVR pipelines do not scale to objectives whose verifier is itself an expensive stochastic agent; synthetic task generation scales doubly poorly because solve rate, complexity, and cost-to-solve are all unfavorable.
We introduce PROPEL (*Probe Rewards for Optimizing Problems at the Edge of Learning*), a solver-amortized framework for training task generators. PROPEL builds on Reinforcement Learning from Feature Rewards (RLFR; Prasad et al., [2026](https://arxiv.org/html/2606.18284#bib.bib61)), which uses interpretability features from hidden states to supervise open-ended generation, and adapts that recipe to task generation with two changes that the agentic setting demands: a multi-step trajectory formulation for software-engineering tasks, and explicit treatment of fixed-probe mode collapse. A small probe is trained once on a one-time labeled corpus of (task, solver-outcome) pairs read out from a frozen reference generator's activations; during RL it replaces live solver rollouts as the reward, collapsing per-step cost from many solver trials to a single forward pass (see Figure [1](https://arxiv.org/html/2606.18284#S1.F1)). The construction exploits a well-documented property of language models: quantities of interest are often represented internally even when the model cannot act on them reliably at generation time (Orgad et al., [2024](https://arxiv.org/html/2606.18284#bib.bib103); Zhang et al., [2025a](https://arxiv.org/html/2606.18284#bib.bib69)). Provided that a candidate task's well-formedness, solvability, and difficulty calibration are decodable from generator hidden states, the probe gives a dense, near-free signal that stands in for the true objective long before any solver rollout would confirm it.
We show that PROPEL breaks the solver bottleneck in generator training. Training a generator against activation probes rather than live solver trials yields tasks that are harder and more discriminative for the target solver while requiring less than half of the solver trials. PROPEL approximately doubles the rate at which the generator produces learnable-frontier tasks across math, code induction, and software engineering tasks and across solver model sizes (e.g., for code induction $10.1\% \to 20.0\%$, $+98\%$ relative, targeting a Qwen2.5-3B-Instruct solver; $5.3\% \to 12.6\%$, $+138\%$ on Qwen2.5-7B-Instruct; see Figure [2](https://arxiv.org/html/2606.18284#S1.F2)). To mitigate the diversity loss that arises when optimizing against a single fixed probe, we apply worst-case optimization (WCO) and adversarial co-evolution of the probe. Our contributions are as follows.
Figure 2: PROPEL significantly outperforms the base generator in the utility of generated tasks. Utility is measured with the Qwen2.5-7B-Instruct solver for AZR and math, and with the Qwen3.5-27B solver on held-out OOD repositories for SWE. On math, utility is reported on the post-oracle scored tasks. Error bars are $\pm 1$ standard error across RL seeds (single seed for SWE).
PROPEL: feature rewards for task generation, including multi-step settings. We utilize RLFR probe rewards in a solver-amortized generator-RL pipeline that replaces solver-in-the-loop verification with an activation probe, collapsing per-step reward cost from $k$ solver rollouts to a single forward pass. PROPEL makes generator RL tractable in regimes such as agentic SWE, where solver-in-the-loop training is not.

Characterizing mode collapse and mitigating it with worst-case optimization. We observe mode collapse to a semantic topic under fixed-probe optimization and show that worst-case optimization can mitigate it while maintaining a $+86\%$ relative frontier-rate gain over base. We additionally investigate regularization and adversarial probe co-evolution.

Empirical gains across math, code-induction, and SWE at multiple model scales. On code induction PROPEL approximately doubles the rate at which the generator produces learnable-frontier tasks for the solver ($10.1\% \to 20.0\%$ on 3B, $+98\%$ relative; $5.3\% \to 12.6\%$ on 7B, $+138\%$). On math the same recipe shifts the post-strict-oracle conditional yield substantially ($+11$ pp for Qwen2.5-7B-Instruct, $+17$ pp on Qwen2.5-3B-Instruct). On the significantly more costly and complex SWE domain, PROPEL doubles the rate of learnable-frontier bugs targeting a Qwen3.5-27B solver.

Evaluating cold transfer of probes across model families. We demonstrate cold transfer of a fixed probe across generator families, showing that a probe trained on Qwen3.5-4B drives substantial utility gains when swapped to Mistral-7B-Instruct-v0.3 and Phi-3.5-mini-instruct without any per-family retuning, evidence that the encoded utility signal generalizes across model families.
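The relative gains quoted in the contributions follow directly from the before/after frontier rates. The short check below recomputes them from the figures reported in the text; the helper name `relative_gain` is ours, chosen for illustration.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Frontier rates (percent of generations in the target band) as reported in the text.
reported = {
    "code induction, Qwen2.5-3B-Instruct solver": (10.1, 20.0),
    "code induction, Qwen2.5-7B-Instruct solver": (5.3, 12.6),
    "SWE, Qwen3.5-27B solver, held-out repositories": (9.8, 19.6),
}

for setting, (before, after) in reported.items():
    gain = relative_gain(before, after)
    print(f"{setting}: {before}% -> {after}% ({gain:+.0f}% relative)")
```

The SWE pair corresponds to the reported doubling (a relative gain of about 100%).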
## 2 Related Work
#### Synthetic task generation and self\-play\.
A growing body of work trains task generators that target the solver's learnable frontier, with two prior systems most directly informing our design. Absolute Zero (Zhao et al., [2025](https://arxiv.org/html/2606.18284#bib.bib63)) self-plays a proposer/solver pair across deduction, abduction, and induction tasks; we adopt its induction format and its core observation that the proposer needs a difficulty signal that is neither trivial nor unsolvable. Self-play SWE-RL (Wei et al., [2025](https://arxiv.org/html/2606.18284#bib.bib30)) ports this to bug injection vs. repair, with an injection reward that peaks for solver pass rates near the middle of the $0$–$1$ range, directly motivating our solver pass-at-$K$ utility. Beyond these, task-generation work spans math and symbolic reasoning (Liang et al., [2025a](https://arxiv.org/html/2606.18284#bib.bib65); Li et al., [2025](https://arxiv.org/html/2606.18284#bib.bib84); Liu et al., [2025a](https://arxiv.org/html/2606.18284#bib.bib82); Lacombe et al., [2025](https://arxiv.org/html/2606.18284#bib.bib83)) and software engineering (Sonwane et al., [2025](https://arxiv.org/html/2606.18284#bib.bib24); Pan et al., [2024](https://arxiv.org/html/2606.18284#bib.bib28); Jain et al., [2025](https://arxiv.org/html/2606.18284#bib.bib25); Xie et al., [2026](https://arxiv.org/html/2606.18284#bib.bib27); Zhang et al., [2025c](https://arxiv.org/html/2606.18284#bib.bib29); Wang et al., [2025](https://arxiv.org/html/2606.18284#bib.bib86); Zhu et al., [2025](https://arxiv.org/html/2606.18284#bib.bib26)), with adjacent work on synthetic-data quality (Chen and Zhong, [2025](https://arxiv.org/html/2606.18284#bib.bib31)), solver-side training (Da et al., [2025](https://arxiv.org/html/2606.18284#bib.bib57)), and abstraction generation (Qu et al., [2025](https://arxiv.org/html/2606.18284#bib.bib72)).
Methods that train a task generator with RL generally compute the reward by running the target solver on each candidate; we replace those rollouts with a single forward pass through a probe.
#### Probes and internal\-state rewards\.
Probes have been used to predict reasoning correctness (Zhang et al., [2025a](https://arxiv.org/html/2606.18284#bib.bib69); David, [2025](https://arxiv.org/html/2606.18284#bib.bib79); Cencerrado et al., [2025](https://arxiv.org/html/2606.18284#bib.bib78)), score best-of-$N$ candidates (Guo et al., [2025b](https://arxiv.org/html/2606.18284#bib.bib70)), and calibrate judges (Radharapu et al., [2025](https://arxiv.org/html/2606.18284#bib.bib80)). More recently, a series of works have explored using model internals as supervision during training (Zhang et al., [2026](https://arxiv.org/html/2606.18284#bib.bib71); Liang et al., [2025b](https://arxiv.org/html/2606.18284#bib.bib81); Prasad et al., [2026](https://arxiv.org/html/2606.18284#bib.bib61)). Most relevant to ours, Prasad et al. ([2026](https://arxiv.org/html/2606.18284#bib.bib61)) introduce *RL from Feature Rewards* (RLFR), a framework using probes over model internals as scalable reward functions for open-ended tasks. While RLFR isolates features associated with hallucinations, we train a probe to predict a task's training utility (as defined in Section [3.2](https://arxiv.org/html/2606.18284#S3.SS2)) from the internal features of a task generator. For SWE task generation, this requires us to extend RLFR to multi-turn trajectories.
#### Reward overoptimization and mode collapse\.
Optimizing any learned reward proxy is subject to Goodhart effects (Gao et al., [2023](https://arxiv.org/html/2606.18284#bib.bib68); Kwa et al., [2024](https://arxiv.org/html/2606.18284#bib.bib90); Moskovitz et al., [2023](https://arxiv.org/html/2606.18284#bib.bib95)) and mode collapse, with KL regularization itself capable of driving collapse rather than preventing it (GX-Chen et al., [2025](https://arxiv.org/html/2606.18284#bib.bib76)). Iterated RLHF retrains the proxy on the policy's collapsed outputs (Wolf et al., [2025](https://arxiv.org/html/2606.18284#bib.bib96)), and verbalized sampling preserves diversity at the decoding level (Zhang et al., [2025b](https://arxiv.org/html/2606.18284#bib.bib91)). Our adversarial probe co-evolution follows the iterated-feedback recipe: outputs that the probe scores highly but the solver fails become negatives for the next probe.
## 3 Probe Rewards for Optimizing Problems at the Edge of Learning

We study reinforcement learning for *task-generator* language models across three domains of increasing complexity: math competition tasks, code-induction puzzles (Absolute Zero Reasoner, AZR), and software-engineering tasks (SWE). We first introduce the formal setting, then discuss the shared framework PROPEL, and provide the per-domain specifics in Section [4.2](https://arxiv.org/html/2606.18284#S4.SS2).
### 3.1 Problem Setting

A generator policy $\pi_\theta$ synthesizes tasks $x \sim \pi_\theta(\cdot \mid c)$ given a context $c$ that contains a short instruction or examples, or that represents access to a Docker image with a passing test suite. A good task should meet two criteria. First, it should be valid: a math task should have a clear statement, a code-induction task should run, and a bug should apply cleanly and make a previously passing test fail. Second, it should be useful for training: a target solver should sometimes solve it and sometimes fail.

More formally, the low-cost validity check is a predicate $\mathcal{W}: \mathcal{X} \to \{0,1\}$, where $\mathcal{W}(x)=1$ means that $x$ is syntactically valid and can be graded or executed. The expensive signal is a utility label $U: \mathcal{X} \to \mathbb{R}$, computed from the solver's solve counts, or in the case of SWE, solver rollouts graded against tests. Our objective is to train $\pi_\theta$ to increase $\mathbb{E}_{x \sim \pi_\theta}[U(x)]$ while avoiding expensive calls to $U$ inside the RL loop.
### 3.2 Utility Signal for Task Generators

The main utility label asks whether a task lies at the target solver's learnable frontier. For a solver $S$ and $K$ attempts, let $\mu_S(x)$ be the mean solve rate over a group of $K$ attempts. The utility is defined as:

$$U_S(x) = \mathbb{I}[a \leq \mu_S(x) \leq b]. \tag{1}$$

Following the observation by Wei et al. ([2025](https://arxiv.org/html/2606.18284#bib.bib30)), the optimal solve rate range to target with a group size $K=8$ is $a=1/8$ and $b=3/8$. A task is positive if the solver solves it $1$, $2$, or $3$ times out of $8$. A task solved $0$ times is too hard for that solver, and a task solved $4$ to $8$ times is already too easy. This gives the generator credit for tasks that are learnable but not saturated. For SWE we reduce the number of solver trials to $K=3$ to mitigate the computational burden, and use the utility signal $U_S(x) = \mathbb{I}[1/3 \leq \mu_S(x) \leq 2/3]$. This slightly broader solve rate band is chosen to account for the small number of tasks generated by the base model that yield exactly one successful attempt and increases the pool of positive samples slightly.
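As a concrete reading of Equation 1, the sketch below labels a task from its solve count. The function name `frontier_utility` and the use of exact fractions are our own illustrative choices, not details from the paper.

```python
from fractions import Fraction


def frontier_utility(solves: int, k: int, low: Fraction, high: Fraction) -> int:
    """Return 1 if the empirical solve rate solves/k lies in [low, high], else 0."""
    if not 0 <= solves <= k:
        raise ValueError("solves must be between 0 and k")
    rate = Fraction(solves, k)  # exact arithmetic avoids float issues at the band edges
    return int(low <= rate <= high)


# Math and code induction: K = 8 with band [1/8, 3/8].
math_labels = [frontier_utility(s, 8, Fraction(1, 8), Fraction(3, 8)) for s in range(9)]
# SWE: K = 3 with band [1/3, 2/3].
swe_labels = [frontier_utility(s, 3, Fraction(1, 3), Fraction(2, 3)) for s in range(4)]

print(math_labels)  # [0, 1, 1, 1, 0, 0, 0, 0, 0]
print(swe_labels)   # [0, 1, 1, 0]
```

With $K=8$ only solve counts 1 to 3 are positive, and with $K=3$ only counts 1 and 2, matching the bands described above.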
### 3.3 Probe-reward RL

Evaluating $U$ during RL requires $K$ solver trials per task, which is slow and requires additional memory for solver inference during RL. Each attempt must also be verified by a grader, which can be costly, especially for SWE tasks where the grader is derived from a repository's test suite. Instead, we approximate $U$ with an activation probe $f_\phi$. The probe is a small classifier trained on a dataset of generated tasks whose expensive labels have been computed offline. During RL, a newly generated task is given to a frozen copy of the base generator, denoted $\pi_{\mathrm{ref}}$. We read hidden states from this frozen reference model, not from the policy $\pi_\theta$ being updated. For a generated task $x$, the reference model processes the rendered task text or trajectory, and we extract

$$h_L(x) = \mathrm{Pool}(\mathrm{Hidden}_L(x;\,\pi_{\mathrm{ref}})), \tag{2}$$

where $\mathrm{Hidden}_L(x;\,\pi_{\mathrm{ref}})$ are the layer-$L$ hidden states of $\pi_{\mathrm{ref}}$ on input $x$ and $\mathrm{Pool}(\cdot)$ is a pooling rule. The probe maps this vector to either a positive-class probability $p_\phi(x)$ or a raw logit score $\ell_\phi(x)$. The probability is useful for calibrated screening; the logit keeps more dynamic range for RL rewards. Freezing the reference model makes the reward a function of the generated task rather than a function of the changing hidden states of the RL policy.
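A minimal sketch of Equation 2 followed by a probe head, using toy numbers in place of real layer-$L$ activations. Mean pooling and a linear head are assumptions made for illustration (the paper sweeps pooling rules and linear vs. MLP heads without this text fixing one), and all names and values below are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(hidden_states: Sequence[Sequence[float]]) -> Vector:
    """Average per-token hidden states (tokens x dim) into a single vector."""
    n_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[d] for tok in hidden_states) / n_tokens for d in range(dim)]


def probe_logit(h: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Linear probe head: raw logit score for the pooled activation vector."""
    return sum(w * x for w, x in zip(weights, h)) + bias


def probe_probability(logit: float) -> float:
    """Positive-class probability obtained by applying a sigmoid to the logit."""
    return 1.0 / (1.0 + math.exp(-logit))


# Toy stand-in for frozen-reference activations: 2 tokens, hidden size 2.
hidden = [[1.0, 0.0], [3.0, 2.0]]
h = mean_pool(hidden)                       # [2.0, 1.0]
logit = probe_logit(h, [0.5, -1.0], 0.25)   # 0.5*2.0 - 1.0*1.0 + 0.25 = 0.25
prob = probe_probability(logit)
print(h, logit, round(prob, 4))
```

In a real pipeline the `hidden` array would come from a forward pass of the frozen reference model, so the same task text always yields the same reward regardless of how the policy has changed.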
#### RL reward signal\.
The main reward signal uses validity as a gate and the probe as the ranking signal. Invalid generations receive a fixed penalty $r_{\mathrm{bad}} < 0$; valid generations receive the raw probe logit:

$$R_{\text{hard}}(x,\phi) = \begin{cases} r_{\mathrm{bad}}, & \mathcal{W}(x)=0,\\ \ell_\phi(x), & \mathcal{W}(x)=1. \end{cases} \tag{3}$$

The validity predicate decides whether a task can be used at all, while the probe ranks valid tasks by estimated frontier utility. Ablations on the reward signal change only this combination rule. *Probe-only* drops the validity gate and rewards $p_\phi(x)$. *Soft-gated* rewards valid tasks with a clipped probability $\mathrm{clip}(p_\phi(x), 0.1, 0.95)$ instead of a raw logit, while still penalizing invalid tasks.
Additionally, we investigate an ensemble of probes optimized via worst-case optimization (WCO) to mitigate topic collapse and overfitting to specific features of an individual probe (Coste et al., [2024](https://arxiv.org/html/2606.18284#bib.bib9)):

$$R_{\text{WCO}}(x,\phi) = \begin{cases} r_{\mathrm{bad}}, & \mathcal{W}(x)=0,\\ \min_j \ell_{\phi_j}(x), & \mathcal{W}(x)=1. \end{cases} \tag{4}$$
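The reward variants in Equations 3 and 4 and the soft-gated ablation can be sketched as three small functions. The function names are ours; the default penalty of $-0.2$ is the value the paper reports for SWE and is used here only as an illustrative default for $r_{\mathrm{bad}}$.

```python
from typing import Sequence

R_BAD = -0.2  # value reported for SWE; treated here as an illustrative default


def hard_gated_reward(is_valid: bool, logit: float, r_bad: float = R_BAD) -> float:
    """Equation 3: fixed penalty for invalid tasks, raw probe logit otherwise."""
    return logit if is_valid else r_bad


def soft_gated_reward(is_valid: bool, prob: float, r_bad: float = R_BAD,
                      lo: float = 0.1, hi: float = 0.95) -> float:
    """Soft-gated ablation: clipped probe probability for valid tasks."""
    if not is_valid:
        return r_bad
    return min(max(prob, lo), hi)


def wco_reward(is_valid: bool, logits: Sequence[float], r_bad: float = R_BAD) -> float:
    """Equation 4: worst-case (minimum) logit over an ensemble of probes."""
    if not is_valid:
        return r_bad
    return min(logits)


print(hard_gated_reward(True, 1.3))          # 1.3
print(hard_gated_reward(False, 5.0))         # -0.2
print(soft_gated_reward(True, 0.99))         # 0.95
print(wco_reward(True, [1.2, -0.4, 0.7]))    # -0.4
```

Taking the minimum means a task is rewarded only if every probe in the ensemble scores it highly, which makes it harder for the generator to exploit a quirk of a single probe.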
#### Probe evaluation\.
The trained probes are evaluated with standard classification metrics (accuracy, balanced accuracy, F1-score) and a calibration metric (ECE). We find that probe accuracy and calibration alone are not enough to predict successful RL training. A probe that classifies held-out examples well but assigns almost the same reward to every task sampled from the current policy provides almost no signal during RL. We therefore measure reward variance under the base policy, $\mathrm{RVP}(f_\phi, \pi_{\mathrm{ref}}) = \mathrm{Var}_{x \sim \pi_{\mathrm{ref}}}\big[R(x,\phi)\big]$, where $R$ is the exact reward value used during RL, including validity gates and ensemble aggregation. RVP is measured before RL on a fixed sample from the base policy. We use it as a selection diagnostic together with calibration and balanced accuracy. We validate the RVP in Appendix [15](https://arxiv.org/html/2606.18284#S15).
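A minimal sketch of the RVP diagnostic on made-up reward samples. Whether the paper uses the population or the sample variance is not stated; population variance is assumed here, and the reward values are invented for illustration.

```python
from statistics import pvariance
from typing import Sequence


def reward_variance_under_policy(rewards: Sequence[float]) -> float:
    """Variance of the final RL reward over a fixed sample of base-policy tasks."""
    return pvariance(rewards)


# Hypothetical rewards for the same four base-policy tasks under two probes.
# Rewards already include the validity gate (here -0.2 marks an invalid task).
flat_probe = [0.50, 0.51, 0.49, 0.50]     # accurate-looking but nearly constant
spread_probe = [-0.2, 1.5, -1.0, 2.3]     # separates tasks clearly

rvp_flat = reward_variance_under_policy(flat_probe)
rvp_spread = reward_variance_under_policy(spread_probe)
print(rvp_flat, rvp_spread)
```

The nearly constant probe yields an RVP several orders of magnitude below the spread-out one, which is the failure case the diagnostic is meant to catch before any RL compute is spent.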
#### Generator RL\.
The generator policy is trained with GRPO for math and code (Shao et al., [2024](https://arxiv.org/html/2606.18284#bib.bib142)), while SWE uses DAPO-style advantage calculations with soft long-context penalties (Yu et al., [2025](https://arxiv.org/html/2606.18284#bib.bib136)). All methods use the probe-based reward $R$ for generator optimization. For each context $c$, the policy samples a group of $G$ candidate tasks $\{x_i\}_{i=1}^{G} \sim \pi_\theta(\cdot \mid c)$. Each candidate is passed through the frozen reference $\pi_{\mathrm{ref}}$ to extract $h_L(x_i)$, from which the probe produces the scalar reward $R(x_i,\phi)$. Implementation details for RL training of the generator are specified in Appendix [16](https://arxiv.org/html/2606.18284#S16). Because $R$ depends only on the rendered task and the frozen $\pi_{\mathrm{ref}}$, the policy cannot improve its reward by shifting the activations the probe reads (Prasad et al., [2026](https://arxiv.org/html/2606.18284#bib.bib61)). The full pipeline (Figure [1](https://arxiv.org/html/2606.18284#S1.F1)) is: collect an offline task pool, compute $U$ for that pool, extract frozen-reference activations, train and select probes, and run RL with $R$.
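To illustrate how probe rewards enter a GRPO-style update, the sketch below computes group-relative advantages by standardizing rewards within one group of $G$ candidates. This is the standard GRPO normalization, not the authors' exact implementation; the DAPO-style variant used for SWE and the `eps` stabilizer are outside what the text specifies.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of G = 4 candidate tasks scored by the probe (hypothetical rewards).
informative = group_relative_advantages([-0.2, 1.5, -1.0, 2.3])
# A probe that gives every candidate the same reward carries no learning signal.
uninformative = group_relative_advantages([0.5, 0.5, 0.5, 0.5])

print([round(a, 3) for a in informative])
print(uninformative)  # [0.0, 0.0, 0.0, 0.0]
```

The second case shows why reward variance matters for probe selection: identical rewards within a group produce zero advantages and therefore no gradient.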
## 4 Evaluation

### 4.1 Metrics

The core downstream metric is the realized utility of fresh generations: we sample tasks from the trained policy, then run the solver $K$ times per task and report the share whose realized solve count lands in the target band defined by $U_S$. The probe score is never used for evaluation; every result recomputes $U$ from solver rollouts. We additionally report three text-level diversity diagnostics over all generations per condition: *Self-BLEU-3* (Montahaei et al., [2019](https://arxiv.org/html/2606.18284#bib.bib2)), *Distinct-3* (unique 3-gram ratio; higher = more diverse), and *top-topic rate* (share of generations in the most common per-domain topic).
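Two of these metrics are simple enough to sketch directly. The code below computes realized utility from per-task solve counts and Distinct-3 as the ratio of unique to total 3-grams; whitespace tokenization and pooling n-grams over all generations are assumptions, since the text does not specify the tokenizer.

```python
from typing import Sequence


def realized_utility(solve_counts: Sequence[int], min_solves: int, max_solves: int) -> float:
    """Share of tasks whose solve count falls inside the target band (inclusive)."""
    hits = sum(1 for s in solve_counts if min_solves <= s <= max_solves)
    return hits / len(solve_counts)


def distinct_n(texts: Sequence[str], n: int = 3) -> float:
    """Unique n-grams divided by total n-grams, pooled over all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()  # assumption: whitespace tokenization
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


# K = 8 band: 1 to 3 solves out of 8 counts as learnable-frontier.
print(realized_utility([0, 1, 3, 4, 8, 2], min_solves=1, max_solves=3))  # 0.5
print(distinct_n(["a b c d", "a b c d"]))  # 0.5 (duplicated generations)
print(distinct_n(["a b c d", "e f g h"]))  # 1.0 (no repeated 3-grams)
```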
### 4.2 Domains
#### Math and AZR\.
For both math and code, the generator is Qwen3.5-4B (Qwen Team, [2026](https://arxiv.org/html/2606.18284#bib.bib3)) and we train probes against labels $U_S(x) = \mathbb{I}[1/8 \leq \mu_S(x) \leq 3/8]$ for two solver sizes, Qwen2.5-3B-Instruct (Qwen et al., [2025](https://arxiv.org/html/2606.18284#bib.bib7)) and Qwen2.5-7B-Instruct. The two domains differ only in the task format and how the ground truth solution is established.
In math, the generator is prompted with two in-context examples from GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.18284#bib.bib4)), MATH-500 (Hendrycks et al., [2021](https://arxiv.org/html/2606.18284#bib.bib33)), or AIME-style (Dekoninck et al., [2026](https://arxiv.org/html/2606.18284#bib.bib5)) pools and writes a competition-style task; a solver attempt is correct when its extracted answer is symbolically equivalent to a verified reference. References use a strict two-oracle policy: Qwen2.5-32B-Instruct (Qwen et al., [2025](https://arxiv.org/html/2606.18284#bib.bib7)) and Phi-4 (Abdin et al., [2024b](https://arxiv.org/html/2606.18284#bib.bib6)) each make two attempts, and a task is scored only when all four answers agree.
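The strict two-oracle policy can be sketched as an agreement filter over four answers. The paper checks symbolic equivalence; the toy `normalize` function below stands in for that check and is an assumption, as are all function names.

```python
from typing import Dict, List, Optional


def normalize(answer: str) -> str:
    """Toy canonical form (assumption): trim, drop a trailing period, remove spaces."""
    return answer.strip().rstrip(".").replace(" ", "").lower()


def strict_two_oracle_reference(answers_by_oracle: Dict[str, List[str]]) -> Optional[str]:
    """Return the shared answer if two oracles with two attempts each all agree."""
    if len(answers_by_oracle) != 2:
        return None
    if any(len(attempts) != 2 for attempts in answers_by_oracle.values()):
        return None
    answers = [normalize(a) for attempts in answers_by_oracle.values() for a in attempts]
    return answers[0] if len(set(answers)) == 1 else None


agree = strict_two_oracle_reference({
    "Qwen2.5-32B-Instruct": ["42", " 42 "],
    "Phi-4": ["42.", "42"],
})
disagree = strict_two_oracle_reference({
    "Qwen2.5-32B-Instruct": ["42", "42"],
    "Phi-4": ["42", "41"],
})
print(agree, disagree)  # 42 None
```

Tasks for which the filter returns `None` have no verified reference and are not scored, which is why math utility is reported on the post-oracle subset.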
In AZR we adopt the function-induction format of Zhao et al. ([2025](https://arxiv.org/html/2606.18284#bib.bib63)): given a function seed, the generator produces a triple $x = (f, \{u_1, \ldots, u_n\}, m)$ of a Python function, input tuples, and a natural-language hint; the solver sees five input-output examples plus the hint and must reproduce $f$'s behavior on the remaining five hidden inputs. Since the generator commits to executable ground truth, no oracle filter is needed. Validity gates, banned imports, sandboxing details, and the full generator prompts are in Appendix [18](https://arxiv.org/html/2606.18284#S18).
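A minimal sketch of how an induction task could be split and graded, assuming ten inputs (five visible, five hidden) and an all-hidden-inputs-must-match rule. Sandboxing, banned imports, and the hint are omitted, and the helper names are hypothetical.

```python
from typing import Any, Callable, List, Sequence, Tuple

Example = Tuple[Tuple[Any, ...], Any]


def make_induction_task(f: Callable[..., Any], inputs: Sequence[Tuple[Any, ...]],
                        n_visible: int = 5) -> Tuple[List[Example], List[Example]]:
    """Run the generator's function to get ground truth, then split the examples."""
    examples = [(u, f(*u)) for u in inputs]
    return examples[:n_visible], examples[n_visible:]


def grade_candidate(candidate: Callable[..., Any], hidden: Sequence[Example]) -> bool:
    """A solver attempt passes only if it matches f on every hidden input."""
    for args, expected in hidden:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # a crashing candidate counts as a failed attempt
    return True


# Hypothetical generated function and inputs.
visible, hidden = make_induction_task(lambda a, b: a * b + 1, [(i, i + 1) for i in range(10)])

print(len(visible), len(hidden))                        # 5 5
print(grade_candidate(lambda a, b: a * b + 1, hidden))  # True
print(grade_candidate(lambda a, b: a + b, hidden))      # False
```

Because the outputs come from executing the generator's own function, the ground truth is available without an external oracle, as the paragraph above notes.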
Both domains are trained with GRPO and LoRA adapters (Hu et al., [2021](https://arxiv.org/html/2606.18284#bib.bib215)); KL coefficients are $\beta=0.05$ (AZR) and $\beta=0.1$ (Math). Full hyperparameters are in Appendix [16](https://arxiv.org/html/2606.18284#S16).
#### SWE\.
The SWE setting is a bug-writing task on real Python repositories from the SWE-smith registry (Yang et al., [2026](https://arxiv.org/html/2606.18284#bib.bib34)), partitioned into repo-disjoint probe-training, RL-training, and evaluation splits (Appendix [13](https://arxiv.org/html/2606.18284#S13)). The generator, Qwen3.5-27B (Qwen Team, [2026](https://arxiv.org/html/2606.18284#bib.bib3)), acts as a multi-turn bash agent with a custom harness (see Appendix [13.3](https://arxiv.org/html/2606.18284#S13.SS3)) inside a sandboxed repository and must introduce a behavior-changing bug while leaving tests unmodified. The verifier checks that the patch applies, tests are unchanged, the suite still runs, and at least one previously passing test fails. For every valid bug we run $K=3$ solver trials with the same model under the OpenCode harness (Anomaly, [2026](https://arxiv.org/html/2606.18284#bib.bib8)), each in a fresh sandbox with the bug-introducing patch produced by the generator applied; a trial is solved iff the solver's final edit passes the verifier suite.
Unlike math and AZR, a SWE task is a full agent trajectory: we extract activations per generated agent turn and aggregate the turn sequence before probing\. The probe sweep varies reference\-model layer, turn aggregation, probe architecture, and class\-imbalance handling, and selects a fixed probe before RL \(see Appendix[14\.2](https://arxiv.org/html/2606.18284#S14.SS2)\)\.
RL uses SkyRL (Cao et al., [2025](https://arxiv.org/html/2606.18284#bib.bib62)) with vLLM rollout engines (Kwon et al., [2023](https://arxiv.org/html/2606.18284#bib.bib1)) and per-repository Docker sandboxes; training rollouts are scored by the repository verifier and the selected activation probe, while the expensive $K=3$ OpenCode (Anomaly, [2026](https://arxiv.org/html/2606.18284#bib.bib8)) solver labels are reserved for offline probe training and held-out evaluation. The generator is trained for $30$ steps with up to $35$ tool-using turns per trajectory. Invalid generations receive a fixed reward of $-0.2$. Full hyperparameters are in Appendix [16](https://arxiv.org/html/2606.18284#S16).
## 5 Probe Training Data and Selection
This section reports the probe\-training data and the selected probes used by PROPEL\. The implementation\-level sweep, activation\-extraction choices, optimizer settings, and selection details are in Appendix[14](https://arxiv.org/html/2606.18284#S14)\.
### 5.1 Probe-training data

For each (domain, target solver) pair we collect one probe-training set, which is class-balanced for math and AZR. For SWE the positive class is too sparse to discard data, so we instead train with a class-reweighted loss on all $1{,}757$ rows of the Probe-Train split. The resulting dataset sizes are reported in Table [1](https://arxiv.org/html/2606.18284#S5.T1); additional details can be found in Appendices [14](https://arxiv.org/html/2606.18284#S14) and [13](https://arxiv.org/html/2606.18284#S13).
Table 1: Probe-training dataset sizes per (domain, target solver). Positives are tasks in the targeted solve-rate band defined in Section [3.2](https://arxiv.org/html/2606.18284#S3.SS2). Math/AZR use balanced datasets; SWE retains all rows and reweights the loss to balance the two classes.

Table 2: Probes used as the RL reward, one per (domain, target solver). Validation balanced accuracy and ECE are computed from the random 80/10/10 split for math and AZR, and from 5-fold cross-repository CV for SWE (mean across folds, averaged over 3 seeds). All ECE values are pre-calibration; the SWE probe additionally has a post-hoc temperature calibrator applied at RL-scoring time. RVP is the reward variance under the base policy, computed on a fixed sample of $n=512$ base-policy completions; SWE RVP is omitted due to computational cost.
### 5.2 Probe Selection

For math and AZR we sweep reference model layer, token pooling, and linear *vs.* MLP probe heads. For SWE, where each task is a multi-turn agent trajectory, we also sweep trajectory-level pooling and class-imbalance handling. We only use the selected probe for each (domain, target solver). Probe selection uses held-out balanced accuracy and calibration, with reward variance under policy (RVP) as the final Math/AZR tiebreaker on a fixed sample of $n=512$ base-policy completions. For SWE, the held-out split is cross-repository: each fold holds out entire repositories, so the selected probe must transfer beyond repositories seen during probe training. Table [2](https://arxiv.org/html/2606.18284#S5.T2) lists the selected probes used as the RL reward.

The RVP of the selected Math/AZR probes spans more than an order of magnitude ($0.008$ to $0.145$). Downstream RL gain (Table [4](https://arxiv.org/html/2606.18284#S6.T4)) tracks RVP more closely than validation balanced accuracy: AZR, the highest-RVP domain, shows the largest gain, while math (lowest RVP) shows slightly smaller gains.
## 6 Results
#### Direct solver\-in\-the\-loop RL is more costly and performs worse\.
To isolate and quantify the solver bottleneck, we train the generator with the ground truth utility signal evaluated during training. This baseline is denoted as solver-in-the-loop (SIL) RL and is run on the AZR domain. The results shown in Table [3](https://arxiv.org/html/2606.18284#S6.T3) confirm that online solver feedback can improve the generator, but at significantly higher cost: training for 30 steps consumes $53{,}664$ solver trials, and requires a co-located Qwen2.5-3B-Instruct vLLM solver server during RL. Note that this is in the cheaper AZR domain where solver trials are orders of magnitude faster than in SWE. In contrast, PROPEL uses $22{,}592$ solver trials collected offline to train the probe, then replaces the online solver with the frozen-reference forward pass plus a small probe head. It makes no solver calls during generator RL, while achieving a utility improvement over the base model more than twice as large as SIL (Table [3](https://arxiv.org/html/2606.18284#S6.T3)).
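The solver-trial budgets above can be compared with a few lines of arithmetic, using only the two counts and the step count reported in this paragraph; the variable names are ours.

```python
sil_trials = 53_664      # online solver trials consumed by 30 steps of SIL RL
propel_trials = 22_592   # offline solver trials used to label the probe corpus
rl_steps = 30

ratio = propel_trials / sil_trials
sil_trials_per_step = sil_trials / rl_steps

print(f"PROPEL uses {ratio:.1%} of the SIL solver-trial budget")
print(f"SIL averages {sil_trials_per_step:.1f} solver trials per RL step")
```

The ratio is roughly 42%, consistent with the claim that PROPEL needs less than half of the solver trials. PROPEL's budget is also a one-time cost, whereas the SIL count grows with the number of RL steps.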
Table 3: On the AZR domain, PROPEL outperforms the more expensive solver\-in\-the\-loop \(SIL\) baseline, reaching a higher utility lift over the base model despite requiring fewer solver trials and less peak memory\. PROPEL results are reported for Qwen2\.5\-3B\-Instruct across 3 random seeds with one standard deviation\.
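The cost gap can be made concrete with a short calculation\. The snippet below uses only the two solver\-trial counts and the 30\-step horizon quoted above; the variable names are illustrative\.

```python
# Solver-trial counts quoted in the text for the AZR comparison.
sil_trials = 53_664      # consumed online by solver-in-the-loop RL over 30 steps
propel_trials = 22_592   # collected offline, once, to train the probe

ratio = sil_trials / propel_trials
saving = 1 - propel_trials / sil_trials
sil_trials_per_step = sil_trials / 30

print(f"SIL uses {ratio:.2f}x the solver trials of PROPEL")
print(f"PROPEL needs {saving:.1%} fewer solver trials")
print(f"SIL averages {sil_trials_per_step:.0f} solver trials per RL step")
```

Because the SIL cost is incurred at every step while the probe corpus is collected once, this gap would be expected to widen for longer runs, assuming the per\-step solver cost stays roughly constant\.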
#### PROPEL shifts generators toward the edge of learning\.
Across the three domains, PROPEL yields significant utility gains over the baseline generator \(see Figure [2](https://arxiv.org/html/2606.18284#S1.F2) and Table [4](https://arxiv.org/html/2606.18284#S6.T4)\)\. On code induction, PROPEL roughly doubles $U_S$ across the output distribution at both solver sizes\. On math, PROPEL yields a 1\.7× lift at both target sizes on the oracle\-validated output distribution\. The multi\-turn SWE setting reproduces these gains at the substantially larger Qwen3\.5\-27B generator and solver scale and exhibits strong out\-of\-distribution generalization along two axes\. First, the RL training repositories are themselves out\-of\-distribution for the probe, yet PROPEL still improves utility on them by a clear margin\. Second, on a set of 550 tasks generated from a held\-out set of 11 repositories never seen during RL, the same checkpoint delivers a 2\.0× improvement\. Together these results show that the gains transfer beyond the probe’s training distribution and beyond the RL training distribution, rather than reflecting overfitting at either stage\. This confirms that a notion of solve rate and task difficulty is encoded in the hidden activations of LLMs across domains and at various model scales, and that task generators can be consistently optimized to maximize $U_S(\cdot)$ via probe rewards\.
Table 4: Evaluation results on AZR, math, and SWE\. Utility is $U_S(\cdot)$, defined in Equation [1](https://arxiv.org/html/2606.18284#S3.E1)\. Valid is the rate at which generated tasks pass the validity predicate $\mathcal{W}(\cdot)$ \(task specification rules for math, sandbox execution for AZR, bug verification for SWE\)\. Self\-BLEU\-3 and Distinct\-3 measure text diversity; top\-topic is the share of the most common semantic topic \(top\-edited\-file rate for SWE\)\. The SWE block reports the Qwen3\.5\-27B generator on its 11 RL\-training repositories and on 11 held\-out \(OOD\) repositories with 50 tasks per repo\. Bold marks the within\-\(domain, solver\) winner per metric\. Values are mean ± standard error across RL seeds \(SWE is a single run\)\.
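Two of the diversity columns are simple enough to sketch\. The code below shows a corpus\-level Distinct\-3 \(unique trigrams over total trigrams\) and a top\-topic share\. Whitespace tokenization, corpus\-level pooling, and the example texts and topic labels are assumptions for illustration; the paper's exact tokenizer and topic classifier are not described in this section\.

```python
from collections import Counter


def distinct_n(texts, n=3):
    """Unique n-grams divided by total n-grams over a set of texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()  # assumption: whitespace tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def top_topic_share(topics):
    """Fraction of items assigned to the single most common topic label."""
    if not topics:
        return 0.0
    return Counter(topics).most_common(1)[0][1] / len(topics)


# Hypothetical generations and topic labels, for illustration only.
texts = [
    "sort the list in ascending order",
    "sort the list in descending order",
    "count the vowels in a string",
]
topics = ["sorting_order", "sorting_order", "string_ops"]

d3 = distinct_n(texts, n=3)
share = top_topic_share(topics)
```

Lower Distinct\-3 and a higher top\-topic share both point toward a more concentrated output distribution\.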
#### Mode collapse and mitigations\.
Optimizing against a single probe yields the largest utility gains but also the strongest semantic concentration\. On AZR with Qwen2\.5\-3B\-Instruct as the solver, it concentrates approximately 74% of generated tasks on the single semantic topic sorting\_order \(Table [4](https://arxiv.org/html/2606.18284#S6.T4)\)\. While the distribution of tasks inside the targeted utility band is expected to be narrower, we attempt to mitigate the loss of diversity via an ensemble of two probes \(Coste et al\., [2024](https://arxiv.org/html/2606.18284#bib.bib9)\)\. PROPEL with worst\-case optimization \(WCO\), as defined in Equation [4](https://arxiv.org/html/2606.18284#S3.E4), reduces top\-topic concentration on AZR with the Qwen2\.5\-7B\-Instruct solver from 0\.67 to 0\.54 while maintaining significant utility gains over the base generator\. Similarly, for the Qwen2\.5\-3B\-Instruct solver, WCO lowers top\-topic concentration from 0\.74 to 0\.69 while retaining a \+71% utility increase over the base generator\. Additionally, we investigated adversarial probe co\-evolution, in which false positives of the probe \(valid tasks scored as useful but falling outside the true utility band\) are mined as negatives to train an auxiliary probe that then constrains the generator’s reward\. A small investigation on AZR \(Appendix [12](https://arxiv.org/html/2606.18284#S12)\) recovers most of the probe utility and validity gains without the accompanying semantic collapse, providing preliminary evidence that this failure mode is mineable rather than terminal\.
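A minimal sketch of worst\-case optimization over a probe ensemble follows\. Equation 4 is not reproduced in this section, so the sketch assumes the usual form from the reward\-ensemble literature, where the reward is the minimum score across ensemble members\. The function names and scores are hypothetical\.

```python
def worst_case_reward(probe_scores):
    """Worst-case optimization over an ensemble: take the minimum score."""
    return min(probe_scores)


def mean_reward(probe_scores):
    """Plain ensemble average, shown only for comparison."""
    return sum(probe_scores) / len(probe_scores)


# Hypothetical scores from two probes for three generated tasks.
ensemble_scores = {
    "task_a": (0.80, 0.75),  # both probes agree the task is useful
    "task_b": (0.95, 0.10),  # the probes disagree strongly
    "task_c": (0.20, 0.25),  # both probes agree the task is not useful
}

wco = {k: worst_case_reward(v) for k, v in ensemble_scores.items()}
avg = {k: mean_reward(v) for k, v in ensemble_scores.items()}
```

Taking the minimum means a task is rewarded only when every probe scores it highly, so a generation that exploits one probe \(like `task_b` here\) earns less than it would under averaging\.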
#### Cold transfer of the probe to different generator families\.
The probe and the Qwen3\.5\-4B reference model are held fixed, while the trainable generator policy is swapped to either Mistral\-7B\-Instruct\-v0\.3 \(Jiang et al\., [2023](https://arxiv.org/html/2606.18284#bib.bib211)\) or Phi\-3\.5\-mini\-instruct \(Abdin et al\., [2024a](https://arxiv.org/html/2606.18284#bib.bib212)\), with all other hyperparameters inherited from the in\-family RL training on Qwen3\.5\-4B; that is, we perform no per\-family tuning\. This is an explicit cold transfer of the probe, designed to bound how far the recipe carries without re\-tuning\. The results are reported in Figure [3](https://arxiv.org/html/2606.18284#S6.F3)\. On AZR with the Qwen2\.5\-3B\-Instruct solver, Mistral\-7B\-Instruct\-v0\.3 shows a clear increase even without tuning: format validity rises from 39\.6% to 67\.4% and utility from 5\.2% to 9\.6% after 30 steps, a relative increase over its base model similar to that of the in\-family reference\. The utility increase when transferring to Phi\-3\.5\-mini\-instruct is more modest on AZR\. On math, PROPEL on Mistral\-7B\-Instruct\-v0\.3 yields a significant increase, with utility rising from 7\.4% to 26\.4%, though diversity metrics degrade\. The strong transfer of the fixed probe to Mistral\-7B\-Instruct\-v0\.3 and the more modest transfer to Phi\-3\.5\-mini\-instruct provide evidence that the encoded utility signal is represented across models and that probe retraining and new data collection may not be required when changing the trainable generator\.
Figure 3: Cross\-family probe transfer \(n=1024 generations per condition\)\. The probe and the Qwen3\.5\-4B reference model are fixed; only the trainable policy varies\. Hyperparameters are inherited from the in\-family training runs with no per\-family tuning\. The Qwen∗ \(in\-family\) results initialize the trained policy from the reference model\.
#### Ablations on KL coefficient and reward composition\.
On math with the Qwen2\.5\-7B\-Instruct solver and on AZR with Qwen2\.5\-3B\-Instruct, we ablate the KL regularization strength in Figure [4\(a\)](https://arxiv.org/html/2606.18284#S6.F4.sf1)\. This informed the choice of β=0\.1 for math and 0\.05 for AZR\. The results show a clear tradeoff: low values of β give high utility but low diversity and high topic collapse, and both utility and topic collapse decrease as β increases\. At a KL coefficient of 0\.02 the training run on math fully collapsed\. Next, we ablate the reward composition \(*probe\-only*, *soft\-gate*, *hard\-gate*\) optimized during RL\. The results in Figure [4\(b\)](https://arxiv.org/html/2606.18284#S6.F4.sf2) show that the *hard\-gate* variant is consistently the best joint operating point: it matches or beats the others on utility while leaving diversity competitive\.
\(a\) KL ablation\.
\(b\) Reward composition ablation\.
Figure 4: Varying the KL regularization strength trades off utility against diversity and topic collapse\. The reward\-composition ablation shows that the effect of the reward on the optimization dynamics also depends on the probe\.
## 7Conclusion
#### Summary\.
We have shown that activation probes trained on a frozen reference model provide a single\-forward\-pass surrogate for solve rate, and that this surrogate is strong enough to serve as the RL reward for a task generator\. Optimizing against the probe shifts generator output toward the learnable frontier of the target solver: on coding tasks the rate at which AZR generators produce frontier\-band tasks roughly doubles\. Additionally, we provide preliminary evidence that the probe and reference model transfer across generator families, suggesting that the activation signal reflects properties of the task itself rather than of any particular generator\.
#### Limitations\.
Our experiments cover three domains \(math, code induction, software engineering\); broader coverage of families, scales, and task types is needed before treating these results as general\. The probe is trained once on the reference generator’s activations and held fixed during RL; under sustained policy drift this surrogate can degrade, and the mode\-collapse behavior characterized in Section [6](https://arxiv.org/html/2606.18284#S6) is one consequence\. Our mitigations \(probe ensembling, adversarial co\-evolution\) are partial and trade efficacy for diversity\. Finally, the labeled corpus that anchors the probe still requires solver rollouts to construct; PROPEL amortizes this cost across generator training but does not eliminate it\.
#### Future work\.
The most immediate extension is broader empirical coverage: more domains, more generator and solver families, and larger scales where the cost asymmetry between solver rollouts and probe forward passes is more favorable still\.
Additionally, throughout this work we treat task generation as a static distributional target: produce tasks near the learnable frontier of a *fixed* solver\. In practice, the solver is itself being trained, and the utility of a task depends on what the solver has already mastered\. A natural extension is to condition the generator on a trace of recently attempted tasks and their outcomes so the policy can adapt its target distribution as the solver progresses\. Doing so induces an inner\-outer loop that is expensive to optimize jointly; how to stabilize it without compounding instabilities across loops is an open question\. Using cheap signal from model internals amortizes the dominant cost of this meta\-loop \(per\-iteration solver rollouts\) and could provide a tractable path to open\-ended learning systems \(Hughes et al\., [2024](https://arxiv.org/html/2606.18284#bib.bib214)\)\.
Many tasks we ultimately care about \(scientific discovery, theorem proving, long\-horizon agentic work\) have rewards too sparse for direct RL to be effective\. A possible extension is to use the generator to propose *goals* rather than full tasks, producing intermediate objectives whose utility decomposes along axes such as achievability, novelty, and relevance \(Diaz\-Bone et al\., [2025](https://arxiv.org/html/2606.18284#bib.bib88)\)\.
#### Outlook\.
Verifiable rewards have carried the current generation of reasoning and agentic models, but the supply of tasks where rewards exist and are cheap and clean is finite\. As models saturate existing benchmarks and environments, the bottleneck shifts to producing the right tasks at the right time \(a problem that increasingly looks like an RL problem itself\)\. Internal\-state rewards offer a way to break the solver\-in\-the\-loop cost barrier that otherwise makes generator RL intractable, and a route into domains where ground\-truth verifiers do not exist\.
## References
- M\. Abdin, J\. Aneja, H\. Awadalla, A\. Awadallah, A\. A\. Awan, N\. Bach, A\. Bahree, A\. Bakhtiari, J\. Bao, H\. Behl, A\. Benhaim, M\. Bilenko, J\. Bjorck, S\. Bubeck, M\. Cai, Q\. Cai, V\. Chaudhary, D\. Chen, D\. Chen, W\. Chen, Y\. Chen, Y\. Chen, H\. Cheng, P\. Chopra, X\. Dai, M\. Dixon, R\. Eldan, V\. Fragoso, J\. Gao, M\. Gao, M\. Gao, A\. Garg, A\. D\. Giorno, A\. Goswami, S\. Gunasekar, E\. Haider, J\. Hao, R\. J\. Hewett, W\. Hu, J\. Huynh, D\. Iter, S\. A\. Jacobs, M\. Javaheripi, X\. Jin, N\. Karampatziakis, P\. Kauffmann, M\. Khademi, D\. Kim, Y\. J\. Kim, L\. Kurilenko, J\. R\. Lee, Y\. T\. Lee, Y\. Li, Y\. Li, C\. Liang, L\. Liden, X\. Lin, Z\. Lin, C\. Liu, L\. Liu, M\. Liu, W\. Liu, X\. Liu, C\. Luo, P\. Madan, A\. Mahmoudzadeh, D\. Majercak, M\. Mazzola, C\. C\. T\. Mendes, A\. Mitra, H\. Modi, A\. Nguyen, B\. Norick, B\. Patra, D\. Perez\-Becker, T\. Portet, R\. Pryzant, H\. Qin, M\. Radmilac, L\. Ren, G\. de Rosa, C\. Rosset, S\. Roy, O\. Ruwase, O\. Saarikivi, A\. Saied, A\. Salim, M\. Santacroce, S\. Shah, N\. Shang, H\. Sharma, Y\. Shen, S\. Shukla, X\. Song, M\. Tanaka, A\. Tupini, P\. Vaddamanu, C\. Wang, G\. Wang, L\. Wang, S\. Wang, X\. Wang, Y\. Wang, R\. Ward, W\. Wen, P\. Witte, H\. Wu, X\. Wu, M\. Wyatt, B\. Xiao, C\. Xu, J\. Xu, W\. Xu, J\. Xue, S\. Yadav, F\. Yang, J\. Yang, Y\. Yang, Z\. Yang, D\. Yu, L\. Yuan, C\. Zhang, C\. Zhang, J\. Zhang, L\. L\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, and X\. Zhou \(2024a\)Phi\-3 technical report: a highly capable language model locally on your phone\.External Links:2404\.14219,[Link](https://arxiv.org/abs/2404.14219)Cited by:[§6](https://arxiv.org/html/2606.18284#S6.SS0.SSS0.Px4.p1.7)\.
- M\. Abdin, J\. Aneja, H\. Behl, S\. Bubeck, R\. Eldan, S\. Gunasekar, M\. Harrison, R\. J\. Hewett, M\. Javaheripi, P\. Kauffmann, J\. R\. Lee, Y\. T\. Lee, Y\. Li, W\. Liu, C\. C\. T\. Mendes, A\. Nguyen, E\. Price, G\. de Rosa, O\. Saarikivi, A\. Salim, S\. Shah, X\. Wang, R\. Ward, Y\. Wu, D\. Yu, C\. Zhang, and Y\. Zhang \(2024b\)Phi\-4 technical report\.External Links:2412\.08905,[Link](https://arxiv.org/abs/2412.08905)Cited by:[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px1.p2.1)\.
- Anomaly \(2026\)OpenCode: the open source coding agent\.Note:[https://github\.com/anomalyco/opencode](https://github.com/anomalyco/opencode)GitHub repository, accessed 2026\-05\-05Cited by:[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px2.p3.4)\.
- S\. Cao, S\. Hegde, D\. Li, T\. Griggs, S\. Liu, E\. Tang, J\. Pan, X\. Wang, A\. Malik, G\. Neubig, K\. Hakhamaneshi, R\. Liaw, P\. Moritz, M\. Zaharia, J\. E\. Gonzalez, and I\. Stoica \(2025\)SkyRL\-v0: train real\-world long\-horizon agents via reinforcement learning\.Cited by:[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px2.p3.4)\.
- I\. V\. M\. Cencerrado, A\. P\. Masdemont, A\. G\. Hawthorne, D\. D\. Africa, and L\. Pacchiardi \(2025\)No answer needed: predicting llm answer accuracy from question\-only linear probes\.External Links:2509\.10625,[Link](https://arxiv.org/abs/2509.10625)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Chen and V\. Zhong \(2025\)SynQuE: estimating synthetic dataset quality without annotations\.arXiv preprint arXiv:2511\.03928\.Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.External Links:2110\.14168,[Link](https://arxiv.org/abs/2110.14168)Cited by:[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px1.p2.1)\.
- T\. Coste, U\. Anwar, R\. Kirk, and D\. Krueger \(2024\)Reward model ensembles help mitigate overoptimization\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=dcjtMYkpXx)Cited by:[§3\.3](https://arxiv.org/html/2606.18284#S3.SS3.SSS0.Px1.p2.1),[§6](https://arxiv.org/html/2606.18284#S6.SS0.SSS0.Px3.p1.6)\.
- J\. Da, C\. Wang, X\. Deng, Y\. Ma, N\. Barhate, and S\. Hendryx \(2025\)Agent\-rlvr: training software engineering agents via guidance and environment rewards\.External Links:2506\.11425,[Link](https://arxiv.org/abs/2506.11425)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
- J\. David \(2025\)Temporal predictors of outcome in reasoning language models\.External Links:2511\.14773,[Link](https://arxiv.org/abs/2511.14773)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Dekoninck, N\. Jovanović, T\. Gehrunger, K\. Rögnvalddson, I\. Petrov, C\. Sun, and M\. Vechev \(2026\)Beyond benchmarks: matharena as an evaluation platform for mathematics with llms\.External Links:2605\.00674,[Link](https://arxiv.org/abs/2605.00674)Cited by:[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px1.p2.1)\.
- L\. Diaz\-Bone, M\. Bagatella, J\. Hübotter, and A\. Krause \(2025\)DISCOVER: automated curricula for sparse\-reward reinforcement learning\.External Links:2505\.19850,[Link](https://arxiv.org/abs/2505.19850)Cited by:[§7](https://arxiv.org/html/2606.18284#S7.SS0.SSS0.Px3.p3.1)\.
- L\. Gao, J\. Schulman, and J\. Hilton \(2023\)Scaling laws for reward model overoptimization\.InProceedings of the 40th International Conference on Machine Learning,External Links:[Link](https://proceedings.mlr.press/v202/gao23h.html)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi,et al\.\(2025a\)Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§1](https://arxiv.org/html/2606.18284#S1.p1.1)\.
- J\. Guo, Z\. Wu, H\. Yang, and P\. S\. Yu \(2025b\)Mining intrinsic rewards from llm hidden states for efficient best\-of\-n sampling\.External Links:2505\.12225,[Link](https://arxiv.org/abs/2505.12225)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px2.p1.1)\.
- A\. GX\-Chen, J\. Prakash, J\. Guo, R\. Fergus, and R\. Ranganath \(2025\)KL\-regularized reinforcement learning is designed to mode collapse\.External Links:2510\.20817,[Link](https://arxiv.org/abs/2510.20817)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the math dataset\.External Links:2103\.03874,[Link](https://arxiv.org/abs/2103.03874)Cited by:[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px1.p2.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2021\)LoRA: low\-rank adaptation of large language models\.External Links:2106\.09685,[Link](https://arxiv.org/abs/2106.09685)Cited by:[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px1.p4.2)\.
- E\. Hughes, M\. Dennis, J\. Parker\-Holder, F\. Behbahani, A\. Mavalankar, Y\. Shi, T\. Schaul, and T\. Rocktaschel \(2024\)Open\-endedness is essential for artificial superhuman intelligence\.External Links:2406\.04268,[Link](https://arxiv.org/abs/2406.04268)Cited by:[§7](https://arxiv.org/html/2606.18284#S7.SS0.SSS0.Px3.p2.1)\.
- N\. Jain, J\. Singh, M\. Shetty, L\. Zheng, K\. Sen, and I\. Stoica \(2025\)R2E\-gym: procedural environments and hybrid verifiers for scaling open\-weights swe agents\.arXiv preprint arXiv:2504\.07164\.Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed \(2023\)Mistral 7b\.External Links:2310\.06825,[Link](https://arxiv.org/abs/2310.06825)Cited by:[§6](https://arxiv.org/html/2606.18284#S6.SS0.SSS0.Px4.p1.7)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. R\. Narasimhan \(2024\)SWE\-bench: can language models resolve real\-world github issues?\.InThe Twelfth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.18284#S1.p2.1)\.
- T\. Kwa, D\. Thomas, and A\. Garriga\-Alonso \(2024\)Catastrophic goodhart: regularizing rlhf with kl divergence does not mitigate heavy\-tailed reward misspecification\.External Links:2407\.14503,[Link](https://arxiv.org/abs/2407.14503)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px3.p1.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\)Efficient memory management for large language model serving with pagedattention\.External Links:[Link](https://arxiv.org/abs/2309.06180)Cited by:[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px2.p3.4)\.
- V\. Lacombe, V\. Quesnel, and D\. Sileo \(2025\)Reasoning core: a scalable rl environment for llm symbolic reasoning\.External Links:2509\.18083,[Link](https://arxiv.org/abs/2509.18083)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
- N\. Lambert \(2025\)Reinforcement learning from human feedback\.arXiv preprint arXiv:2504\.12501\.Cited by:[§1](https://arxiv.org/html/2606.18284#S1.p1.1)\.
- J\. Li, H\. Lin, H\. Lu, K\. Wen, Z\. Yang, J\. Gao, Y\. Wu, and J\. Zhang \(2025\)QuestA: expanding reasoning capacity in llms via question augmentation\.External Links:2507\.13266,[Link](https://arxiv.org/abs/2507.13266)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
- X\. Liang, Z\. Li, Y\. Gong, Y\. Wang, H\. Zhang, Y\. Shen, Y\. N\. Wu, and W\. Chen \(2025a\)SwS: self\-aware weakness\-driven problem synthesis in reinforcement learning for llm reasoning\.External Links:2506\.08989,[Link](https://arxiv.org/abs/2506.08989)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
- Z\. Liang, R\. Li, Y\. Zhou, L\. Song, D\. Yu, X\. Du, H\. Mi, and D\. Yu \(2025b\)CLUE: non\-parametric verification from experience via hidden\-state clustering\.External Links:2510\.01591,[Link](https://arxiv.org/abs/2510.01591)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Liu, G\. Li, J\. Li, H\. Zhu, K\. Zhang, and Y\. Dong \(2025a\)SATURN: sat\-based reinforcement learning to unleash llms reasoning\.External Links:2505\.16368,[Link](https://arxiv.org/abs/2505.16368)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
- M\. Liu, S\. Diao, X\. Lu, J\. Hu, X\. Dong, Y\. Choi, J\. Kautz, and Y\. Dong \(2025b\)Prorl: prolonged reinforcement learning expands reasoning boundaries in large language models\.arXiv preprint arXiv:2505\.24864\.Cited by:[§1](https://arxiv.org/html/2606.18284#S1.p1.1)\.
- E\. Montahaei, D\. Alihosseini, and M\. S\. Baghshah \(2019\)Jointly measuring diversity and quality in text generation models\.External Links:1904\.03971,[Link](https://arxiv.org/abs/1904.03971)Cited by:[§4\.1](https://arxiv.org/html/2606.18284#S4.SS1.p1.3)\.
- T\. Moskovitz, A\. K\. Singh, D\. Strouse, T\. Sandholm, R\. Salakhutdinov, A\. D\. Dragan, and S\. McAleer \(2023\)Confronting reward model overoptimization with constrained rlhf\.External Links:2310\.04373,[Link](https://arxiv.org/abs/2310.04373)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Orgad, M\. Toker, Z\. Gekhman, R\. Reichart, I\. Szpektor, H\. Kotek, and Y\. Belinkov \(2024\)Llms know more than they show: on the intrinsic representation of llm hallucinations\.arXiv preprint arXiv:2410\.02707\.Cited by:[§1](https://arxiv.org/html/2606.18284#S1.p3.1)\.
- J\. Pan, X\. Wang, G\. Neubig, N\. Jaitly, H\. Ji, A\. Suhr, and Y\. Zhang \(2024\)Training software engineering agents and verifiers with swe\-gym\.arXiv preprint arXiv:2412\.21139\.Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
- A\. V\. Prasad, C\. Watts, J\. Merullo, D\. Gala, O\. Lewis, T\. McGrath, and E\. S\. Lubana \(2026\)Features as rewards: scalable supervision for open\-ended tasks via interpretability\.External Links:2602\.10067,[Link](https://arxiv.org/abs/2602.10067)Cited by:[§1](https://arxiv.org/html/2606.18284#S1.p3.1),[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px2.p1.1),[§3\.3](https://arxiv.org/html/2606.18284#S3.SS3.SSS0.Px3.p1.11)\.
- Y\. Qu, A\. Singh, Y\. Lee, A\. Setlur, R\. Salakhutdinov, C\. Finn, and A\. Kumar \(2025\)RLAD: training llms to discover abstractions for solving reasoning problems\.External Links:2510\.02263,[Link](https://arxiv.org/abs/2510.02263)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
- Qwen Team \(2026\)Qwen3\.5: towards native multimodal agents\.External Links:[Link](https://qwen.ai/blog?id=qwen3.5)Cited by:[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px2.p1.1)\.
- Qwen, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px1.p2.1)\.
- B\. Radharapu, E\. Saxena, K\. Li, C\. Whitehouse, A\. Williams, and N\. Cancedda \(2025\)Calibrating llm judges: linear probes for fast and reliable uncertainty estimation\.External Links:2512\.22245,[Link](https://arxiv.org/abs/2512.22245)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300,[Link](https://arxiv.org/abs/2402.03300)Cited by:[Table 12](https://arxiv.org/html/2606.18284#S16.T12.74.70.78.8.2),[§3\.3](https://arxiv.org/html/2606.18284#S3.SS3.SSS0.Px3.p1.11)\.
- A\. Sonwane, I\. White, H\. Lee, M\. Pereira, L\. Caccia, M\. Kim, Z\. Shi, C\. Singh, A\. Sordoni, M\. Côté, and X\. Yuan \(2025\)BugPilot: complex bug generation for efficient learning of swe skills\.External Links:2510\.19898,[Link](https://arxiv.org/abs/2510.19898)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
- H\. Wang, Z\. Hou, Y\. Wei, J\. Tang, and Y\. Dong \(2025\)SWE\-dev: building software engineering agents with training and inference scaling\.External Links:2506\.07636,[Link](https://arxiv.org/abs/2506.07636)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
- Y\. Wei, Z\. Sun, E\. McMilin, J\. Gehring, D\. Zhang, G\. Synnaeve, D\. Fried, L\. Zhang, and S\. Wang \(2025\)Toward training superintelligent software agents through self\-play swe\-rl\.arXiv preprint arXiv:2512\.18552\.Cited by:[§1](https://arxiv.org/html/2606.18284#S1.p1.1),[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3),[§3\.2](https://arxiv.org/html/2606.18284#S3.SS2.p1.16)\.
- L\. Wolf, R\. Kirk, and M\. Musolesi \(2025\)Reward model overoptimisation in iterated rlhf\.External Links:2505\.18126,[Link](https://arxiv.org/abs/2505.18126)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Xie, E\. Liu, G\. Zhang, N\. Kotalwar, S\. Gandhi, S\. Acharya, X\. Wang, C\. Rose, G\. Neubig, and D\. Fried \(2026\)Hybrid\-gym: training coding agents to generalize across tasks\.arXiv preprint arXiv:2602\.16819\.Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
- J\. Yang, K\. Lieret, C\. E\. Jimenez, A\. Wettig, K\. Khandpur, Y\. Zhang, B\. Hui, O\. Press, L\. Schmidt, and D\. Yang \(2026\)SWE\-smith: scaling data for software engineering agents\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=63iVrXc8cC)Cited by:[§1](https://arxiv.org/html/2606.18284#S1.p2.1),[§13\.1](https://arxiv.org/html/2606.18284#S13.SS1.p1.1),[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px2.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, YuYue, W\. Dai, T\. Fan, G\. Liu, J\. Liu, L\. Liu, X\. Liu, H\. Lin, Z\. Lin, B\. Ma, G\. Sheng, Y\. Tong, C\. Zhang, M\. Zhang, R\. Zhang, W\. Zhang, H\. Zhu, J\. Zhu, J\. Chen, J\. Chen, C\. Wang, H\. Yu, Y\. Song, X\. Wei, H\. Zhou, J\. Liu, W\. Ma, Y\. Zhang, L\. Yan, Y\. Wu, and M\. Wang \(2025\)DAPO: an open\-source LLM reinforcement learning system at scale\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=2a36EMSSTp)Cited by:[§3\.3](https://arxiv.org/html/2606.18284#S3.SS3.SSS0.Px3.p1.11)\.
- A\. Zhang, Y\. Chen, J\. Pan, C\. Zhao, A\. Panda, J\. Li, and H\. He \(2025a\)Reasoning models know when they’re right: probing hidden states for self\-verification\.External Links:2504\.05419,[Link](https://arxiv.org/abs/2504.05419)Cited by:[§1](https://arxiv.org/html/2606.18284#S1.p3.1),[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Zhang, S\. Yu, D\. Chong, A\. Sicilia, M\. R\. Tomz, C\. D\. Manning, and W\. Shi \(2025b\)Verbalized sampling: how to mitigate mode collapse and unlock llm diversity\.External Links:2510\.01171,[Link](https://arxiv.org/abs/2510.01171)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px3.p1.1)\.
- L\. Zhang, J\. Yang, M\. Yang, J\. Yang, M\. Chen, J\. Zhang, Z\. Cui, B\. Hui, and J\. Lin \(2025c\)Swe\-flow: synthesizing software engineering data in a test\-driven manner\.arXiv preprint arXiv:2506\.09003\.Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
- N\. Zhang, W\. Ma, Z\. Ma, J\. Xu, J\. Gao, J\. Hao, R\. He, and J\. Xu \(2026\)Silence the judge: reinforcement learning with self\-verifier via latent geometric clustering\.External Links:2601\.08427,[Link](https://arxiv.org/abs/2601.08427)Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Zhao, Y\. Wu, Y\. Yue, T\. Wu, Q\. Xu, Y\. Yue, M\. Lin, S\. Wang, Q\. Wu, Z\. Zheng, and G\. Huang \(2025\)Absolute zero: reinforced self\-play reasoning with zero data\.External Links:2505\.03335,[Link](https://arxiv.org/abs/2505.03335)Cited by:[§1](https://arxiv.org/html/2606.18284#S1.p1.1),[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3),[§4\.2](https://arxiv.org/html/2606.18284#S4.SS2.SSS0.Px1.p3.2)\.
- Y\. Zhu, A\. Gandhi, and G\. Neubig \(2025\)Training versatile coding agents in synthetic environments\.arXiv preprint arXiv:2512\.12216\.Cited by:[§2](https://arxiv.org/html/2606.18284#S2.SS0.SSS0.Px1.p1.3)\.
## 8SWE RL training results
The selected probe \(Table [11](https://arxiv.org/html/2606.18284#S14.T11)\) drives the generator\-RL run summarized by the SWE block in Table [4](https://arxiv.org/html/2606.18284#S6.T4); full hyperparameters are listed in Appendix [16](https://arxiv.org/html/2606.18284#S16)\. Figure [5](https://arxiv.org/html/2606.18284#S8.F5) plots the per\-step mean across rollouts of the final reward \(validity\-gated probe logit with the invalid\-reward floor for non\-valid bugs\)\. The reward rises from 0\.04 at the first step to 0\.35 at the final training step\.
Figure 5: SWE generator\-RL training curves: per\-step mean of the final reward \(validity\-gated probe logit with an invalid\-reward floor of −0\.2\)\. The reward rises from 0\.04 at the first step to 0\.35 at the final selected checkpoint\.
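The plotted quantity can be sketched as follows, assuming the gate is hard: a valid bug receives the probe logit unchanged and an invalid bug receives the floor of −0\.2\. Whether the logit is clipped or rescaled before use is not stated here, so `gated_reward`, `step_mean_reward`, and the rollout values are illustrative assumptions\.

```python
INVALID_REWARD_FLOOR = -0.2  # floor quoted in the Figure 5 caption


def gated_reward(probe_logit, is_valid, floor=INVALID_REWARD_FLOOR):
    """Return the probe logit for valid bugs and the floor otherwise."""
    return probe_logit if is_valid else floor


def step_mean_reward(rollouts):
    """Mean final reward over the (probe_logit, is_valid) rollouts of one step."""
    rewards = [gated_reward(logit, valid) for logit, valid in rollouts]
    return sum(rewards) / len(rewards)


# Hypothetical rollouts: (probe logit, validity flag).
early_step = [(0.6, True), (0.4, False), (0.1, True), (0.9, False)]
late_step = [(0.6, True), (0.4, True), (0.1, True), (0.9, False)]

early_mean = step_mean_reward(early_step)
late_mean = step_mean_reward(late_step)
```

In this toy example the per\-step mean rises purely because one more rollout becomes valid, which shows that a curve like Figure 5 mixes validity gains with probe\-score gains\.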
## 9Where the SWE OOD gain comes from
The gain of PROPEL over the base generator on the OOD set is \+9\.8 percentage points, roughly twice the in\-distribution gain\. The base OOD frontier rate \(9\.8%\) is also well below the base in\-distribution rate \(21\.8%\)\. Both effects are explained by the difficulty distribution of the bugs the generator introduces on unfamiliar repositories\. Table [5](https://arxiv.org/html/2606.18284#S9.T5) breaks down the per\-bug solver outcome on the ≥2\-valid\-trial subset of each generator\. On the OOD set the base generator produces 72\.1% of valid bugs in the trivially\-solved bucket, versus 61\.4% on the in\-distribution set\. In contrast, PROPEL reduces both trivial\-solve rates \(18\.4% in\-dist\., 19\.6% OOD\) and instead concentrates mass at the “some but not all trials solve” target band that the probe was trained to reward\.
Table 5: Per\-bug solver\-outcome breakdown by arm, restricted to bugs with ≥2 valid solver trials \(the denominator used in the main “Utility” column for SWE\)\. *Trivial* \(3/3 or 2/2\) and *too hard* \(0/2, 0/3\) flank the target band \(1 or 2 solves, but not all, out of ≥2 valid trials\)\.
Diff size and edit shape are unchanged across regimes: in the 3/3 subset of the trivial\-solve bucket, base on the OOD set produces 13\.5\-line diffs \(median 13\) with mean 1\.2 inserted and 1\.2 deleted lines, versus 14\.3\-line diffs \(median 13\) with 1\.4/1\.4 insertions/deletions on the in\-distribution set\. The base generator introduces structurally identical one\-line edits in both regimes; OOD bugs are just solved more often, consistent with shallow edits landing on less\-critical control flow in unfamiliar codebases\. The \+9\.8 pp OOD gain therefore reflects RL teaching the generator to land its single\-line edit on a behavior\-relevant location even when the source repository is novel\.
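The outcome buckets described in the Table 5 caption can be written as a small classifier\. The bucket names and the example \(solves, valid trials\) pairs below are illustrative and do not come from the paper's data\.

```python
from collections import Counter


def outcome_bucket(solves, valid_trials):
    """Bucket one bug by its solver outcome, following the Table 5 caption."""
    if valid_trials < 2:
        return "excluded"  # below the >=2 valid-trial filter
    if solves == valid_trials:
        return "trivial"   # e.g. 2/2 or 3/3
    if solves == 0:
        return "too_hard"  # e.g. 0/2 or 0/3
    return "target"        # some but not all trials solve


# Hypothetical (solves, valid_trials) pairs, for illustration only.
bugs = [(3, 3), (2, 2), (0, 3), (1, 3), (2, 3), (1, 2), (1, 1), (0, 2)]

buckets = Counter(outcome_bucket(s, v) for s, v in bugs)
usable = sum(n for name, n in buckets.items() if name != "excluded")
target_rate = buckets["target"] / usable
```

Here `target_rate` plays the role of the SWE utility rate: the share of usable bugs that land in the target band\.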
## 10 Additional Analysis on SWE
Table 6: Verifier failure surface for base vs. RL-generated bugs. Rows include only usable bugs under the ≥2 valid-solver-trial filter. Failing tests is the number of verifier tests that fail after applying the generated bug patch. Because base and RL are evaluated on the same locked repository splits, the absolute failing-test counts are test-suite dependent but comparable within each split.
Beyond increasing the number of usable bugs, RL shifts the generated-bug distribution toward broader verifier failure surfaces. On the in-distribution RL-Train split, the median number of failing tests per usable bug increases from 1 under the base generator to 5 under the RL generator; the fraction of bugs failing at least 10 tests increases from 11.9% to 28.9%. On the Held-Out Eval split, the median likewise increases from 2 to 5, and the fraction failing at least 10 tests increases from 18.0% to 30.4%.
Table [7](https://arxiv.org/html/2606.18284#S10.T7) shows that mean trajectory length drops from 26.8 to 15.1 generator turns over 30 steps while quality signals rise. Patch-validity climbs from 0.58 to 0.97, bug-validity from 0.32 to 0.79, and the shaped task reward from 0.28 to 0.68. The shift is also visible in the reasons for stopping: clean post-submission terminations (`stop`) grow from 151 to 247 of 256 episodes, while turn-cap exhaustions (`max_turns`) and single-response context overflows (`length`) collapse from 61 and 44 to 2 and 7 respectively.
Table 7: Trajectory length collapses from 26.8 to 15.1 mean turns while reward and bug-validity rise. Stop-reason counts are out of 256 episodes per step: `stop` := clean termination after submission, `max_turns` := hit the 35-turn cap, `length` := single-response context overflow (32k tokens). The episodes shorten because the agent completes tasks faster rather than terminating early: `stop` counts rise while `max_turns` and `length` terminations fall.
## 11 Solve-count distributions
The subfigures in Figure [2](https://arxiv.org/html/2606.18284#S1.F2) report the aggregate 1–3@8 solve rate per (domain, target-solver) cell. The underlying solve-count distribution is shown in Figure [6](https://arxiv.org/html/2606.18284#S11.F6) (PROPEL and PROPEL WCO vs. base policy). The proximal mechanism behind the main lift is the redistribution of mass from the 8/8 bin (saturated) into the 1–3@8 band the probe is trained against.
Figure 6: Solve-count distribution PROPEL vs. base policy per (domain, target-solver) cell and PROPEL WCO for the AZR domain. The shaded band marks the 1–3@8 band the probe is trained against.
## 12 AZR 3B adversarial coevolution
The AZR single\-probe runs exhibit a standard proxy\-optimization failure: the generator learns to produce tasks that score well under the probe, but many of those tasks come from the same semantic family\. Adversarial co\-evolution addresses this by mining the probe’s false positives\. A false positive is a valid task that the probe scores as useful but whose actual target\-solver solve count falls outside the target band\. We treat these tasks as negatives, train an auxiliary probe on them, and fold it back into the next generator\-RL run as a constraint\.
Concretely, positive examples are the original probe-training corpus (valid tasks solved in 1, 2, or 3 of 8 attempts); negatives are the false positives just defined plus valid tasks from the dominant collapsed family discovered after single-probe PROPEL. The original probe continues to supply the utility signal during RL; the auxiliary probe acts as a conservative constraint, replacing the reward on valid tasks with the minimum of the two probe scores.
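The false-positive mining rule and the min-of-two-probes constraint described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `threshold` used to decide that a probe "scores a task as useful", and the assumption that both probe scores live on the same scale are all hypothetical. The −0.2 floor for invalid tasks is taken from the RL configuration table in this appendix.

```python
# Hypothetical sketch of the adversarial-coevolution reward; names are illustrative.
INVALID_REWARD_FLOOR = -0.2   # invalid-reward floor reported in the RL configuration
TARGET_BAND = (1, 2, 3)       # solve counts out of 8 that count as "learnable frontier"


def is_false_positive(is_valid, probe_score, solve_count, threshold=0.5):
    """A valid task the probe rates as useful but whose true solve count is off-band.

    `threshold` is an assumed cut-off; the text does not specify one.
    """
    return is_valid and probe_score >= threshold and solve_count not in TARGET_BAND


def constrained_reward(is_valid, utility_score, auxiliary_score):
    """Reward for one generated task under the conservative constraint.

    Invalid tasks receive the floor; valid tasks receive the minimum of the
    original (utility) probe score and the auxiliary probe score.
    """
    if not is_valid:
        return INVALID_REWARD_FLOOR
    return min(utility_score, auxiliary_score)
```

Taking the minimum means a task is rewarded only when both probes agree it is useful, which is why the auxiliary probe acts as a veto rather than as a second objective.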
Using the auxiliary probe in isolation produces a different failure mode. In a standalone run the generator avoids the original collapsed family but moves to very easy string-manipulation tasks (99.7% validity, 1.8% Utility, 88.6% of generations solved on all 8 target-solver trials, valid-only top-topic share 99.9%). The auxiliary probe is therefore used as a constraint on top of the utility objective rather than as a replacement.
Table [8](https://arxiv.org/html/2606.18284#S12.T8) reports the follow-up under the same Utility, Valid, Self-BLEU-3, Distinct-3, and Top-topic metrics used in the main AZR evaluation. The false-positive variant raises Utility over base (0.1009 → 0.1533) and Valid over base (0.6654 → 0.8398) while keeping the diversity metrics close to base, and recovers most of the unconstrained single-probe run’s Utility (0.1995) without its semantic concentration.
Table 8: AZR 3B adversarial coevolution as a reward-hacking mitigation. Metrics match Table [4](https://arxiv.org/html/2606.18284#S6.T4): Utility is generated-denominator 1–3@8 frontier rate, Valid is sandbox validity, Self-BLEU-3 is lower-is-better repetition, Distinct-3 is higher-is-better trigram diversity, and Top-topic is the share of generations in the most common semantic family. Base and single-probe rows average three seeds. Adversarial rows are a first follow-up sweep: one RL seed per variant.
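To make the "generated-denominator" wording concrete, the sketch below computes a 1–3@8 frontier rate in which invalid generations stay in the denominator. It is an illustration under assumptions: the input layout (a list of `(is_valid, solve_count)` pairs) and the function name are hypothetical.

```python
def utility_rate(generations, band=(1, 2, 3)):
    """Fraction of ALL generations that are valid and solved 1-3 times out of 8.

    `generations` is a list of (is_valid, solve_count) pairs; solve_count may be
    None for invalid tasks. Invalid tasks count against the rate because the
    denominator is every generated task, not only the valid ones.
    """
    if not generations:
        raise ValueError("need at least one generation")
    hits = sum(1 for is_valid, solves in generations if is_valid and solves in band)
    return hits / len(generations)
```

For example, `utility_rate([(True, 2), (True, 8), (False, None), (True, 0)])` counts one hit out of four generations.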
## 13 SWE Data Pipeline
### 13.1 Repository sourcing
All repositories used in this work are drawn from the SWE-smith Python profile registry (Yang et al., [2026](https://arxiv.org/html/2606.18284#bib.bib34)), which provides for each repository a designated buggy commit, a pre-built Docker sandbox image, a default test runner, and the corresponding test framework. From this registry we apply a four-stage filter to obtain a tractable, non-flaky working set:
1. Hard-gate exclusions. We exclude three repositories (`paramiko`, `sqlfluff`, `autograd`) known to cause network/I/O flakiness, prohibitively long test suites, or environment build failures.
2. Preflight quality filter. A pre-pass runs every remaining SWE-smith repository in its sandbox once on the unmodified commit and records build status, test count, baseline test duration, and baseline test failures. Repositories pass if `status = "ok"`, tests > 0, and baseline_failures ≤ 1. The held-out RL-Train and Held-Out Eval selections allow a small number of slower high-coverage repositories to preserve split breadth.
3. Quality ranking. Surviving repositories are ranked by *tests-per-entity* (a coverage proxy) and the top-N are taken, where N is split-specific.
4. Empirical yield filter. After running the bug-generation agent and the K=3 solver labeling pipeline, repositories with no trajectories satisfying our usability criteria (`bug_valid` ≥ 1 and at least one valid solver trial; see Sections [13.3](https://arxiv.org/html/2606.18284#S13.SS3)–[13.4](https://arxiv.org/html/2606.18284#S13.SS4)) are dropped from the labeled dataset.
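The first three stages of the filter can be sketched as a small selection function. This is a hedged illustration only: the `PreflightRecord` fields, the `entities` count used for tests-per-entity, and the function names are assumptions, and the fourth (empirical yield) stage is omitted because it depends on downstream generation and labeling runs.

```python
from dataclasses import dataclass

# Repositories excluded by the hard gate, as listed in the text.
HARD_GATE_EXCLUDED = {"paramiko", "sqlfluff", "autograd"}


@dataclass
class PreflightRecord:
    # Field names are hypothetical; they mirror the quantities the pre-pass records.
    repo: str
    status: str
    tests: int
    baseline_failures: int
    entities: int  # assumed denominator for the tests-per-entity coverage proxy


def passes_preflight(rec):
    """Stage 2: build is ok, at least one test, at most one baseline failure."""
    return rec.status == "ok" and rec.tests > 0 and rec.baseline_failures <= 1


def select_repos(records, top_n):
    """Stages 1-3: hard gate, preflight filter, then rank by tests-per-entity."""
    survivors = [
        r for r in records
        if r.repo not in HARD_GATE_EXCLUDED and passes_preflight(r) and r.entities > 0
    ]
    survivors.sort(key=lambda r: r.tests / r.entities, reverse=True)
    return [r.repo for r in survivors[:top_n]]
```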
### 13.2 Splits
We construct three repo-disjoint splits, summarized in Table [9](https://arxiv.org/html/2606.18284#S13.T9).
Table 9: Labeled-dataset splits. Repos and Eps are attempted split totals; Bugs are usable K=3 rows under two filters (≥1 valid solver trial vs. ≥2 valid solver trials); zero-yield repositories remain in the rollout denominator but contribute no rows. Yield is Bugs (≥2) / Eps. Valid % is trial-level validity (fraction of K·N slots with a non-null reward); Pooled $\bar{r}$ is solved trials / valid trials over non-null slots.
#### Why ≥2 valid trials.
The probe-training filter requires at least two valid solver trials per bug (out of K=3). A single trial gives a noisy solve-rate estimate ($\hat{r} \in \{0, 1\}$, variance 0.25 at true r = 0.5); two trials halve that variance and allow $\hat{r} \in \{0, 0.5, 1\}$. The filter retains 73.2% of Probe-Train rows (1,757/2,399), 78.0% of RL-Train (85/109), and 84.5% of the usable Held-Out Eval rows (49/58), while removing the noisiest single-trial bugs from training. We report main results under both filters but use ≥2 for all probe training and held-out scoring.
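As a worked check of the variance claim, assume the valid trials are independent Bernoulli draws with a common true solve rate $r$ (an idealization that the text does not state explicitly). The estimator $\hat{r}_n$ averaged over $n$ valid trials then satisfies

$$
\mathrm{Var}(\hat{r}_n) = \frac{r(1-r)}{n}, \qquad
\mathrm{Var}(\hat{r}_1)\big|_{r=0.5} = 0.25, \quad
\mathrm{Var}(\hat{r}_2)\big|_{r=0.5} = 0.125, \quad
\mathrm{Var}(\hat{r}_3)\big|_{r=0.5} \approx 0.083 .
$$

Under this assumption, moving from one to two valid trials halves the variance, and a third trial reduces it by a further third.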
The roles align with our experimental protocol: Probe-Train is used for cross-repository cross-validation when training the probe; RL-Train provides cross-repository held-out evaluation of probe quality and is also the repository pool against which the generator is RL-trained; Held-Out Eval is the strict held-out generalization set for the post-RL generator and supplies bugs for downstream solver fine-tuning. Repositories never overlap across splits.
### 13.3 Bug generation (generator)
Bug validity (`bug_valid`) follows the criteria in Section [3](https://arxiv.org/html/2606.18284#S3); the verifier replays the agent’s diff via `git apply --recount` before running the test suite. We run 50 episodes per repository for RL-Train and Held-Out Eval, and up to 280 episodes per repository for the Probe-Train pool. The Probe-Train pool is filtered to `bug_valid = 1` under the validity check. The offline bug-generation episodes for the RL-Train and Held-Out Eval splits use max-turns 25, temperature 0.7, max-tokens 2048, top_p 0.95, and top_k 20; the RL training rollouts themselves use the larger 35-turn cap (Table [12](https://arxiv.org/html/2606.18284#S16.T12)).
We define the end-to-end *yield* of a repository as the fraction of attempted generator episodes that pass the bug-validity gate and yield at least two valid solver trials. Per-repository yields range from 0% to 91%; the dataset-wide yields are 29.8%, 15.5%, and 8.9% for Probe-Train, RL-Train, and the full Held-Out Eval split respectively.
The generator receives the system and instruction prompt shown in Appendix [18](https://arxiv.org/html/2606.18284#S18). The full prompt is bash-only. The `browse_index` reference in the prompt body refers to a small helper script the generator can invoke *from the shell* (e.g. `bash -c "python /workspace/browse_index.py"`), which enables easier repository navigation.
### 13.4 Solver labeling
Each of the K=3 OpenCode solver trials (Section [3](https://arxiv.org/html/2606.18284#S3)) uses three tools (`browse_index`, `show`, `edit`) and runs in a fresh Docker sandbox with a 1,800 s wall-clock budget and 2 CPU / 4 GB / 10 GB resource limits. A trial returns reward ∈ {0, 1} if the verifier ran (the solver submitted a solution patch within the budget), else `null` (timeout, setup failure, or runtime error). The default sampling parameters used for all label runs are T=0.6, top_p=0.95, top_k=20 (the values loaded by vLLM from Qwen3.5-27B’s `generation_config.json`).
## 14 Probe Training and Selection Details
The main probe\-training summary reports the dataset sizes and frozen probes used by RL\. This appendix gives the implementation details behind those choices: activation extraction, the sweep grid, optimizer settings, and the selection rule\.
#### Activation extraction and candidate grid\.
Activations are read from the frozen reference model. For math and AZR, each generated task is a single rendered output and is represented by one pooled activation vector. For SWE, each task is a multi-turn agent trajectory, so we extract one activation vector per generated assistant turn and pool across the trajectory. Math and AZR cover an 8×3×2 = 48-cell grid per (domain, solver). SWE covers a larger 5×9×3×2 = 270-cell grid, with more trajectory poolings, an additional probe architecture, and class-imbalance handling.
Table 10: Probe sweep grid and training configuration. The MLP head uses d → 512 → 128 → 2 with dropout 0.3. SWE additionally sweeps trajectory-level pooling, a third (deep-MLP) architecture, and balance strategy, giving the larger 270-cell grid.

| | Math / AZR | SWE |
|---|---|---|
| Reference model | `Qwen/Qwen3.5-4B` | `Qwen/Qwen3.5-27B` |
| Layers L | {3, 7, 11, 15, 19, 23, 27, 31} | {5, 15, 31, 47, 63} |
| Poolings | `last_token`, `mean_full`, `mean_last_50` | `last_token`, `mean_full`, `mean_last_{50,5,3}`, `mean_last_half`, `max`, `first_last_concat`, `mean_std_concat` |
| Architectures | linear, MLP | linear, MLP, deep MLP |
| Balance | downsample | downsample *or* weighted loss |
| Split | 80/10/10 train/val/test | 5-fold cross-repository |
| Max sequence length | 2,048 (math), 4,096 (AZR) | 32,768 |
| Cells per sweep | 48 | 270 |

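The cell counts can be checked by enumerating the grid axes directly. The sketch below is illustrative: the axis values are copied from Table 10, while the dictionary layout, the lowercase architecture and balance labels, and the `enumerate_cells` helper are assumptions made for the example.

```python
from itertools import product

# Axis values transcribed from Table 10; the dict layout is illustrative.
MATH_AZR_GRID = {
    "layer": [3, 7, 11, 15, 19, 23, 27, 31],
    "pooling": ["last_token", "mean_full", "mean_last_50"],
    "arch": ["linear", "mlp"],
}

SWE_GRID = {
    "layer": [5, 15, 31, 47, 63],
    "pooling": [
        "last_token", "mean_full", "mean_last_50", "mean_last_5", "mean_last_3",
        "mean_last_half", "max", "first_last_concat", "mean_std_concat",
    ],
    "arch": ["linear", "mlp", "deep_mlp"],
    "balance": ["downsample", "weighted_loss"],
}


def enumerate_cells(grid):
    """Return one config dict per cell in the Cartesian product of the axes."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


math_cells = enumerate_cells(MATH_AZR_GRID)  # 8 * 3 * 2 cells
swe_cells = enumerate_cells(SWE_GRID)        # 5 * 9 * 3 * 2 cells
```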
#### Training\.
All probe heads are trained with AdamW at learning rate $10^{-3}$, weight decay $10^{-4}$, batch size 64, and up to 50 epochs with early-stopping patience 7. Math/AZR probes are fit on downsampled balanced datasets. SWE keeps all Probe-Train rows under the weighted-loss setting because the positive class is sparse.
#### Selection\.
For each (domain, target solver), we first rank sweep cells by held-out balanced accuracy subject to calibration. For Math/AZR, we then select the probe by reward variance under policy (RVP) on a fixed sample of n=512 base-policy completions. This two-stage rule reflects the empirical observation that an accurate but nearly constant on-policy probe gives little useful gradient signal during RL, while a slightly less accurate probe with higher RVP can produce a stronger downstream utility lift.
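One way to read the two-stage rule is sketched below. It is an assumption-laden illustration rather than the authors' procedure: RVP is taken here to be the plain population variance of probe rewards over the fixed sample, calibration is reduced to a boolean flag, and `shortlist_size` is a hypothetical parameter that the text does not specify.

```python
from statistics import pvariance


def reward_variance_under_policy(probe_rewards):
    """Population variance of probe rewards on a fixed sample of base-policy outputs."""
    if len(probe_rewards) < 2:
        raise ValueError("need at least two rewards to measure variance")
    return pvariance(probe_rewards)


def select_probe(candidates, shortlist_size=5):
    """Two-stage selection: shortlist by balanced accuracy, then pick the highest RVP.

    Each candidate is a dict with keys "name", "balanced_accuracy",
    "calibrated" (bool), and "rewards" (probe rewards on the fixed sample).
    """
    eligible = [c for c in candidates if c["calibrated"]]
    if not eligible:
        raise ValueError("no calibrated candidates")
    shortlist = sorted(
        eligible, key=lambda c: c["balanced_accuracy"], reverse=True
    )[:shortlist_size]
    return max(shortlist, key=lambda c: reward_variance_under_policy(c["rewards"]))
```

With this rule, a probe whose on-policy reward is nearly constant loses to a slightly less accurate probe whose reward actually varies across the sample.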
For SWE, the random 80/10/10 split is replaced by 5-fold cross-repository validation. Each fold holds out entire repositories, testing whether the probe transfers beyond repositories seen during probe training rather than merely to unseen bugs from familiar repositories. The selected SWE probe is the configuration with the best cross-fold balanced accuracy under weighted-loss training, averaged over 3 random seeds. On the cross-repository held-out fold, the Spearman correlation between probe score and ground-truth solver solve-rate is ρ = 0.418.
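A minimal sketch of a cross-repository split follows. The round-robin assignment of repositories to folds and the `(repo, item)` row layout are assumptions made for illustration; the text only states that each fold holds out entire repositories.

```python
def cross_repo_folds(rows, n_folds=5):
    """Split rows into folds such that no repository appears in both train and test.

    `rows` is a list of (repo, item) pairs. Repositories are assigned to folds
    round-robin in sorted order (an assumed, deterministic scheme). Returns a
    list of (train_indices, test_indices), one pair per fold.
    """
    repos = sorted({repo for repo, _ in rows})
    if len(repos) < n_folds:
        raise ValueError("need at least as many repositories as folds")
    fold_of = {repo: i % n_folds for i, repo in enumerate(repos)}
    folds = []
    for k in range(n_folds):
        test = [i for i, (repo, _) in enumerate(rows) if fold_of[repo] == k]
        train = [i for i, (repo, _) in enumerate(rows) if fold_of[repo] != k]
        folds.append((train, test))
    return folds
```

Grouping by repository rather than by bug is what makes the held-out score a measure of transfer to unseen codebases.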
### 14.1 Probe-Sweep Results AZR/Math
Figures [7](https://arxiv.org/html/2606.18284#S14.F7) and [8](https://arxiv.org/html/2606.18284#S14.F8) render the full 8×3×2 sweep grids as validation-balanced-accuracy heatmaps for the Math and AZR domains under the 1–3/8 label scheme, faceted by architecture (rows) and target-solver size (columns).
Figure 7: Probe-sweep validation balanced accuracy heatmap on the Math domain (1-3@8 label scheme), faceted by architecture (rows) and solver size (columns).
Figure 8: Probe-sweep validation balanced accuracy heatmap on the AZR domain (1-3@8 label scheme). Same layout as Figure [7](https://arxiv.org/html/2606.18284#S14.F7).
### 14.2 Probe Sweep Results SWE
We run a complete probe-sweep grid for the SWE domain on the Probe-Train split (Appendix [13](https://arxiv.org/html/2606.18284#S13)), training each cell three times with seeds {0, 1, 2} and averaging.
#### Target\.
The probe is trained on the solver utility signal $U_S$ (Section [3.2](https://arxiv.org/html/2606.18284#S3.SS2)) with S = `Qwen3.5-27B` at K=3 solver trials, restricted to bugs with ≥2 valid solver trials: positives are bugs whose realized solve-rate is in (0, 1) (i.e. exactly 1/3, 2/3, or 1/2 when one of three solver trials is null); negatives are bugs that either fail every valid trial (0/2, 0/3) or succeed on every valid trial (2/2, 3/3). On Probe-Train this gives 362 positives and 1,395 negatives (N = 1,757 across 45 repositories).
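The labeling rule above can be written as a short function over per-trial outcomes. This is a sketch under assumptions: trial outcomes are represented as `1`, `0`, or `None` (for a null trial), and the function name and `min_valid` parameter are hypothetical.

```python
from typing import Optional, Sequence


def frontier_label(trials: Sequence[Optional[int]], min_valid: int = 2) -> Optional[int]:
    """Binary probe target for one bug from its solver-trial outcomes.

    Each trial is 1 (solved), 0 (not solved), or None (null: timeout, setup
    failure, or runtime error). Returns None when fewer than `min_valid`
    trials are valid, 1 when some but not all valid trials solved the bug,
    and 0 when the bug was solved on every valid trial or on none.
    """
    valid = [t for t in trials if t is not None]
    if len(valid) < min_valid:
        return None
    solved = sum(valid)
    return 1 if 0 < solved < len(valid) else 0
```

For instance, `[1, 0, None]` (solve rate 1/2) is a positive, while `[1, 1, None]` (2/2) and `[0, 0, 0]` (0/3) are negatives, and `[1, None, None]` is dropped by the ≥2 filter.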
#### Grid\.
The full 270-cell SWE sweep (5×9×3×2: layer × trajectory pooling × architecture × class-imbalance strategy) and the shared training hyperparameters are listed in Table [10](https://arxiv.org/html/2606.18284#S14.T10). Epoch selection per cell is by cross-validated balanced accuracy on the 5-fold cross-repository split.
#### Selected configuration\.
Selecting on cross-repo balanced accuracy under weighted-loss training yields the configuration in Table [11](https://arxiv.org/html/2606.18284#S14.T11).
Table 11: Selected SWE probe configuration, averaged over 3 random seeds. Held-out Spearman ρ is computed on the cross-repo held-out fold (Probe-Train internal CV); CV metrics are mean across the 5 cross-repo folds. The held-out ρ is high-variance per seed.
## 15 Probe Selection: RVP Validation Panel
Section [3](https://arxiv.org/html/2606.18284#S3) motivates RVP as a complementary diagnostic to held-out balanced accuracy and calibration, on the grounds that a probe with near-constant on-policy reward carries no useful gradient during RL. We report a 12-probe panel (6 Math + 6 AZR across two target-solver sizes), pairing each probe with a matched short-RL run. RVP is computed once before RL on a fixed sample from $\pi_{\mathrm{ref}}$. Pooled, RVP and the realized short-RL reward gain are positively associated (Pearson r = 0.735, p = 0.006; Figure [9](https://arxiv.org/html/2606.18284#S15.F9)), but the effect is largely cross-domain: within Math the trend is clean (r = 0.95, n = 6); within AZR the candidates cluster in a narrow high-RVP band (0.11–0.14) where RVP no longer discriminates (r = −0.20, n = 6). We therefore use RVP as a coarse sanity check against very low on-policy variance rather than a fine within-domain selector, alongside balanced accuracy and calibration when selecting each probe.
Figure 9: RVP vs short-RL reward gain across the 12-probe panel. Each point is one probe; color marks the domain. Dashed line is the OLS fit (pooled Pearson r = 0.735, p = 0.006, n = 12).
## 16 RL Hyperparameters
Table [12](https://arxiv.org/html/2606.18284#S16.T12) provides the generator-RL hyperparameters for all three domains. Math and AZR share a single training recipe launched with `accelerate` on 4 GPUs per run; SWE uses the SkyRL stack across 3 nodes (24 H100s total) with FSDP. The KL coefficient β, group size G, and selected checkpoint were screened on a per-cell grid using validity, diversity (Self-BLEU-3, Distinct-3, top-topic), and collapse alerts. The KL and reward-composition screening sweeps are reported in Section [6](https://arxiv.org/html/2606.18284#S6) (Figures [4(a)](https://arxiv.org/html/2606.18284#S6.F4.sf1) and [4(b)](https://arxiv.org/html/2606.18284#S6.F4.sf2)).
Table 12: Generator-RL configuration for the confirmatory multi-seed runs. The same training recipe is used for both standard PROPEL ($R_{\text{hard}}$) and PROPEL with worst-case ensemble ($R_{\text{WCO}}$) variants; only the reward gate differs.

| | Math / AZR | SWE |
|---|---|---|
| *Generator* | | |
| Base model | Qwen3.5-4B | Qwen3.5-27B |
| Trainable parameters | LoRA adapters | full fine-tune |
| LoRA rank r | 16 | n/a |
| LoRA α | 32 | n/a |
| LoRA dropout | 0.05 | n/a |
| LoRA target modules | all 7 attn./MLP projections | n/a |
| Mixed precision | bf16 | bf16 |
| *Optimizer* | | |
| Algorithm | GRPO (Shao et al., [2024](https://arxiv.org/html/2606.18284#bib.bib142)) | DAPO (SkyRL) |
| Optimizer | AdamW | AdamW |
| Learning rate | $5 \times 10^{-5}$ | $1 \times 10^{-6}$ |
| Weight decay | 0.0 (default) | 0.01 |
| Adam ($\beta_1$, $\beta_2$) | (0.9, 0.999) | (0.9, 0.999) |
| Max gradient norm | 1.0 | 1.0 |
| *Sampling & rollouts* | | |
| Temperature | 0.9 | 0.7 |
| Top-p | 0.95 | 0.95 |
| Top-k | n/a | 20 |
| Group size G (rollouts/prompt) | 4 | 16 |
| Max generation length | 512 (Math) / 1024 (AZR) | 2048 tokens / turn |
| Max input/context length | 4096 tokens | 32,768 tokens |
| Max multi-turn rollout depth | 1 (single-shot) | 35 turns |
| *GRPO loop* | | |
| Per-device train batch (generations) | 8 | 16 |
| Gradient accumulation steps | 8 | 1 |
| Effective rollouts per step | 8 × 4 × 8 = 256 | 16 × 16 = 256 |
| PPO inner iterations | 1 (on-policy) | 1 (on-policy) |
| *Advantage & policy-loss* | | |
| Advantage estimator | TRL `GRPOTrainer` | `advantage_estimator: loop` |
| Policy loss type | TRL default | Dr.GRPO-style constant-length scaling |
| Importance-ratio clip ε | TRL default | dual-clip, $\varepsilon_{\text{low}} = 0.20$, $\varepsilon_{\text{high}} = 0.28$ |
| KL anchor in loss | yes ($\beta \cdot \mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}]$) | off (`use_kl_loss=false`) |
| KL anchor in reward | no | off (`use_kl_in_reward=false`) |
| KL coefficient β | 0.05 (AZR), 0.10 (Math) | n/a |
| *Reward shaping* | | |
| Reward composition | $R_{\text{hard}}$ / $R_{\text{WCO}}$ | $R_{\text{hard}}$ |
| Invalid-reward floor $r_{\text{bad}}$ | −0.2 | −0.2 |
| Validity gate $\mathcal{W}(x)$ | sandbox-runnable / oracle (math) | verifier (`bug_valid` ≥ threshold) |
| Bug-valid threshold (SWE) | n/a | 1.0 |
| Context-overflow reward | n/a | 0.0 |
| Probe gate behavior | always | gated on `bug_valid` |
| *Schedule & selection* | | |
| Total GRPO steps | 50 | 30 |
| Checkpoint save interval | every 10 steps | every 5 steps |
| *Compute* | | |
| Probe inference workers | co-located with policy | 4 Ray actors × 2 GPUs each |
| Max sequence length | 4,096 tokens | 36,864 tokens |
| Parallelism | accelerate DDP, 4 GPUs | FSDP, 3 nodes × 8 H100s |
| Wall-clock per run | ≈ 6 h on 4× H100 | ≈ 14.5 h on 24× H100 |

## 17 Qualitative examples: base vs RL on AZR
To make the frontier-rate gain in the main results concrete we surface four representative AZR-induction tasks from the confirmatory multi-seed runs underlying Table [4](https://arxiv.org/html/2606.18284#S6.T4): two from the base generator that illustrate the failure modes the probe reward is shaping *against* (one with 0 of 8 hidden tests passed, i.e. the generated function does not match the natural-language message and the solver cannot recover it; one with 8 of 8, a textbook “square the input” that the solver saturates trivially), and two from the PROPEL-trained generator that land in the 1–3@8 learnable-frontier band the probe targets. Each example is the shortest valid sample in its solve-count bucket, drawn from the base evaluation run and the PROPEL checkpoint at step 30.
#### Base policy: 0 of 8 solved (unsolvable / mis-specified).
`AZR / 3B target / Base policy / eval seed 12101 / sample 946`
#### Base policy: 8 of 8 solved (trivial / saturated).
`AZR / 3B target / Base policy / eval seed 12101 / sample 278`
#### PROPEL: 2 of 8 solved (learnable frontier, 1-3@8 band).
`AZR / 3B target / PROPEL (step 30) / RL seed 9101 / sample 867`
#### PROPEL: 2 of 8 solved (learnable frontier, 1-3@8 band).
`AZR / 3B target / PROPEL (step 30) / RL seed 9101 / sample 917`
## 18 Generator prompts
### 18.1 Math domain
Math task generator: system + user template
### 18.2 AZR function induction domain
AZR/induction task generator: system + user template
### 18.3 SWE domain
SWE generator: system prompt + instruction