PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

arXiv cs.AI 05/19/26, 04:00 AM Papers
llm self-play reinforcement-learning lora population-based-training reasoning
Summary
PopuLoRA introduces a population-based asymmetric self-play framework for RLVR post-training of LLMs, where teacher and student LoRA adapters co-evolve to generate increasingly complex problems, overcoming the self-calibration limitation of single-agent self-play.
arXiv:2605.16727v1 Announce Type: new
Cached at: 05/19/26, 06:35 AM
# PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play
Source: [https://arxiv.org/html/2605.16727](https://arxiv.org/html/2605.16727)
Geoffrey Bradway, Lorenz Wolf, Maxwill Lin, Augustine N\. Mavor\-Parker, Matthew James Sargent \(Vmax\)

\(May 16, 2026\)

###### Abstract

We introduce PopuLoRA, a population\-based asymmetric self\-play framework for reinforcement learning with verifiable rewards \(RLVR\) post\-training of LLMs\. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, matched students solve them under a programmatic verifier, and cross\-evaluation between sub\-populations replaces the self\-calibration that limits single\-agent self\-play\. A family of LoRA weight\-space evolution operators \(mutations and crossovers that produce same\-rank population members in seconds\) serves as the replacement step of a population\-based training loop at 7B scale\. We instantiate PopuLoRA on top of Absolute Zero Reasoner and compare it against a per\-adapter compute\-matched single\-agent baseline\. Where the single agent self\-calibrates to generating easy problems it can reliably solve, the population enters a co\-evolutionary arms race: teachers produce increasingly complex problems, student solve rates oscillate, and problem\-space coverage keeps expanding throughout training\. Despite lower training\-time reward, the *population mean* outperforms the baseline on three code benchmarks \(HumanEval\+, MBPP\+, LiveCodeBench\) and seven math benchmarks \(AIME 24/25, AMC 23, MATH\-500, Minerva, GSM8K, OlympiadBench\), and even the weakest member of the population beats the baseline on aggregate\.


## 1 Introduction

RL post\-training has become a dominant regime for specialising large language models, from RLHF \(Ouyang et al\., [2022](https://arxiv.org/html/2605.16727#bib.bib7)\) and DPO on human preference data through self\-play fine\-tuning \(Chen et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib38)\) and large\-scale reinforcement learning with verifiable rewards \(RLVR\) \(Guo et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib39); Lambert et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib40)\)\. The optimisation machinery has matured, but the supply of *problems* has not: most current systems depend on a hand\-curated task distribution whose scope, difficulty, and coverage must be chosen in advance\. The open question we study in this paper is how to generate the *curriculum itself*, the stream of problems the policy is trained on, without relying on human\-authored datasets, using only a programmatic verifier as the external signal\.

The most direct approach lets a single model propose its own problems and grade itself through the verifier, as in Absolute Zero Reasoner \(Zhao et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib28)\) or, more broadly, single\-model self\-play \(Chen et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib38); Kuba et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib46)\)\. The shared flaw is that the same network generates problems and \(implicitly, through its solve rate or judgement\) estimates their difficulty\. In this paper we provide empirical evidence that such single\-agent self\-generation *self\-calibrates*: the proposer converges to generating problems it can consistently produce in valid format and consistently solve, and the training distribution collapses onto a narrow band long before the base model’s capability is exhausted\. The fix is structural: make the judge a *different agent* from the proposer\. A growing line of work explores this asymmetry \(Chen et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib41); Jana et al\., [2026](https://arxiv.org/html/2605.16727#bib.bib42); Liu et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib43); Sundaram et al\., [2026](https://arxiv.org/html/2605.16727#bib.bib48); Duan et al\., [2026](https://arxiv.org/html/2605.16727#bib.bib44); Tan et al\., [2026](https://arxiv.org/html/2605.16727#bib.bib45); Ye et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib59); Huang et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib60)\), but all are scalar in agent count \($\leq 3$ agents total\), whether the roles share parameters or train separate networks\. We take the asymmetry further, building on the teacher\-proposes / student\-solves structure introduced by Sukhbaatar et al\. \([2018](https://arxiv.org/html/2605.16727#bib.bib2)\), and replace the single agent with co\-evolving *populations* of specialised teachers and students\.
Teachers are rewarded for producing problems that are hard for the particular student they face, students are rewarded by the verifier, and matchmaking across the two sub\-populations turns difficulty into a population\-level signal rather than a self\-estimate\. Population dynamics add a second layer of exploration on top of gradient updates: lineages diverge, members specialise, and a population\-based training \(PBT\) \(Jaderberg et al\., [2017](https://arxiv.org/html/2605.16727#bib.bib29)\) replacement step recombines what the gradient path has already discovered\.

Running many independent full\-parameter 7B models on a single node is expensive and memory\-constrained, particularly when each member must support both rollout inference and gradient updates in a shared training loop\. We therefore instantiate every population member as a LoRA adapter \(Hu et al\., [2022](https://arxiv.org/html/2605.16727#bib.bib30)\) over a shared frozen base, which collapses the population’s memory footprint to the sum of its adapter weights rather than a full copy of the base per member\. Classical PBT \(Jaderberg et al\., [2017](https://arxiv.org/html/2605.16727#bib.bib29)\) mutates members by copying one agent’s full weights onto another and perturbing; at 7B scale that copy is itself expensive\. Our second contribution is a set of LoRA weight\-space evolution operators \(SVD\-structured, layer\-selective, and component\-masking mutations together with DARE \(Yu et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib31)\), TIES\-inspired \(Yadav et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib32)\), and task\-arithmetic \(Ilharco et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib33)\) crossovers, in the spirit of evolutionary model merging \(Akiba et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib50)\)\) that produce same\-rank children in seconds without any retraining\. They serve as the replacement step of an *online* PBT loop, enabling a population\-of\-adapters regime that is inaccessible to prior LoRA composition work \(Huang et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib57); Buehler and Buehler, [2024](https://arxiv.org/html/2605.16727#bib.bib58); Feng et al\., [2025b](https://arxiv.org/html/2605.16727#bib.bib55); Zhang et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib51)\)\. We evaluate in the Absolute Zero code\-reasoning setting, with a sandboxed Python executor as the verifier\.

Contributions\.

1. Population\-based asymmetric self\-play for RLVR\. Prior work \(Chen et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib41); Jana et al\., [2026](https://arxiv.org/html/2605.16727#bib.bib42); Liu et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib43); Sundaram et al\., [2026](https://arxiv.org/html/2605.16727#bib.bib48); Duan et al\., [2026](https://arxiv.org/html/2605.16727#bib.bib44); Tan et al\., [2026](https://arxiv.org/html/2605.16727#bib.bib45); Ye et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib59); Huang et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib60); Kuba et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib46)\) is scalar in agent count \($\leq 3$ agents\)\. PopuLoRA replaces these with populations of teacher and student LoRA adapters coupled by TrueSkill\-weighted cross\-evaluation\.
2. LoRA weight\-space evolution operators as the PBT replacement step\. Mutations and crossovers produce rank\-matched children in seconds; closest is evolutionary model merging \(Akiba et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib50)\), applied *offline* rather than inside an online PBT loop\.
3. Empirical validation on held\-out code and math benchmarks\. The population outperforms a per\-adapter compute\-matched single\-agent baseline; diagnostics confirm it avoids the mode\-collapse we observe in the baseline\.

## 2 Background and Related Work

#### RLVR and self\-generated curricula\.

RLVR \(Guo et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib39)\) replaces learned preference models with programmatic checkers\. AZR \(Zhao et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib28)\) takes this to its logical extreme: a single code LLM both proposes and solves its own problems, rewarded only by a sandboxed executor, with no external dataset\. Adjacent methods \(STaR \(Zelikman et al\., [2022](https://arxiv.org/html/2605.16727#bib.bib65)\), rStar\-Math \(Guan et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib64)\), Self\-Rewarding \(Yuan et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib63)\)\) also train on self\-generated data but rely on fixed problem sets or learned reward signals; only AZR treats the model as both proposer and verifier\-checked solver\. We evaluate PopuLoRA on the three code\-reasoning task types AZR defines: code\_i \(infer\-input\), code\_o \(infer\-output\), and code\_f \(infer\-function\), all verified mechanically by the executor\.

#### Self\-play and asymmetric roles\.

Asymmetric self\-play \(Sukhbaatar et al\., [2018](https://arxiv.org/html/2605.16727#bib.bib2)\), where one agent proposes tasks and another solves them, is the structural ancestor of PopuLoRA’s teacher–student loop\. In the unsupervised environment design \(UED\) literature, PAIRED \(Dennis et al\., [2020](https://arxiv.org/html/2605.16727#bib.bib10)\) formalises regret\-based adversarial curriculum generation, POET \(Wang et al\., [2019](https://arxiv.org/html/2605.16727#bib.bib11)\) co\-evolves environments and agents, and ACCEL \(Parker\-Holder et al\., [2022](https://arxiv.org/html/2605.16727#bib.bib12)\) adds mutation operators on level structure; emergent autocurricula arise in multi\-agent competition \(Baker et al\., [2020](https://arxiv.org/html/2605.16727#bib.bib13)\)\. In the LLM setting, a growing line of work separates proposer and solver: SPIN \(Chen et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib38)\), Language Self\-Play \(Kuba et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib46)\), SOAR \(Sundaram et al\., [2026](https://arxiv.org/html/2605.16727#bib.bib48)\), R\-Zero \(Huang et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib60)\), ALIVE \(Duan et al\., [2026](https://arxiv.org/html/2605.16727#bib.bib44)\), TriPlay\-RL \(Tan et al\., [2026](https://arxiv.org/html/2605.16727#bib.bib45)\), and others \(Appendix [6](https://arxiv.org/html/2605.16727#S6)\)\. All are scalar in agent count \($\leq 3$ agents total\)\. PopuLoRA differs in three ways: \(i\) we train *populations* of teachers and students rather than a single pair, \(ii\) updates are joint and on\-policy rather than alternating, and \(iii\) the difficulty signal comes from cross\-evaluation across the population rather than from a fixed target solve\-rate band\.

#### Population\-based training and LoRA evolution\.

Classical PBT \(Jaderberg et al\., [2017](https://arxiv.org/html/2605.16727#bib.bib29)\) copies and perturbs full agent weights; at 7B scale the per\-member footprint and the copy\-and\-perturb cost both become bottlenecks\. Recent work lifts evolution into adapter space \(GENOME \(Zhang et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib51)\), EGGROLL \(Sarkar et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib52)\), ESSA \(Korotyshova et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib53)\), ES\-at\-Scale \(Qiu et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib54)\)\) but optimises against a *fixed* fitness function\. PopuLoRA is orthogonal: evolution serves as the replacement step of an RLVR self\-play loop, where the fitness signal is produced by the population itself through cross\-evaluation\. We embed LoRA merge operators \(DARE \(Yu et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib31)\), TIES\-inspired \(Yadav et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib32)\), task arithmetic \(Ilharco et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib33)\), plus SVD\-structured mutations\) inside this *online* loop, so children re\-enter gradient training immediately after recombination, unlike offline merging \(Akiba et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib50); Feng et al\., [2025b](https://arxiv.org/html/2605.16727#bib.bib55); Huang et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib57)\)\. Further comparisons are in Appendix [6](https://arxiv.org/html/2605.16727#S6)\.

## 3 PopuLoRA

![Refer to caption](https://arxiv.org/html/2605.16727v1/x1.png)

Figure 1: One PopuLoRA iteration\. Matched teacher–student pairs generate and solve under a sandboxed verifier; the student’s failure rate is the teacher’s reward; every $k$ steps, LoRA evolution replaces the weakest members\.

PopuLoRA keeps AZR’s core self\-play loop but replaces the single agent with a population of specialised LoRA adapters and replaces self\-calibrated difficulty with cross\-evaluation between matched members\. Figure [1](https://arxiv.org/html/2605.16727#S3.F1) shows the resulting training step; the rest of this section makes each component concrete\. Section [3\.1](https://arxiv.org/html/2605.16727#S3.SS1) fixes the RLVR objective and reward functions; Section [3\.2](https://arxiv.org/html/2605.16727#S3.SS2) describes how the population is parameterised; Section [3\.3](https://arxiv.org/html/2605.16727#S3.SS3) walks through one training step; and Section [3\.4](https://arxiv.org/html/2605.16727#S3.SS4) specifies the LoRA weight\-space evolution operators that serve as PopuLoRA’s PBT replacement step\.

### 3\.1 Objective and reward

Each population member optimises its policy by RLVR: a verifier $V$ \(a sandboxed Python executor plus a format checker\) emits a scalar reward on every rollout, with no learned reward model in the loop\. For a student rollout $\tau$ on a problem $p$ proposed by its matched teacher,

$$R_{\text{stu}}(\tau) \;=\; \begin{cases} +1 & \text{if } V(\tau,p)=\text{correct,}\\ -0.5 & \text{if } V(\tau,p)=\text{incorrect but well-formed,}\\ -1 & \text{if the response fails format.} \end{cases} \tag{1}$$

For a teacher\-proposed problem $p$ subsequently attempted by student $s$,

$$R_{\text{tea}}(p) \;=\; \begin{cases} -1 & \text{if } p \text{ fails to parse, execute, or is non-deterministic,}\\ 0 & \text{if } \rho(t,s,p)=0,\\ 1-\rho(t,s,p) & \text{otherwise,} \end{cases} \tag{2}$$

where $\rho(t,s,p)$ is the fraction of student $s$’s rollout samples that solve the problem $p$ proposed by teacher $t$\. The zero\-reward case, which applies when none of the student’s rollouts solves the problem, prevents teachers from being rewarded for generating impossible or degenerate problems\. The key structural change relative to single\-agent AZR is exactly in this equation: the teacher’s reward depends on the *matched* student’s failure rate, not on the proposer’s own solve rate, so difficulty is an inter\-population quantity rather than a self\-estimate\. Advantages are estimated with REINFORCE\+\+\-baseline \(Hu et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib6); Williams, [1992](https://arxiv.org/html/2605.16727#bib.bib5)\), that is, per\-prompt centring followed by global whitening across the batch, in the critic\-free GRPO \(Shao et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib3)\) family descended from PPO \(Schulman et al\., [2017](https://arxiv.org/html/2605.16727#bib.bib4)\), without a value network and without a KL penalty to a reference model \($\beta_{\text{KL}} = 0$\)\. Every policy update pools all three AZR problem types \(code\_i, code\_o, code\_f\) into a single mixed\-type batch per member per step, matching the single\-agent baseline’s pooling\.
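The two reward functions and the advantage step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the boolean inputs standing in for the verifier's verdict, and the `eps` stabiliser are assumptions, and the advantage routine simply follows the description of per-prompt centring followed by batch-wide whitening.

```python
from statistics import mean, pstdev


def student_reward(well_formed: bool, correct: bool) -> float:
    """Eq. (1): +1 if correct, -0.5 if well-formed but wrong, -1 on format failure."""
    if not well_formed:
        return -1.0
    return 1.0 if correct else -0.5


def teacher_reward(valid: bool, solve_rate: float) -> float:
    """Eq. (2): -1 for an invalid problem, 0 if no rollout solves it, else 1 - rho."""
    if not valid:
        return -1.0
    if solve_rate == 0.0:
        return 0.0  # unsolved problems earn nothing, so impossible tasks are not rewarded
    return 1.0 - solve_rate


def centred_whitened_advantages(rewards_per_prompt, eps=1e-8):
    """Subtract each prompt's mean reward, then whiten over the whole batch.

    `rewards_per_prompt` is a list of lists: one inner list of rollout rewards
    per prompt. `eps` is an assumed numerical stabiliser.
    """
    centred = [[r - mean(group) for r in group] for group in rewards_per_prompt]
    flat = [a for group in centred for a in group]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(a - mu) / (sigma + eps) for a in group] for group in centred]
```

For example, if the matched student solves 2 of 8 rollouts, `teacher_reward(True, 2 / 8)` evaluates $1 - 0.25$; a problem the student always solves and a problem it never solves both earn the teacher zero, so the reward peaks on problems that are hard but still solvable.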

### 3\.2 Architecture

The population consists of $N_T$ teacher and $N_S$ student LoRA adapters attached to a single shared frozen code\-LLM base\. Every adapter has the same rank $r$ and attaches to the same set of projection matrices; only the adapter weights are updated, while the base model remains frozen\. This layout has two immediate consequences\. First, memory cost scales with the sum of adapter sizes rather than the sum of full\-parameter base copies, which is what makes multi\-adapter populations affordable on commodity hardware: a base that is tens of gigabytes serves arbitrarily many adapters, each costing only tens of megabytes\. Second, the vLLM \(Kwon et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib16)\) multi\-LoRA scheduler \(Sheng et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib17)\) can dispatch each request to the correct adapter by tag inside a shared forward pass, so matched teachers and students generate and solve in the same batch without any per\-request base\-model swap, and swapping an adapter in or out of the rollout engine moves only the adapter $\Delta W$, not the base\. Concrete values of $N_T$, $N_S$, $r$, and the base model used in our experiments are specified in § [4\.1](https://arxiv.org/html/2605.16727#S4.SS1)\.
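The "tens of megabytes" figure can be checked with a rough parameter count. The sketch below counts the weights of one adapter at rank 32 on the four attention projections (the configuration given in § 4.1). The layer count and projection shapes are assumptions about a Qwen2.5-7B-style decoder (hidden size 3584, grouped-query key/value width 512, 28 layers) and are not stated in the text; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(projection_shapes, rank, num_layers):
    """Each adapted (d_in, d_out) projection adds rank * (d_in + d_out) weights."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projection_shapes)
    return per_layer * num_layers


# Assumed architecture constants (not taken from the paper).
HIDDEN, KV_DIM, LAYERS = 3584, 512, 28
ATTN_PROJECTIONS = [
    (HIDDEN, HIDDEN),  # W_q
    (HIDDEN, KV_DIM),  # W_k
    (HIDDEN, KV_DIM),  # W_v
    (HIDDEN, HIDDEN),  # W_o
]

params = lora_param_count(ATTN_PROJECTIONS, rank=32, num_layers=LAYERS)
adapter_megabytes = params * 2 / 1e6  # 2 bytes per weight in bf16
print(f"{params:,} adapter parameters, about {adapter_megabytes:.1f} MB in bf16")
```

Under these assumed shapes one adapter holds roughly 20 million parameters, or about 40 MB in bf16, so a population of eight adapters adds a few hundred megabytes on top of a single copy of the base.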

### 3\.3 Training Step

Each training step proceeds in five phases\.

1. Matchmaking: each teacher is paired with one student via prioritised fictitious self\-play \(PFSP\) \(Vinyals et al\., [2019](https://arxiv.org/html/2605.16727#bib.bib1)\) over TrueSkill \(Herbrich et al\., [2006](https://arxiv.org/html/2605.16727#bib.bib67)\) ratings, concentrating pairings on informative near\-balanced matchups \(Appendix [6](https://arxiv.org/html/2605.16727#S6)\)\. PopuLoRA adapts PFSP from game\-playing agents to problem\-proposing teachers: opponents are teachers generating problems rather than policies playing the same game\.
2. Teacher generation: each teacher proposes a batch of AZR code problems \(split across code\_i / code\_o / code\_f\), validated by the sandboxed executor\.
3. Student solve: the matched student attempts all valid problems under rollout; the executor produces a per\-prompt binary solve vector\.
4. Cross\-evaluation and update: the teacher’s reward on each valid prompt is $1-\rho_{j_i}$ \(Eq\. [2](https://arxiv.org/html/2605.16727#S3.E2)\), where $\rho_{j_i}$ is the solve rate of the student $j_i$ matched to teacher $i$, replacing AZR’s self\-estimation with an inter\-population signal\. All adapters are updated in a single mixed batch\.
5. Evolution: every $k$ steps, the bottom fraction $\gamma$ of each sub\-population \(ranked by TrueSkill lower\-confidence bound\) is replaced by children produced via LoRA weight\-space operators applied to top\-ranked parents \(§ [3\.4](https://arxiv.org/html/2605.16727#S3.SS4)\)\.
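The matchmaking phase can be illustrated with a small sketch of PFSP sampling over Gaussian ratings. The exact weighting used in the paper is deferred to its appendix, so the choices here are assumptions: the `Rating` class, the TrueSkill default constants, the standard Gaussian win-probability formula, and the weight $p(1-p)$, which is one common PFSP weighting that favours near-balanced matchups. A teacher "win" is read as the student failing the problem.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Rating:
    mu: float = 25.0          # TrueSkill default mean
    sigma: float = 25.0 / 3.0  # TrueSkill default uncertainty


BETA = 25.0 / 6.0  # TrueSkill default performance noise


def win_probability(a: Rating, b: Rating) -> float:
    """P(a outperforms b) when both performances are Gaussian."""
    denom = math.sqrt(2.0 * BETA**2 + a.sigma**2 + b.sigma**2)
    return 0.5 * (1.0 + math.erf((a.mu - b.mu) / (denom * math.sqrt(2.0))))


def pfsp_weights(teacher: Rating, students: list) -> list:
    """Weight p * (1 - p) peaks at p = 0.5, i.e. at evenly matched pairs."""
    weights = []
    for student in students:
        p = win_probability(teacher, student)
        weights.append(p * (1.0 - p))
    return weights


def match(teachers: list, students: list, rng: random.Random) -> dict:
    """Sample one student index for every teacher index."""
    pairs = {}
    for t_idx, teacher in enumerate(teachers):
        weights = pfsp_weights(teacher, students)
        pairs[t_idx] = rng.choices(range(len(students)), weights=weights, k=1)[0]
    return pairs
```

Calling `match(teachers, students, random.Random(0))` with four ratings on each side returns a dictionary from teacher index to student index; students rated far above or below a teacher are sampled less often but never excluded.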

### 3\.4 LoRA Weight\-Space Evolution

The design goal is fast child creation: each operator has to emit a same\-rank LoRA adapter in seconds without any retraining, retain enough parent knowledge that the child can re\-enter training immediately, and inject enough diversity to move the child away from its parents in weight space\. We implement two families of operators: mutations \(M1–M4\) acting on a single parent and crossovers \(X1–X4\) combining two parents\. The eight operators described here are the ones we use throughout training; the broader catalog and the operators that failed retention tests are in Appendix [8](https://arxiv.org/html/2605.16727#S8)\.

#### Mutations\.

Let $\Delta W = BA^{\top} \in \mathbb{R}^{d \times d}$ denote a single LoRA layer’s effective update\. M1 \(SVD\-structured\) takes the singular\-value decomposition $\Delta W = U\Sigma V^{\top}$, perturbs the singular spectrum $\Sigma$ by a small multiplicative noise, and applies a first\-order Cayley rotation to $U, V$, preserving orthogonality while moving the child off the parent’s subspace\. M2 \(layer\-selective Gaussian\) draws a random 33 % subset of the layer\-module slots and adds Gaussian noise to $A, B$ only within that subset; the remaining two\-thirds of the adapter are copied verbatim\. M3 \(component masking\) zeroes a random subset of the SVD components of $\Delta W$, reducing effective rank and forcing the child to relearn masked directions\. M4 \(full Gaussian\) adds per\-tensor adaptive Gaussian noise to every $A, B$ factor, with the noise scale set by the tensor’s own running standard deviation\.
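A NumPy sketch of M1 and M3 for a single layer follows, using the convention $\Delta W = BA^{\top}$ with $B, A \in \mathbb{R}^{d \times r}$. It is an illustration under stated assumptions rather than the paper's code: the noise scales, the log-normal form of the multiplicative noise, the use of an exact Cayley transform of a small random skew-symmetric matrix (the text says "first-order"), and the $d \times d$ rotation, which is only practical here because the toy $d$ is small, are all choices made for the example.

```python
import numpy as np


def lora_svd(B, A):
    """Thin SVD of B @ A.T computed from the factors, without forming d x d."""
    Qb, Rb = np.linalg.qr(B)
    Qa, Ra = np.linalg.qr(A)
    Uc, S, Vct = np.linalg.svd(Rb @ Ra.T)  # small core matrix
    return Qb @ Uc, S, Qa @ Vct.T


def factors_from_svd(U, S, V):
    """Split the spectrum evenly so that B @ A.T == U @ diag(S) @ V.T."""
    root = np.sqrt(S)
    return U * root, V * root


def cayley_rotation(d, scale, rng):
    """Orthogonal matrix (I - K/2)^-1 (I + K/2) from a random skew-symmetric K."""
    G = rng.normal(size=(d, d)) * scale
    K = G - G.T
    I = np.eye(d)
    return np.linalg.solve(I - 0.5 * K, I + 0.5 * K)


def mutate_svd_structured(B, A, rng, spectrum_noise=0.05, rotation_scale=0.01):
    """M1 sketch: jitter singular values, rotate both singular bases slightly."""
    U, S, V = lora_svd(B, A)
    S_child = S * np.exp(spectrum_noise * rng.normal(size=S.shape))
    U_child = cayley_rotation(U.shape[0], rotation_scale, rng) @ U
    V_child = cayley_rotation(V.shape[0], rotation_scale, rng) @ V
    return factors_from_svd(U_child, S_child, V_child)


def mutate_component_mask(B, A, rng, drop_fraction=0.25):
    """M3 sketch: zero a random subset of singular components."""
    U, S, V = lora_svd(B, A)
    n_drop = int(round(drop_fraction * S.size))
    dropped = rng.choice(S.size, size=n_drop, replace=False)
    S_child = S.copy()
    S_child[dropped] = 0.0
    return factors_from_svd(U, S_child, V)
```

Both functions return factors with the parent's shapes, so the child stays at rank at most $r$ and can be loaded wherever the parent was. Applying the operator independently to every layer-module slot of an adapter would give a whole-adapter mutation.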

#### Crossovers\.

Let $\Delta W^{(1)}, \Delta W^{(2)}$ be two parents\. X1 \(DARE\) applies the DARE drop\-and\-rescale recipe of Yu et al\. \([2024](https://arxiv.org/html/2605.16727#bib.bib31)\) to each parent’s delta then sums them\. X2 \(layer\-wise\) selects each layer\-module slot independently from either parent, giving a layer\-modular recombination reminiscent of TIES\-style sign alignment \(Yadav et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib32)\)\. X3 \(SVD subspace\) takes the top\-$k$ singular components from one parent and fills the remaining $r-k$ components from the other, mixing parents within a common rank budget\. X4 \(extrapolative\) performs a task\-arithmetic\-style linear combination \(Ilharco et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib33)\) with a coefficient greater than one, extrapolating *beyond* the convex hull of the parents rather than interpolating between them\. All eight operators run in seconds on the adapter tensors alone, produce children at the same rank as the parents, and are validated as retention\-preserving in § [4\.6](https://arxiv.org/html/2605.16727#S4.SS6)\.
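Three of the crossovers (X2, X3, X4) can be sketched with NumPy on the same $\Delta W = BA^{\top}$ factors. The text does not say how a child is brought back to rank $r$, so several details are assumptions made for this example: X3 fills the remaining slots with the other parent's leading components, X4 uses the form $c\,\Delta W^{(1)} + (1-c)\,\Delta W^{(2)}$ with $c > 1$ followed by a rank-$r$ truncated SVD, and an adapter is represented as a dictionary from slot name to a `(B, A)` pair.

```python
import numpy as np


def lora_svd(B, A):
    """Thin SVD of B @ A.T computed from the factors."""
    Qb, Rb = np.linalg.qr(B)
    Qa, Ra = np.linalg.qr(A)
    Uc, S, Vct = np.linalg.svd(Rb @ Ra.T)
    return Qb @ Uc, S, Qa @ Vct.T


def factors_from_svd(U, S, V):
    root = np.sqrt(S)
    return U * root, V * root


def crossover_layerwise(parent1, parent2, rng):
    """X2 sketch: copy each layer-module slot wholesale from one parent."""
    return {
        slot: (parent1 if rng.random() < 0.5 else parent2)[slot]
        for slot in parent1
    }


def crossover_svd_subspace(B1, A1, B2, A2, k):
    """X3 sketch: top-k components of parent 1, r - k leading ones of parent 2."""
    r = B1.shape[1]
    U1, S1, V1 = lora_svd(B1, A1)
    U2, S2, V2 = lora_svd(B2, A2)
    U = np.hstack([U1[:, :k], U2[:, : r - k]])
    S = np.concatenate([S1[:k], S2[: r - k]])
    V = np.hstack([V1[:, :k], V2[:, : r - k]])
    return factors_from_svd(U, S, V)


def crossover_extrapolative(B1, A1, B2, A2, coeff=1.5):
    """X4 sketch: coeff * dW1 + (1 - coeff) * dW2, truncated back to rank r."""
    r = B1.shape[1]
    # Stacked factors represent the (up to rank-2r) combination exactly.
    B_cat = np.hstack([coeff * B1, (1.0 - coeff) * B2])
    A_cat = np.hstack([A1, A2])
    U, S, V = lora_svd(B_cat, A_cat)
    return factors_from_svd(U[:, :r], S[:r], V[:, :r])
```

Stacking the factors in `crossover_extrapolative` avoids materialising any $d \times d$ matrix, which is one way such an operator could stay cheap at real layer widths. The truncation discards part of the combination whenever the parents span different subspaces, so the child is an approximation of the extrapolated point rather than the point itself.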

## 4 Experiments

We evaluate PopuLoRA against a per\-adapter compute\-matched single\-agent AZR baseline on held\-out code and math benchmarks, plus training\-dynamics and problem\-complexity diagnostics, a population\-dynamics analysis, a LoRA\-operator retention test, and a population\-size ablation\. All comparisons share the same base model, reward scheme, and optimiser; the only difference is whether training runs one self\-calibrating agent or a teacher–student population\.

### 4\.1 Setup

#### Instantiation\.

The specific values below are the ones we use throughout our main experiments; the method itself does not depend on any of these being fixed to these values\. We use a frozen Qwen2\.5\-Coder\-7B \(Hui et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib15)\) base, the same model used by AZR, which enables a direct comparison against their publicly released checkpoint\. Every population member is a rank\-32 LoRA adapter with scaling $\alpha = 64$ attached to the attention projections $\{W_q, W_k, W_v, W_o\}$\. The main population is $N_T = N_S = 4$ \(we sweep population size as an ablation in § [4\.7](https://arxiv.org/html/2605.16727#S4.SS7)\)\. Training runs for 200 update steps with AdamW at learning rate $5 \times 10^{-5}$ and REINFORCE\+\+\-baseline advantages \(rollout $n = 8$, temperature $T = 1$, no KL penalty\)\. Each matchup generates a batch of 72 prompts split equally across code\_i, code\_o, code\_f; the student solves the same split\. Evolution runs every $k = 10$ steps on the bottom fraction $\gamma = 0\.25$ of each sub\-population ranked by TrueSkill lower\-confidence bound \(ablated in § [4\.7](https://arxiv.org/html/2605.16727#S4.SS7)\)\. Full hyperparameters are listed in Appendix [7](https://arxiv.org/html/2605.16727#S7)\.
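The settings above imply a few derived quantities that are easy to lose track of, so the sketch below collects them in a config object and shows how the replacement set could be chosen. The dataclass, its field names, and `members_to_replace` are hypothetical; the lower-confidence bound $\mu - 3\sigma$ is the conventional TrueSkill conservative estimate and is an assumption here, since the text does not give the multiplier.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PopuLoRAConfig:
    num_teachers: int = 4
    num_students: int = 4
    lora_rank: int = 32
    lora_alpha: int = 64
    train_steps: int = 200
    learning_rate: float = 5e-5
    rollouts_per_prompt: int = 8
    prompts_per_matchup: int = 72
    evolve_every: int = 10         # k
    replace_fraction: float = 0.25  # gamma


def lower_confidence_bound(mu, sigma, z=3.0):
    """Conservative skill estimate; z = 3 is an assumed multiplier."""
    return mu - z * sigma


def members_to_replace(ratings, replace_fraction):
    """Indices of the lowest-LCB members of one sub-population.

    `ratings` is a list of (mu, sigma) pairs, one per member.
    """
    n_replace = math.floor(replace_fraction * len(ratings))
    ranked = sorted(
        range(len(ratings)),
        key=lambda i: lower_confidence_bound(*ratings[i]),
    )
    return ranked[:n_replace]


cfg = PopuLoRAConfig()
prompts_per_type = cfg.prompts_per_matchup // 3
student_rollouts_per_matchup = cfg.prompts_per_matchup * cfg.rollouts_per_prompt
evolution_events = cfg.train_steps // cfg.evolve_every
```

With these values each task type receives 24 prompts per matchup, a student produces up to 576 rollouts per matchup (fewer if some teacher problems are invalid), evolution fires at most 20 times over the run, and with four members per sub-population $\gamma = 0.25$ replaces exactly one teacher and one student per event.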

#### Evaluation\.

At evaluation time we merge each final adapter into the frozen base with PEFT and report greedy pass@1 on \(i\) code benchmarks HumanEval\+ \(Chen et al\., [2021](https://arxiv.org/html/2605.16727#bib.bib18); Liu et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib19)\), MBPP\+ \(Austin et al\., [2021](https://arxiv.org/html/2605.16727#bib.bib20); Liu et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib19)\), and LiveCodeBench v5 \(Jain et al\., [2024a](https://arxiv.org/html/2605.16727#bib.bib21)\), and \(ii\) math benchmarks AIME 24/25, AMC 23, MATH\-500 \(Hendrycks et al\., [2021](https://arxiv.org/html/2605.16727#bib.bib22); Lightman et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib26)\), Minerva \(Lewkowycz et al\., [2022](https://arxiv.org/html/2605.16727#bib.bib24)\), GSM8K \(Cobbe et al\., [2021](https://arxiv.org/html/2605.16727#bib.bib23)\), and OlympiadBench \(He et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib25)\)\. For the population we report the population mean as the headline result, and include the per\-benchmark argmax across adapters as a test\-set\-selected upper bound \(not a deployable selection rule, since the choice of adapter depends on the benchmark scores\); per\-adapter breakdowns are in Appendix [14](https://arxiv.org/html/2605.16727#S14)\.
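Merging an adapter into the base amounts to folding the low-rank update into each adapted weight matrix, which in PEFT is what `merge_and_unload()` does for LoRA models. The NumPy sketch below checks the underlying identity on a toy layer, using the paper's $\Delta W = BA^{\top}$ convention and the standard LoRA scaling $\alpha / r$ (here $64 / 32 = 2$). The matrix sizes and random weights are placeholders, and `merge_lora` is a hypothetical helper, not a PEFT function.

```python
import numpy as np


def merge_lora(W, B, A, alpha, rank):
    """Fold the adapter into the base weight: W + (alpha / rank) * B @ A.T."""
    return W + (alpha / rank) * (B @ A.T)


rng = np.random.default_rng(0)
d, r, alpha = 64, 32, 64            # toy width; rank and alpha as in the setup
W = rng.normal(size=(d, d))         # stand-in for a frozen base projection
B = rng.normal(size=(d, r)) * 0.01  # stand-in adapter factors
A = rng.normal(size=(d, r)) * 0.01
x = rng.normal(size=(d,))

# Adapter kept separate: base path plus scaled low-rank path.
unmerged = W @ x + (alpha / r) * (B @ (A.T @ x))
# Adapter folded in: a single dense matrix, no extra inference cost.
merged = merge_lora(W, B, A, alpha, r) @ x

print(np.allclose(unmerged, merged))
```

Because the two paths agree up to floating-point error, evaluating a merged checkpoint measures the same policy that was trained with the adapter attached.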

### 4\.2 Downstream Benchmarks

Figure [2](https://arxiv.org/html/2605.16727#S4.F2) reports greedy pass@1 across three code and seven math benchmarks, comparing the population’s mean and best teacher and best student against the per\-adapter compute\-matched baseline\. The 8T\+8S rows use their available 100\-gradient\-step checkpoint\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x2.png)

Figure 2: Downstream pass@1\. Baseline AZR \(LoRA\) is per\-adapter compute\-matched\. The 8T\+8S population is evaluated at 100 gradient steps\. For each population size, the lighter bar \(the headline number\) shows the mean across all adapters; the darker cap shows the per\-benchmark argmax across adapters, which is a test\-set\-selected upper bound, not a deployable result, since the choice of adapter depends on the benchmark scores\. Error bars show $\pm 1$ std across adapters\. Full per\-adapter breakdown in Appendix [14](https://arxiv.org/html/2605.16727#S14)\.

#### Code\.

The population mean is at or above the per\-adapter compute\-matched baseline on every code benchmark \(Figure [2](https://arxiv.org/html/2605.16727#S4.F2)\): 78\.7 vs\. baseline LoRA 76\.8 on HumanEval\+, 70\.8 vs\. 69\.3 on MBPP\+, and 27\.0 vs\. 17\.3 on LiveCodeBench\. The largest gap sits on LiveCodeBench, which contains recent competitive\-programming problems from online contests and is therefore structurally different from the self\-generated code triples AZR trains on\.

#### Math\.

Despite AZR providing no direct math supervision \(its verifier is a Python executor on code triples, not a math grader\), PopuLoRA shows consistent out\-of\-domain gains\. The population mean improves on the baseline LoRA on every math benchmark we measured: 11\.7 vs\. 6\.7 on AIME 24, 40\.0 vs\. 37\.5 on AMC 23, 69\.8 vs\. 59\.4 on MATH\-500, 27\.0 vs\. 19\.1 on Minerva, and 33\.9 vs\. 27\.3 on OlympiadBench, with a math average of 38\.2 vs\. 34\.1\. Both arms improve over the frozen base, but the population’s margin is larger on every math benchmark, with the biggest absolute gaps on the competition\-level benchmarks \(AIME, OlympiadBench\) where the baseline barely moves from the base model\. Even the weakest member of the 4T\+4S population beats the baseline on aggregate \(Table [3](https://arxiv.org/html/2605.16727#S14.T3)\), confirming that co\-evolution lifts the entire population rather than concentrating gains in a few specialists\. Per\-benchmark argmax over the four adapters of each role \(a test\-set\-selected upper bound, not a deployable selection rule\) lifts the math gains further, particularly on AIME and AMC, on which the population was never directly trained: see Table [3](https://arxiv.org/html/2605.16727#S14.T3) for the full per\-adapter breakdown\.

### 4\.3 Training Dynamics: Self\-Calibration vs\. Co\-Adaptation

The training dynamics of the baseline and the population tell two qualitatively different stories \(Figure [3](https://arxiv.org/html/2605.16727#S4.F3)\)\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x3.png)

Figure 3: Training dynamics\. Left two panels: solver \(solve rate, format rate\)\. Right two panels: teacher \(problem difficulty $= 1 - \text{solve rate}$, validity rate\)\. Baseline in black, population mean in blue with per\-member spread\. Per\-type breakdown in Appendix [11](https://arxiv.org/html/2605.16727#S11)\.

In the single\-agent baseline, the solver’s solve rate rises monotonically and plateaus within the first 200 steps\. A near\-perfect solve rate is not the right signal here: it means the teacher is unable to generate problems hard enough to make the student fail\. The single agent simultaneously learns to produce valid problems in the correct format *and* to solve everything it generates\. There is no increasing complexity, no co\-adaptation, no pressure to improve further: the system has found a stable fixed point where both roles are satisfied with minimal effort\. Notably, the teacher\-difficulty curves can look superficially similar between baseline and population, but they mean different things\. In the baseline, the teacher is rewarded against its own solver’s solve rate, so difficulty is a self\-estimate by the same network that proposes the problem; in the population, student failures directly reflect a separate, matched student’s solve rate, so difficulty is an inter\-population signal\.

The population’s dynamics are strikingly different\. Student solve rates oscillate throughout training rather than monotonically rising\. This pattern has a natural explanation: as teachers co\-adapt and generate harder problems, students start failing; once students catch up, teachers are pushed to produce yet harder problems, and the cycle repeats\. This phasic dynamic is the signature of a genuine co\-evolutionary arms race rather than a self\-calibrating fixed point\.

Crucially, the population’s seemingly lower training\-time solve rate is a sign of strength\. When we evaluate the end\-of\-training checkpoints on held\-out benchmarks \(Figure [2](https://arxiv.org/html/2605.16727#S4.F2)\), they consistently outperform the baseline whose training curve looked near\-perfect\. The population’s teachers kept pushing difficulty upward, which forced the students to develop capabilities the baseline never needed\.

The per-type breakdown (Appendix [11](https://arxiv.org/html/2605.16727#S11), Figure [10](https://arxiv.org/html/2605.16727#S11.F10)) reveals that the oscillation runs at different frequencies across task types: `code_o` cycles rapidly between teacher- and student-dominated phases, while `code_i` and `code_f` oscillate more slowly, consistent with these tasks requiring deeper program understanding to sustain difficulty pressure. The baseline’s policy entropy collapses to near zero while the population’s teachers maintain non-trivial entropy throughout training (Appendix [16](https://arxiv.org/html/2605.16727#S16)), and the population’s response length grows to ∼1000 tokens versus the baseline’s ∼250 (Appendix [17](https://arxiv.org/html/2605.16727#S17)), consistent with more elaborate reasoning (Guo et al., [2025](https://arxiv.org/html/2605.16727#bib.bib39)).

### 4.4 Problem Complexity: Collapse vs. Growth

The training\-dynamics story is corroborated by direct measurements of problem complexity \(Figure[4](https://arxiv.org/html/2605.16727#S4.F4)\)\. We track four structural metrics of teacher\-generated programs over training: AST depth, cyclomatic complexity, lines of code, and variable count \(definitions in App\.[10](https://arxiv.org/html/2605.16727#S10)\)\.
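As a rough illustration of how such structural metrics can be computed, the sketch below uses Python's standard `ast` module. The precise definitions are those of Appendix 10 and are not reproduced here, so every definition in this sketch (which nodes count as branches, what counts as a variable, how lines are counted) is an assumption, and the function names are hypothetical.

```python
import ast

# Node types counted as decision points (an assumed, simplified definition).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp,
                 ast.ExceptHandler, ast.Assert, ast.comprehension)


def ast_depth(node):
    """Length of the longest root-to-leaf path in the syntax tree."""
    children = list(ast.iter_child_nodes(node))
    return 1 + max((ast_depth(child) for child in children), default=0)


def cyclomatic_complexity(tree):
    """One plus the number of decision points found in the tree."""
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of and/or adds one short-circuit branch.
            complexity += len(node.values) - 1
    return complexity


def variable_count(tree):
    """Number of distinct names that are assigned to or bound as arguments."""
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.add(node.id)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
    return len(names)


def lines_of_code(source):
    """Non-blank lines that are not pure comments."""
    stripped = (line.strip() for line in source.splitlines())
    return sum(1 for line in stripped if line and not line.startswith("#"))


def structural_metrics(source):
    tree = ast.parse(source)
    return {
        "ast_depth": ast_depth(tree),
        "cyclomatic_complexity": cyclomatic_complexity(tree),
        "lines_of_code": lines_of_code(source),
        "variable_count": variable_count(tree),
    }
```

Applied to a one-line function and to a function containing a loop with a compound condition, all four values are larger for the second program, which is the direction of change the figure tracks over training.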

![Refer to caption](https://arxiv.org/html/2605.16727v1/x4.png)

Figure 4: Program complexity over training. Baseline (black) trends downward on all four axes; population (blue) trends upward. Coverage analysis in Appendix [10](https://arxiv.org/html/2605.16727#S10).

The difference is clear. In every panel the baseline curves trend *downward*: the single-agent teacher learns to produce progressively simpler programs along every axis, converging on the simplest programs it can consistently generate in valid format and solve. The population teachers show the opposite trajectory: all four complexity metrics rise throughout training. Cross-evaluation rewards teachers for problems that are hard for the matched student, which creates sustained upward pressure on problem difficulty rather than the downward drift of self-calibration.

Problem-space coverage tells the same story: we tile the structural feature space with a CVT archive (Vassiliades et al., [2018](https://arxiv.org/html/2605.16727#bib.bib9)) of 4,096 cells and track the fraction filled over training (Appendix [10](https://arxiv.org/html/2605.16727#S10)). Baseline coverage plateaus early; the population keeps expanding through 200 steps, generating increasingly diverse problems alongside increasingly complex ones. By step 100 the baseline has collapsed to programs as trivial as `return number * 3` (Appendix [9](https://arxiv.org/html/2605.16727#S9)).
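A coverage measurement of this kind can be sketched in a few lines. The version below assumes feature vectors already normalised to the unit hypercube and approximates the centroidal Voronoi tessellation with a few Lloyd iterations over uniform samples; the helper names, the sample counts, and the small cell count in the usage note are illustrative assumptions, not the paper's configuration.

```python
import random


def _nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def build_cvt_centroids(n_cells, dim, n_samples=2000, iters=5, seed=0):
    """Approximate CVT centroids with Lloyd iterations on uniform samples."""
    rng = random.Random(seed)
    samples = [[rng.random() for _ in range(dim)] for _ in range(n_samples)]
    centroids = [list(s) for s in rng.sample(samples, n_cells)]
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(n_cells)]
        counts = [0] * n_cells
        for s in samples:
            i = _nearest(s, centroids)
            counts[i] += 1
            for d in range(dim):
                sums[i][d] += s[d]
        for i in range(n_cells):
            if counts[i]:  # keep the old centroid if a cell received no samples
                centroids[i] = [v / counts[i] for v in sums[i]]
    return centroids


def coverage(features, centroids):
    """Fraction of archive cells that contain at least one feature vector."""
    filled = {_nearest(f, centroids) for f in features}
    return len(filled) / len(centroids)
```

For example, `coverage(features, build_cvt_centroids(16, 4))` returns the filled fraction of a 16-cell archive over four features. A teacher that keeps emitting near-identical programs fills a single cell, whereas a diverse teacher fills many, which is the contrast the coverage curves capture.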

### 4.5 Population Dynamics: Arms Race and Specialisation

The TrueSkill ratings provide a direct view of the co\-evolutionary dynamics \(Figure[5](https://arxiv.org/html/2605.16727#S4.F5)\)\. Three features stand out\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x5.png)

Figure 5: TrueSkill μ and arms race. Left/centre: per-adapter ratings (light) and role mean (bold). Right: matchup outcome from student (blue) vs. teacher (orange) perspective; the lead alternates throughout training.

Individual adapters differentiate from the population mean as training progresses. Early on, all members cluster near the prior μ = 25; by mid-training, distinct high and low performers have emerged in both sub-populations, indicating that the population dynamics produce genuine specialisation rather than homogeneous copies.

The right panel reveals an oscillating arms race between the two roles\. There are periods where teachers dominate — their problems are too hard for the matched students — but students eventually catch up, and the cycle repeats\. This alternating lead is a characteristic of adversarial co\-evolution: neither role is able to settle into a fixed strategy because the other keeps adapting\. Per\-student solve\-rate profiles against different teachers are non\-uniform \(Appendix[13](https://arxiv.org/html/2605.16727#S13)\), confirming that specialisation extends to the matchup level\.
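For readers unfamiliar with the rating system, the following is a minimal sketch of the standard two-player TrueSkill update without draws (Herbrich et al., 2006), using the usual defaults μ = 25, σ = 25/3 and β = 25/6. How a teacher–student matchup is converted into a win or a loss is defined in the paper's method section and is not shown here, so treating a matchup as a single win/loss event is an assumption; the dynamics noise term is also omitted for brevity.

```python
import math
from dataclasses import dataclass

MU0 = 25.0        # prior mean, matching the prior quoted in the text
SIGMA0 = MU0 / 3  # standard TrueSkill default prior deviation
BETA = MU0 / 6    # standard TrueSkill default performance noise


@dataclass
class Rating:
    mu: float = MU0
    sigma: float = SIGMA0


def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def update(winner, loser):
    """Return new (winner, loser) ratings after one decisive matchup."""
    c = math.sqrt(2.0 * BETA ** 2 + winner.sigma ** 2 + loser.sigma ** 2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)   # mean shift factor; large when the result is a surprise
    w = v * (v + t)         # variance shrink factor, between 0 and 1
    new_winner = Rating(
        winner.mu + winner.sigma ** 2 / c * v,
        winner.sigma * math.sqrt(1.0 - winner.sigma ** 2 / c ** 2 * w),
    )
    new_loser = Rating(
        loser.mu - loser.sigma ** 2 / c * v,
        loser.sigma * math.sqrt(1.0 - loser.sigma ** 2 / c ** 2 * w),
    )
    return new_winner, new_loser
```

Under this update an upset moves both means further than an expected result does, so a role that starts winning matchups it previously lost sees its mean rating climb quickly, which is how an alternating lead would show up in the rating curves.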

### 4.6 LoRA Operator Retention

All eight operators that ship in the live population pass the retention test \(Figure[6](https://arxiv.org/html/2605.16727#S4.F6)\): every child recovers to parent\-level reward within 10–20 update steps after being re\-injected into training\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x6.png)

Figure 6: LoRA operator retention (snapshot step 25). Top: mutations (parent in grey). Bottom: crossovers (two parents in grey; trained on different task types). All children recover to near-parent performance within ∼20 steps. Full operator grid in Appendix [15](https://arxiv.org/html/2605.16727#S15).

The mutation results (top row) confirm that perturbed children start close to their parent and resume gradient updates without resetting to the frozen base, validating the operators as a legitimate PBT replacement step. The crossover results (bottom row) combine parents trained on different AZR task types (e.g. one specialised on induction and the other on output prediction). The crossover child retains performance on *both* parents’ tasks, demonstrating that weight-space recombination can compose complementary specialisations into a single adapter. See Appendix [15](https://arxiv.org/html/2605.16727#S15) for the full operator grid across all snapshot steps.
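To give a feel for what a same-rank weight-space operator looks like, here is a deliberately generic sketch. It is not one of the paper's eight operators, whose definitions live in the method section and appendix. It assumes an adapter is a pair of factors `A` (rank × input dim) and `B` (output dim × rank) stored as nested lists, and shows one Gaussian-noise mutation and one rank-wise crossover; both function names and the noise scale are hypothetical.

```python
import random


def mutate(A, B, scale=0.01, seed=0):
    """Illustrative mutation: add small Gaussian noise to both LoRA factors."""
    rng = random.Random(seed)

    def noisy(matrix):
        return [[x + rng.gauss(0.0, scale) for x in row] for row in matrix]

    return noisy(A), noisy(B)


def rank_crossover(parent1, parent2, seed=0):
    """Illustrative crossover: copy each rank component from one parent.

    Rank component i is the pair (row i of A, column i of B). Copying whole
    components keeps the child at the same rank as its parents.
    """
    (A1, B1), (A2, B2) = parent1, parent2
    rng = random.Random(seed)
    rank = len(A1)
    from_first = [rng.random() < 0.5 for _ in range(rank)]
    child_A = [list(A1[i] if from_first[i] else A2[i]) for i in range(rank)]
    child_B = [
        [B1[o][i] if from_first[i] else B2[o][i] for i in range(rank)]
        for o in range(len(B1))
    ]
    return child_A, child_B
```

Because both functions return factors with the parents' shapes, the child can be loaded into the same adapter slot and continue receiving gradient updates, which is the property the retention test checks empirically.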

### 4.7 Population Size Ablation

The population size ablation \(Figure[7](https://arxiv.org/html/2605.16727#S4.F7)\) reveals that the co\-evolutionary dynamics we observe are not simply a consequence of having two roles — they require a population\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x7.png)

Figure 7: Population size ablation. Even a single teacher–student pair (1T+1S) avoids the baseline’s mode collapse. Co-evolutionary oscillations become more pronounced at 4T+4S and 8T+8S. The 8T+8S run shown here stops at 100 gradient steps.

Even at the smallest population size, 1T+1S, decoupling the teacher and student into separate adapters is enough to avoid the baseline’s mode collapse: the solver reward does not plateau at near-perfect levels, and downstream evaluation (Table [3](https://arxiv.org/html/2605.16727#S14.T3)) shows that 1T+1S already outperforms the single-agent baseline on most benchmarks. The oscillation pattern is less pronounced than in larger populations, but the structural separation of proposer and solver prevents the self-calibrating fixed point that limits single-agent training.

At 4T\+4S the co\-evolutionary oscillations become clearly visible and the reward dynamics show the phasic arms race described above\. This is the configuration we use for the main comparison\. At 8T\+8S the oscillations are more pronounced, though the downstream gains depend on the benchmark \(Table[3](https://arxiv.org/html/2605.16727#S14.T3)\); both the ablation trace and downstream evaluation for 8T\+8S use the 100\-gradient\-step checkpoint\.

## 5 Discussion and Conclusion

PopuLoRA replaces single-agent self-calibration with co-evolving teacher–student LoRA populations on a shared frozen base. Where the single-agent baseline converges to a fixed point of easy problems and near-perfect reward, the population sustains an arms race that produces increasingly complex problems and stronger downstream performance on every code and math benchmark we measured. Even the weakest member of the 4T+4S population outperforms the baseline on aggregate, indicating that co-evolution lifts the entire population rather than concentrating gains in a few specialists (Table [3](https://arxiv.org/html/2605.16727#S14.T3)). A minimal 1T+1S configuration, which decouples proposer and solver without population dynamics, already improves over the baseline, confirming that structural asymmetry is the primary driver.

#### Compute cost\.

Per-adapter compute matches the baseline: same tokens, rollouts, and updates per step. Total work scales with adapter count, but wall-clock scales sub-linearly because vLLM multi-LoRA batching shares the frozen base forward pass (Sheng et al., [2024](https://arxiv.org/html/2605.16727#bib.bib17)). On the same 1×8×H100 node, 4T+4S trains 8× more adapters for only 1.31× the baseline wall-clock (Appendix [7](https://arxiv.org/html/2605.16727#S7), Table [2](https://arxiv.org/html/2605.16727#S7.T2)).
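A short worked calculation shows what these two figures imply per adapter. It treats the reported 1.31× as exact and ignores any fixed overheads; the symbols $T_{\text{baseline}}$ and $T_{\text{4T+4S}}$ for the two wall-clock times are introduced here only for illustration.

```latex
\frac{T_{\text{4T+4S}}}{8\,T_{\text{baseline}}} = \frac{1.31}{8} \approx 0.164,
\qquad
\frac{8 / T_{\text{4T+4S}}}{1 / T_{\text{baseline}}} = \frac{8}{1.31} \approx 6.1
```

Under these assumptions each adapter costs roughly 16% of the baseline's wall-clock time, or equivalently the node trains about 6.1 times as many adapters per unit of wall-clock time.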

#### What LoRA evolution does not replace\.

Evolution operators recombine and perturb existing adapters; new behaviour still comes from policy\-gradient updates after injection\.

#### Limitations\.

We fix the LoRA rank to 32 across all population members; heterogeneous ranks could further diversify the population but we have not explored this axis\. All experiments use a single base model \(Qwen2\.5\-Coder\-7B\-Instruct\) and a single verifier domain \(sandboxed Python execution\), so it remains open whether the co\-evolutionary dynamics transfer to other base scales or to domains with weaker verifiers\. Code will be released upon publication\.

## References

- T. Akiba, M. Shing, Y. Tang, Q. Sun, and D. Ha (2025). Evolutionary optimization of model merging recipes. Nature Machine Intelligence 7, pp. 195–204. External Links: [Document](https://dx.doi.org/10.1038/s42256-024-00975-8), [Link](https://www.nature.com/articles/s42256-024-00975-8).
- J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton (2021). Program synthesis with large language models. External Links: arXiv:2108.07732, [Document](https://dx.doi.org/10.48550/arXiv.2108.07732), [Link](https://arxiv.org/abs/2108.07732).
- B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, and I. Mordatch (2020). Emergent tool use from multi-agent autocurricula. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=SkxpxJBKwS).
- E. L. Buehler and M. J. Buehler (2024). X-LoRA: mixture of low-rank adapter experts, a flexible framework for large language models with applications in protein mechanics and molecular design. External Links: arXiv:2402.07148, [Document](https://dx.doi.org/10.48550/arXiv.2402.07148), [Link](https://arxiv.org/abs/2402.07148).
- J. Chen, B. Zhang, R. Ma, P. Wang, X. Liang, Z. Tu, X. Li, and K. K. Wong (2025). SPC: evolving self-play critic via adversarial games for LLM reasoning. In Advances in Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=JddJvNSiHk).
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021). Evaluating large language models trained on code. External Links: arXiv:2107.03374, [Document](https://dx.doi.org/10.48550/arXiv.2107.03374), [Link](https://arxiv.org/abs/2107.03374).
- Z. Chen, Y. Deng, H. Yuan, K. Ji, and Q. Gu (2024). Self-play fine-tuning converts weak language models to strong language models. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 235, pp. 6621–6642. External Links: [Link](https://proceedings.mlr.press/v235/chen24j.html).
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021). Training verifiers to solve math word problems. External Links: arXiv:2110.14168, [Document](https://dx.doi.org/10.48550/arXiv.2110.14168), [Link](https://arxiv.org/abs/2110.14168).
- P. T. Deep, R. Bhardwaj, and S. Poria (2024). DELLA-Merging: reducing interference in model merging through magnitude-based sampling. External Links: arXiv:2406.11617, [Document](https://dx.doi.org/10.48550/arXiv.2406.11617), [Link](https://arxiv.org/abs/2406.11617).
- M. Dennis, N. Jaques, E. Vinitsky, A. Bayen, S. Russell, A. Critch, and S. Levine (2020). Emergent complexity and zero-shot transfer via unsupervised environment design. In Advances in Neural Information Processing Systems, Vol. 33, pp. 13049–13061. External Links: [Link](https://proceedings.neurips.cc/paper/2020/hash/985e9a46e10005356bbaf194249f6856-Abstract.html).
- Y. Duan, J. Ye, and X. Zhao (2026). ALIVE: awakening LLM reasoning via adversarial learning and instructive verbal evaluation. External Links: arXiv:2602.05472, [Document](https://dx.doi.org/10.48550/arXiv.2602.05472), [Link](https://arxiv.org/abs/2602.05472).
- S. Feng, Z. Wang, P. Goyal, Y. Wang, W. Shi, H. Xia, H. Palangi, L. Zettlemoyer, Y. Tsvetkov, C. Lee, and T. Pfister (2025a). Heterogeneous swarms: jointly optimizing model roles and weights for multi-LLM systems. In Advances in Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=zYEZ5KqtDO).
- S. Feng, Z. Wang, Y. Wang, S. Ebrahimi, H. Palangi, L. Miculicich, A. Kulshrestha, N. Rauschmayr, Y. Choi, Y. Tsvetkov, C. Lee, and T. Pfister (2025b). Model swarms: collaborative search to adapt LLM experts via swarm intelligence. In Proceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 267, pp. 16904–16930. External Links: [Link](https://proceedings.mlr.press/v267/feng25o.html).
- C. Fernando, D. S. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel (2024). Promptbreeder: self-referential self-improvement via prompt evolution. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 235, pp. 13481–13544. External Links: [Link](https://proceedings.mlr.press/v235/fernando24a.html).
- X. Guan, L. L. Zhang, Y. Liu, N. Shang, Y. Sun, Y. Zhu, F. Yang, and M. Yang (2025). rStar-Math: small LLMs can master math reasoning with self-evolved deep thinking. External Links: arXiv:2501.04519, [Document](https://dx.doi.org/10.48550/arXiv.2501.04519), [Link](https://arxiv.org/abs/2501.04519).
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025). DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, pp. 633–638. External Links: [Document](https://dx.doi.org/10.1038/s41586-025-09422-z), [Link](https://www.nature.com/articles/s41586-025-09422-z).
- C. He, R. Luo, Y. Bai, S. Hu, Z. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun (2024). OlympiadBench: a challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, pp. 3828–3850. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.211), [Link](https://aclanthology.org/2024.acl-long.211/).
- J. Heinrich and D. Silver (2016). Deep reinforcement learning from self-play in imperfect-information games. External Links: arXiv:1603.01121, [Document](https://dx.doi.org/10.48550/arXiv.1603.01121), [Link](https://arxiv.org/abs/1603.01121).
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021). Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track. External Links: [Link](https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html).
- R. Herbrich, T. Minka, and T. Graepel (2006). TrueSkill: a bayesian skill rating system. In Advances in Neural Information Processing Systems, Vol. 19. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2006/hash/f44ee263952e65b3610b8ba51229d1f9-Abstract.html).
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022). LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=nZeVKeeFYf9).
- J. Hu, J. K. Liu, H. Xu, and W. Shen (2025). REINFORCE++: stabilizing critic-free policy optimization with global advantage normalization. External Links: arXiv:2501.03262, [Document](https://dx.doi.org/10.48550/arXiv.2501.03262), [Link](https://arxiv.org/abs/2501.03262).
- C. Huang, Q. Liu, B. Y. Lin, T. Pang, C. Du, and M. Lin (2023). LoRAHub: efficient cross-task generalization via dynamic LoRA composition. External Links: arXiv:2307.13269, [Document](https://dx.doi.org/10.48550/arXiv.2307.13269), [Link](https://arxiv.org/abs/2307.13269).
- C. Huang, W. Yu, X. Wang, H. Zhang, Z. Li, R. Li, J. Huang, H. Mi, and D. Yu (2025). R-Zero: self-evolving reasoning LLM from zero data. External Links: arXiv:2508.05004, [Document](https://dx.doi.org/10.48550/arXiv.2508.05004), [Link](https://arxiv.org/abs/2508.05004).
- B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, K. Dang, Y. Fan, Y. Zhang, A. Yang, R. Men, F. Huang, B. Zheng, Y. Miao, S. Quan, Y. Feng, X. Ren, X. Ren, J. Zhou, and J. Lin (2024). Qwen2.5-Coder technical report. External Links: arXiv:2409.12186, [Document](https://dx.doi.org/10.48550/arXiv.2409.12186), [Link](https://arxiv.org/abs/2409.12186).
- G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi (2023). Editing models with task arithmetic. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=6t0Kwf8-jrj).
- M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu (2017). Population based training of neural networks. External Links: arXiv:1711.09846, [Document](https://dx.doi.org/10.48550/arXiv.1711.09846), [Link](https://arxiv.org/abs/1711.09846).
- N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2024a). LiveCodeBench: holistic and contamination free evaluation of large language models for code. External Links: arXiv:2403.07974, [Document](https://dx.doi.org/10.48550/arXiv.2403.07974), [Link](https://arxiv.org/abs/2403.07974).
- N. Jain, P. Chiang, Y. Wen, J. Kirchenbauer, H. Chu, G. Somepalli, B. R. Bartoldson, B. Kailkhura, A. Schwarzschild, A. Saha, M. Goldblum, J. Geiping, and T. Goldstein (2024b). NEFTune: noisy embeddings improve instruction finetuning. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=0bMmZ3fkCk).
- S. Jana, C. Sancaktar, T. Daniš, G. Martius, A. Orvieto, and P. Kolev (2026). GASP: guided asymmetric self-play for coding LLMs. In ICLR 2026 Workshop on AI with Recursive Self-Improvement. Note: Spotlight; also accepted to the ICLR 2026 Workshop on Lifelong Agents. External Links: arXiv:2603.15957, [Link](https://openreview.net/forum?id=ChWC0E93lF).
- D. Korotyshova, B. Shaposhnikov, A. Malakhov, A. Khokhulin, N. Surnachev, K. Ovcharenko, G. Bredis, A. Gorbatovski, V. Sinii, and D. Gavrilov (2025). Evolutionary strategies for scalable alignment. External Links: arXiv:2507.04453, [Document](https://dx.doi.org/10.48550/arXiv.2507.04453), [Link](https://arxiv.org/abs/2507.04453).
- J. G. Kuba, M. Gu, Q. Ma, Y. Tian, V. Mohan, and J. Chen (2025). Language self-play for data-free training. External Links: arXiv:2509.07414, [Document](https://dx.doi.org/10.48550/arXiv.2509.07414), [Link](https://arxiv.org/abs/2509.07414).
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023). Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pp. 611–626. External Links: [Document](https://dx.doi.org/10.1145/3600006.3613165), [Link](https://doi.org/10.1145/3600006.3613165).
- N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, Y. Gu, S. Malik, V. Graf, J. D. Hwang, J. Yang, R. Le Bras, O. Tafjord, C. Wilhelm, L. Soldaini, N. A. Smith, Y. Wang, P. Dasigi, and H. Hajishirzi (2024). Tülu 3: pushing frontiers in open language model post-training. External Links: arXiv:2411.15124, [Document](https://dx.doi.org/10.48550/arXiv.2411.15124), [Link](https://arxiv.org/abs/2411.15124).
- A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra (2022). Solving quantitative reasoning problems with language models. In Advances in Neural Information Processing Systems, Vol. 35. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/18abbeef8cfe9203fdf9053c9c4fe191-Abstract-Conference.html).
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2024). Let’s verify step by step. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=v8L0pN6EOi).
- B. Liu, L. Guertler, S. Yu, Z. Liu, P. Qi, D. Balcells, M. Liu, C. Tan, W. Shi, M. Lin, W. S. Lee, and N. Jaques (2026). SPIRAL: self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=7Yayy5fNLg).
- B. Liu, C. Jin, S. Kim, W. Yuan, W. Zhao, I. Kulikov, X. Li, S. Sukhbaatar, J. Lanchantin, and J. Weston (2025). SPICE: self-play in corpus environments improves reasoning. External Links: arXiv:2510.24684, [Document](https://dx.doi.org/10.48550/arXiv.2510.24684), [Link](https://arxiv.org/abs/2510.24684).
- J. Liu, C. S. Xia, Y. Wang, and L. Zhang (2023). Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=1qvx610Cu7).
- M. S. Matena and C. Raffel (2022). Merging models with fisher-weighted averaging. In Advances in Neural Information Processing Systems, Vol. 35, pp. 17703–17716. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/70c26937fbf3d4600b69a129031b66ec-Abstract-Conference.html).
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe (2022). Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, Vol. 35, pp. 27730–27744. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf).
- J. Parker-Holder, M. Jiang, M. Dennis, M. Samvelyan, J. Foerster, E. Grefenstette, and T. Rocktäschel (2022). Evolving curricula with regret-based environment design. In Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 162, pp. 17473–17498. External Links: [Link](https://proceedings.mlr.press/v162/parker-holder22a.html).
- X. Qiu, Y. Gan, C. F. Hayes, Q. Liang, Y. Xu, R. Dailey, E. Meyerson, B. Hodjat, and R. Miikkulainen (2025). Evolution strategies at scale: LLM fine-tuning beyond reinforcement learning. External Links: arXiv:2509.24372, [Document](https://dx.doi.org/10.48550/arXiv.2509.24372), [Link](https://arxiv.org/abs/2509.24372).
- B. Sarkar, M. Fellows, J. A. Duque, A. Letcher, A. L. Villares, A. Sims, C. Wibault, D. Samsonov, D. Cope, J. Liesen, K. Li, L. Seier, T. Wolf, U. Berdica, V. Mohl, A. D. Goldie, A. Courville, K. Sevegnani, S. Whiteson, and J. N. Foerster (2025). Evolution strategies at the hyperscale. External Links: arXiv:2511.16652, [Document](https://dx.doi.org/10.48550/arXiv.2511.16652), [Link](https://arxiv.org/abs/2511.16652).
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. External Links: arXiv:1707.06347, [Document](https://dx.doi.org/10.48550/arXiv.1707.06347), [Link](https://arxiv.org/abs/1707.06347).
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. External Links: arXiv:2402.03300, [Document](https://dx.doi.org/10.48550/arXiv.2402.03300), [Link](https://arxiv.org/abs/2402.03300).
- Y. Sheng, S. Cao, D. Li, C. Hooper, N. Lee, S. Yang, C. Chou, B. Zhu, L. Zheng, K. Keutzer, J. E. Gonzalez, and I. Stoica (2024). S-LoRA: serving thousands of concurrent LoRA adapters. In Proceedings of Machine Learning and Systems, Vol. 6, pp. 296–311. External Links: [Link](https://proceedings.mlsys.org/paper_files/paper/2024/hash/906419cd502575b617cc489a1a696a67-Abstract-Conference.html).
- S. Sukhbaatar, Z. Lin, I. Kostrikov, G. Synnaeve, A. Szlam, and R. Fergus (2018). Intrinsic motivation and automatic curricula via asymmetric self-play. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=SkT5Yg-RZ).
- S. Sundaram, J. Quan, A. Kwiatkowski, K. Ahuja, Y. Ollivier, and J. Kempe (2026). Teaching models to teach themselves: reasoning at the edge of learnability. External Links: arXiv:2601.18778, [Document](https://dx.doi.org/10.48550/arXiv.2601.18778), [Link](https://arxiv.org/abs/2601.18778).
- Z. Tan, W. Yu, J. Si, T. Liu, K. Guan, H. Jin, J. Tao, X. Yuan, D. Ma, X. Zhang, T. Yang, and L. Sun (2026). TriPlay-RL: tri-role self-play reinforcement learning for LLM safety alignment. External Links: arXiv:2601.18292, [Document](https://dx.doi.org/10.48550/arXiv.2601.18292), [Link](https://arxiv.org/abs/2601.18292).
- M. Valipour, M. Rezagholizadeh, I. Kobyzev, and A. Ghodsi (2023). DyLoRA: parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, pp. 3274–3287. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.eacl-main.239), [Link](https://aclanthology.org/2023.eacl-main.239/).
- V. Vassiliades, K. Chatzilygeroudis, and J. Mouret (2018). Using centroidal Voronoi tessellations to scale up the multidimensional archive of phenotypic elites algorithm. IEEE Transactions on Evolutionary Computation 22(4), pp. 623–630. External Links: [Document](https://dx.doi.org/10.1109/TEVC.2017.2735550).
- O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), pp. 350–354. External Links: [Document](https://dx.doi.org/10.1038/s41586-019-1724-z).
- R. Wang, J. Lehman, J. Clune, and K. O. Stanley (2019). POET: open-ended coevolution of environments and their optimized solutions. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 142–151. External Links: [Document](https://dx.doi.org/10.1145/3321707.3321799), [Link](https://doi.org/10.1145/3321707.3321799).
- R\. J\. Williams \(1992\)Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.Machine Learning8,pp\. 229–256\.External Links:[Document](https://dx.doi.org/10.1007/BF00992696)Cited by:[§3\.1](https://arxiv.org/html/2605.16727#S3.SS1.p1.8)\.
- P\. Yadav, D\. Tam, L\. Choshen, C\. Raffel, and M\. Bansal \(2023\)TIES\-Merging: resolving interference when merging models\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 7093–7115\.External Links:[Link](https://openreview.net/forum?id=xtaX3WyCj1)Cited by:[§1](https://arxiv.org/html/2605.16727#S1.p3.1),[§2](https://arxiv.org/html/2605.16727#S2.SS0.SSS0.Px3.p1.1),[§3\.4](https://arxiv.org/html/2605.16727#S3.SS4.SSS0.Px2.p1.3),[§8](https://arxiv.org/html/2605.16727#S8.SS0.SSS0.Px13.p1.2),[§8](https://arxiv.org/html/2605.16727#S8.SS0.SSS0.Px9.p1.2)\.
- Z\. Ye, R\. Agarwal, T\. Liu, R\. Joshi, S\. Velury, Q\. V\. Le, Q\. Tan, and Y\. Liu \(2024\)Scalable Reinforcement Post\-Training Beyond Static Human Prompts: Evolving Alignment via Asymmetric Self\-Play\.External Links:2411\.00062,[Document](https://dx.doi.org/10.48550/arXiv.2411.00062),[Link](https://arxiv.org/abs/2411.00062)Cited by:[item 1](https://arxiv.org/html/2605.16727#S1.I1.i1.p1.1),[§1](https://arxiv.org/html/2605.16727#S1.p2.1),[§6](https://arxiv.org/html/2605.16727#S6.SS0.SSS0.Px1.p1.1)\.
- L\. Yu, B\. Yu, H\. Yu, F\. Huang, and Y\. Li \(2024\)Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch\.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.235,pp\. 57755–57775\.External Links:[Link](https://proceedings.mlr.press/v235/yu24p.html)Cited by:[§1](https://arxiv.org/html/2605.16727#S1.p3.1),[§2](https://arxiv.org/html/2605.16727#S2.SS0.SSS0.Px3.p1.1),[§3\.4](https://arxiv.org/html/2605.16727#S3.SS4.SSS0.Px2.p1.3),[§8](https://arxiv.org/html/2605.16727#S8.SS0.SSS0.Px8.p1.7)\.
- W\. Yuan, R\. Y\. Pang, K\. Cho, X\. Li, S\. Sukhbaatar, J\. Xu, and J\. E\. Weston \(2024\)Self\-rewarding language models\.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.235,pp\. 57905–57923\.External Links:[Link](https://proceedings.mlr.press/v235/yuan24d.html)Cited by:[§2](https://arxiv.org/html/2605.16727#S2.SS0.SSS0.Px1.p1.1)\.
- E\. Zelikman, Y\. Wu, J\. Mu, and N\. D\. Goodman \(2022\)STaR: bootstrapping reasoning with reasoning\.InAdvances in Neural Information Processing Systems,Vol\.35\.External Links:[Link](https://openreview.net/forum?id=_3ELRdg2sgI)Cited by:[§2](https://arxiv.org/html/2605.16727#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Zhang, P\. Ye, X\. Yang, S\. Feng, S\. Zhang, L\. Bai, W\. Ouyang, and S\. Hu \(2025\)Nature\-inspired population\-based evolution of large language models\.External Links:2503\.01155,[Document](https://dx.doi.org/10.48550/arXiv.2503.01155),[Link](https://arxiv.org/abs/2503.01155)Cited by:[§1](https://arxiv.org/html/2605.16727#S1.p3.1),[§2](https://arxiv.org/html/2605.16727#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Zhao, Y\. Wu, Y\. Yue, T\. Wu, Q\. Xu, Y\. Yue, M\. Lin, S\. Wang, Q\. Wu, Z\. Zheng, and G\. Huang \(2025\)Absolute zero: reinforced self\-play reasoning with zero data\.External Links:2505\.03335,[Document](https://dx.doi.org/10.48550/arXiv.2505.03335),[Link](https://arxiv.org/abs/2505.03335)Cited by:[§1](https://arxiv.org/html/2605.16727#S1.p2.1),[§14](https://arxiv.org/html/2605.16727#S14.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.16727#S2.SS0.SSS0.Px1.p1.1)\.


## 6 Extended Related Work

#### Further asymmetric self\-play methods\.

We expand here on the asymmetric self\-play methods only briefly summarised in the main paper\. SPIRAL \(Liu et al\., [2026](https://arxiv.org/html/2605.16727#bib.bib61)\) extends shared\-policy self\-play to two\-player zero\-sum games \(TicTacToe, Kuhn Poker, Simple Negotiation\) with role\-conditioned advantages: it is genuinely symmetric in the sense that both players sample from the same policy\. SPC \(Chen et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib41)\) trains a step\-level process\-reward critic against a “sneaky generator” that injects subtle reasoning errors; this is a process\-reward\-model training recipe rather than a problem\-solving self\-play setup\. GASP \(Jana et al\., [2026](https://arxiv.org/html/2605.16727#bib.bib42)\) grounds a coding teacher on real hard “goalpost” problems the student cannot yet solve and asks the teacher to emit easier \(“lemma”\) and harder \(“lift”\) variants bridging the gap: the distinctive mechanism is the use of real problems as anchor points, which neither AZR nor PopuLoRA relies on\. SPICE \(Liu et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib43)\) gives its Challenger document access that the Reasoner lacks; document asymmetry drives the adversarial curriculum\. eva \(Ye et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib59)\) targets *RLHF* \(not RLVR\) by evolving prompt distributions with a regret\-based estimate\-sample\-evolve procedure over a shared policy; the reward is a learned preference model, not a programmatic verifier\.

#### Further LoRA\-space and evolutionary methods\.

LoRAHub \(Huang et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib57)\) black\-box\-searches mixture coefficients over a library of pre\-trained LoRA experts at test time, without any weight\-space gradient\. X\-LoRA \(Buehler and Buehler, [2024](https://arxiv.org/html/2605.16727#bib.bib58)\) token\-level\-gates between frozen LoRAs at inference, producing dynamic mixtures without touching the adapter weights\. Heterogeneous Swarms \(Feng et al\., [2025a](https://arxiv.org/html/2605.16727#bib.bib56)\) jointly optimises model roles \(as DAG adjacency\) and weights under PSO\. Promptbreeder \(Fernando et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib66)\) evolves text artifacts \(prompts\) over a fixed LLM with an analogous fitness\-and\-mutation structure, but does not touch model weights\. ESSA \(Korotyshova et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib53)\) restricts ES to low\-rank attention adapters compressed via SVD and runs entirely in low\-precision inference, targeting post\-SFT alignment\. ES\-at\-Scale \(Qiu et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib54)\) \(Cognizant\) reports the first successful full\-parameter ES fine\-tuning of multi\-billion\-parameter LLMs, beating PPO/GRPO on sparse\-reward reasoning tasks such as Countdown\. Spherical linear interpolation \(SLERP\) is a standard alternative to linear midpoints when the two merged vectors are near anti\-parallel\. None of these operates inside an online PBT loop over a co\-evolving adversarial partner, which is the regime PopuLoRA targets\.

#### TrueSkill and matchmaking details\.

We use TrueSkill \(Herbrich et al\., [2006](https://arxiv.org/html/2605.16727#bib.bib67)\), a Bayesian skill\-rating system designed for head\-to\-head matches, to produce a per\-adapter rating for each role at every step\. Each adapter carries a pair $(\mu, \sigma^{2})$: after a matchup where teacher $t$’s problems are solved at rate $\rho$ by student $s$, we record the matchup as a win for whichever role came out above its conditional expectation \(the teacher wins if $\rho$ is small enough that its problem was too hard for the student on average, and vice versa\), and update $(\mu, \sigma)$ for both adapters under the standard TrueSkill Bayesian update\. Ratings then drive two things: \(i\) *matchmaking*: we sample the next opponent by prioritised fictitious self\-play \(PFSP\) \(Vinyals et al\., [2019](https://arxiv.org/html/2605.16727#bib.bib1); Heinrich and Silver, [2016](https://arxiv.org/html/2605.16727#bib.bib14)\), weighting draw probability by the TrueSkill\-predicted win rate, so pairings concentrate on informative near\-balanced matchups rather than mismatches; and \(ii\) *culling*: at evolution time we rank each sub\-population by the TrueSkill *lower\-confidence bound* $\mu - k\sigma$ and replace the bottom fraction, which penalises low\-$\mu$ adapters but also high\-$\sigma$ \(under\-sampled\) ones\.
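The culling rule can be illustrated with a minimal sketch. The `Rating` container, the function names, and the values `k=3.0` and `cull_fraction=0.25` are illustrative assumptions, not settings reported in the paper; the ratings themselves are made up to show how a high\-mean but under\-sampled adapter can rank last.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    name: str
    mu: float      # TrueSkill posterior mean
    sigma: float   # TrueSkill posterior standard deviation


def lower_confidence_bound(rating: Rating, k: float) -> float:
    """Conservative skill estimate mu - k * sigma."""
    return rating.mu - k * rating.sigma


def select_for_replacement(ratings, k=3.0, cull_fraction=0.25):
    """Return the names of the bottom `cull_fraction` of adapters by LCB.

    k and cull_fraction are hypothetical defaults chosen for illustration.
    """
    n_cull = int(len(ratings) * cull_fraction)
    ranked = sorted(ratings, key=lambda r: lower_confidence_bound(r, k))
    return [r.name for r in ranked[:n_cull]]


# Made-up ratings for a four-teacher sub-population.
teachers = [
    Rating("t0", mu=27.0, sigma=1.0),  # LCB = 24.0
    Rating("t1", mu=30.0, sigma=4.0),  # LCB = 18.0 (highest mean, high uncertainty)
    Rating("t2", mu=22.0, sigma=1.0),  # LCB = 19.0
    Rating("t3", mu=26.0, sigma=1.5),  # LCB = 21.5
]
culled = select_for_replacement(teachers)
print(culled)
```

In this toy example the adapter with the highest mean is the one selected for replacement, because its large $\sigma$ pulls its lower\-confidence bound below that of every other member.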

## 7 Hyperparameters

Table [1](https://arxiv.org/html/2605.16727#S7.T1) lists the full configuration used for every population member and the single\-agent baseline\. All values are held identical across arms unless a row explicitly splits them\.

Table 1: Full PopuLoRA hyperparameter configuration\. All rows are shared between the baseline and population runs unless otherwise noted\.

#### Per\-adapter compute is identical; wall\-clock scales sub\-linearly thanks to multi\-LoRA batching\.

By design, no hyperparameter is tuned per arm\. Every population member is configured identically to the single\-agent baseline: same base model, rank, learning rate, rollout $n$, per\-prompt budget, advantage estimator, and training horizon\. At each training step the baseline produces one matchup’s worth of rollouts and one single\-adapter gradient update, while the population produces $N_{T}$ matchups and updates all $N_{T}+N_{S}$ adapters in a single mixed mega\-batch\. Consequently,

$$
\begin{aligned}
\text{rollouts / step (pop)} &= (N_{T}+N_{S}) \times \text{rollouts / step (baseline)},\\
\text{grad updates / step (pop)} &= (N_{T}+N_{S}) \times \text{grad updates / step (baseline)}.
\end{aligned}
$$

Every adapter in the population therefore sees the same number of tokens, the same rollout distribution size, and the same number of policy updates per step as the baseline agent does\. The only step\-scoped factor that changes between arms is the number of adapters training in parallel, so observed gaps are attributable to population dynamics rather than extra compute per adapter, a larger rollout budget per prompt, or a richer optimiser configuration\.

#### Compute accounting\.

Table [2](https://arxiv.org/html/2605.16727#S7.T2) reports the actual wall\-clock cost of each configuration on the same hardware \($1 \times 8 \times$ H100 node\)\. The vLLM multi\-LoRA scheduler \(Sheng et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib17)\) batches all adapters through a shared base\-model forward pass, so rollout time is dominated by the base model rather than the adapter count\. This gives strongly sub\-linear wall\-clock scaling: the 4T\+4S population trains $8\times$ more adapters for only $1.31\times$ the wall\-clock, yielding a $6.1\times$ throughput gain in adapter\-steps per hour\. All configurations run on identical hardware with no additional nodes\.
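As a quick check on the quoted figures, and assuming throughput is measured as adapters trained per unit of wall\-clock over the same number of steps, the gain is simply the ratio of the two factors:

```latex
\frac{\text{adapter-steps per hour (4T+4S)}}{\text{adapter-steps per hour (baseline)}}
  = \frac{8}{1.31} \approx 6.1
```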

Table 2: Compute accounting at 200 training steps\. All runs use the same $1 \times 8 \times$ H100 \(80 GB\) node\. Wall\-clock is the median per\-step time \(excluding the first 5 warm\-up steps\) extrapolated to 200 steps\. The 4T\+4S population trains $8\times$ more adapters at only $1.31\times$ the baseline wall\-clock cost\.

#### Scaling ceiling\.

Scaling the population on a single node is bounded by two resources that grow with the adapter count: per\-adapter activation memory on the backward pass, and collective\-op timeouts when heterogeneous per\-rank work \(dominated by long\-running sandbox validations on some ranks\) blocks the allreduce\. At the population sizes we report, both are manageable with tighter micro\-batching and a raised NCCL watchdog; beyond that, the scaling headroom is set by the interaction between per\-adapter backward memory, the vLLM KV\-cache reservation for rollouts, and rank\-heterogeneous validation time\. Sharding the base across nodes or offloading validation to a dedicated worker pool is the natural next step\.

## 8 Full LoRA Operator Catalog

We catalogue all 17 operators implemented in the experiments/lora\_ops/ benchmark: six mutations, nine crossovers, and two identity controls\. Every operator consumes one or two rank\-$r$ LoRA state dicts and emits a rank\-matched child in seconds, without any retraining\. For mutations we write $\Delta W = BA^{\top}$ for a single LoRA module, where $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times r}$; for crossovers we mark parents with superscripts $\Delta W^{(1)}, \Delta W^{(2)}$\. The effective\-delta SVD used by several operators is computed via a double\-QR on the $r \times r$ core \(see \_efficient\_svd\_of\_BA\), since a full SVD on the $d \times d$ reconstruction is prohibitively expensive at 7B scale\.
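The double\-QR idea can be sketched in a few lines of NumPy. This is not the repository's `_efficient_svd_of_BA`; it is a minimal illustration that assumes the factor layout `B: (d_out, r)` and `A: (r, d_in)`, so the effective delta is formed as `B @ A`. The function name and the toy dimensions are hypothetical.

```python
import numpy as np


def efficient_svd_of_BA(B, A):
    """Thin SVD of delta = B @ A without forming the (d_out x d_in) product.

    Assumed layout: B is (d_out, r), A is (r, d_in).
    """
    Qb, Rb = np.linalg.qr(B)      # B   = Qb @ Rb, Qb: (d_out, r), Rb: (r, r)
    Qa, Ra = np.linalg.qr(A.T)    # A.T = Qa @ Ra, Qa: (d_in, r),  Ra: (r, r)
    core = Rb @ Ra.T              # (r, r); delta = Qb @ core @ Qa.T
    Uc, S, Vct = np.linalg.svd(core)
    U = Qb @ Uc                   # left singular vectors of delta
    Vt = Vct @ Qa.T               # right singular vectors (transposed)
    return U, S, Vt


rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 4        # toy sizes for illustration only
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
U, S, Vt = efficient_svd_of_BA(B, A)
print(np.allclose((U * S) @ Vt, B @ A))
```

The only dense SVD is taken on the $r \times r$ core, so the cost is dominated by the two thin QR factorisations rather than by the full weight\-shaped matrix.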

#### M1 \(SVD\-structured mutation\)\.

Preserve learned structure while perturbing it within the singular\-value basis\. Compute $\Delta W = U \Sigma V^{\top}$, perturb $\Sigma$ multiplicatively with log\-normal noise $\Sigma \leftarrow \Sigma \odot \exp(\epsilon\, z),\ z \sim \mathcal{N}(0, I_{r})$, and apply first\-order Cayley near\-identity rotations $R = I + \epsilon K$ with $K = (M - M^{\top})/2,\ M \sim \mathcal{N}(0, I_{r \times r})$ to both $U$ and $V$\. Refactor with the balanced split $B' = U'\sqrt{\Sigma'},\ A' = \sqrt{\Sigma'}\, V'^{\top}$\. Default $\epsilon = 0.1$\. Note that $R$ is only approximately orthogonal: at higher strength the Cayley first\-order approximation drifts\.
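A minimal sketch of this mutation on toy dimensions follows. It makes several assumptions that are not stated in the text: factors are laid out as `B: (d_out, r)` and `A: (r, d_in)` with delta `B @ A`, independent skew\-symmetric matrices are drawn for $U$ and $V$, and a dense `np.linalg.svd` replaces the double\-QR route for brevity. All function names are hypothetical.

```python
import numpy as np


def near_identity_rotation(r, eps, rng):
    """First-order rotation R = I + eps * K with K skew-symmetric."""
    M = rng.normal(size=(r, r))
    K = (M - M.T) / 2.0
    return np.eye(r) + eps * K


def m1_svd_mutation(B, A, eps=0.1, rng=None):
    """Perturb singular values and bases of delta = B @ A, keeping rank r."""
    if rng is None:
        rng = np.random.default_rng()
    r = B.shape[1]
    # Dense SVD is fine at toy scale; it is only a stand-in here.
    U, S, Vt = np.linalg.svd(B @ A, full_matrices=False)
    U, S, Vt = U[:, :r], S[:r], Vt[:r, :]
    S_new = S * np.exp(eps * rng.normal(size=r))         # log-normal scaling
    U_new = U @ near_identity_rotation(r, eps, rng)
    V_new = Vt.T @ near_identity_rotation(r, eps, rng)   # separate draw (assumption)
    root = np.sqrt(S_new)
    B_child = U_new * root               # U' sqrt(Sigma')
    A_child = root[:, None] * V_new.T    # sqrt(Sigma') V'^T
    return B_child, A_child


rng = np.random.default_rng(0)
B = rng.normal(size=(16, 4))
A = rng.normal(size=(4, 12))
B_same, A_same = m1_svd_mutation(B, A, eps=0.0, rng=rng)   # zero strength
B_child, A_child = m1_svd_mutation(B, A, eps=0.1, rng=rng)
print(np.allclose(B_same @ A_same, B @ A))
```

At zero strength the operator reduces to a balanced refactorisation of the parent, which gives a convenient sanity check before turning the noise on.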

#### M2 \(layer\-selective Gaussian\)\.

Uniformly sample a fraction $f$ of LoRA module slots without replacement and add per\-tensor adaptive Gaussian noise to their $A$ and $B$ factors: $A \mathrel{+}= \mathcal{N}(0, (\epsilon \cdot \operatorname{std}(A))^{2})$ and similarly for $B$\. The remaining $1-f$ of modules are copied verbatim\. Defaults $\epsilon = 0.1,\ f = 0.33$\. The perturbation is coherent within each selected module and leaves the majority of the adapter untouched\.

#### M3 \(component masking\)\.

SVD the effective delta, sample $\lceil \rho \cdot r \rceil$ random singular indices, zero the corresponding singular values, and refactor\. Default $\rho = 0.3$\. The indices are *uniformly random*, not bottom\-$k$, so the dropped directions are not necessarily the least important; the effective rank drops by $\lceil \rho \cdot r \rceil$, i\.e\. by roughly a fraction $\rho$\.
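The masking step can be sketched as follows, assuming the layout `B: (d_out, r)`, `A: (r, d_in)` and a balanced square\-root refactor like the one used for M1; the function name and toy sizes are hypothetical.

```python
import math

import numpy as np


def m3_component_masking(B, A, rho=0.3, rng=None):
    """Zero ceil(rho * r) uniformly random singular values of delta = B @ A."""
    if rng is None:
        rng = np.random.default_rng()
    r = B.shape[1]
    U, S, Vt = np.linalg.svd(B @ A, full_matrices=False)
    U, S, Vt = U[:, :r], S[:r].copy(), Vt[:r, :]
    n_drop = math.ceil(rho * r)
    dropped = rng.choice(r, size=n_drop, replace=False)  # random, not bottom-k
    S[dropped] = 0.0
    root = np.sqrt(S)
    return U * root, root[:, None] * Vt, dropped


rng = np.random.default_rng(0)
r = 8
B = rng.normal(size=(32, r))
A = rng.normal(size=(r, 24))
B_child, A_child, dropped = m3_component_masking(B, A, rho=0.3, rng=rng)
print(len(dropped), np.linalg.matrix_rank(B_child @ A_child))
```

With $r = 8$ and $\rho = 0.3$ the operator drops $\lceil 2.4 \rceil = 3$ directions, so the child keeps the rank\-8 tensor shapes but its effective delta has rank 5.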

#### M4 \(full Gaussian\)\.

Add per\-tensor adaptive Gaussian noise to every $A$ and $B$ in the adapter: $A \mathrel{+}= \mathcal{N}(0, (\epsilon \cdot \operatorname{std}(A))^{2})$\. Default $\epsilon = 0.15$, deliberately larger than M2\. Narrow tensors receive proportionally less noise because the scale is tied to each tensor’s running standard deviation\.

#### M5 \(NEFTune\-style\)\.

Dimension\-aware uniform perturbation on the input factor only, adapted from NEFTune’s \(Jain et al\., [2024b](https://arxiv.org/html/2605.16727#bib.bib34)\) embedding\-space noise\. For $A \in \mathbb{R}^{L \times d}$ draw $\eta \sim \mathrm{Uniform}\!\left(-\alpha/\sqrt{Ld},\ \alpha/\sqrt{Ld}\right)$ elementwise and set $A' = A + \eta$; $B$ is left untouched\. Default $\alpha = 10$, which gives per\-element noise $\approx 0.03$ on Qwen2\.5\-Coder\-7B rank\-32 LoRA\.
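The noise bound can be reproduced with a short sketch. The width `hidden = 3584` is an assumption about the projection the adapter is attached to, chosen because it makes the bound land near the quoted 0.03; the function name and initial values are hypothetical.

```python
import math

import numpy as np


def m5_neftune_mutation(A, B, alpha=10.0, rng=None):
    """Add Uniform(-alpha/sqrt(L*d), +alpha/sqrt(L*d)) noise to A only."""
    if rng is None:
        rng = np.random.default_rng()
    L, d = A.shape
    bound = alpha / math.sqrt(L * d)
    eta = rng.uniform(-bound, bound, size=A.shape)
    return A + eta, B.copy(), bound


rank, hidden = 32, 3584   # assumed shapes: rank-32 adapter on a 3584-wide input
rng = np.random.default_rng(0)
A = rng.normal(scale=0.02, size=(rank, hidden))
B = np.zeros((hidden, rank))
A_child, B_child, bound = m5_neftune_mutation(A, B, rng=rng)
print(round(bound, 4))
```

Under these assumed shapes $\alpha/\sqrt{Ld} = 10/\sqrt{32 \cdot 3584} \approx 0.0295$, in line with the per\-element figure above.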

#### M6 \(rank perturbation\)\.

Structured rank collapse plus fine\-grained jitter on the survivors, inspired by DyLoRA \(Valipour et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib35)\)\. SVD the effective delta, zero the bottom\-$k$ singular values, and multiplicatively perturb the surviving top $r-k$ values by $\mathcal{N}(1, \sigma)$\. Defaults $k = 2,\ \sigma = 0.05$\. Preserves top singular *directions*; this is a real rank collapse unlike M3’s random masking\.

#### copy\_parent \(identity control\)\.

Deep copy of the parent state dict\. Used as the ceiling of the mutation experiment: any real mutation should land at or below the first\-step reward of a copy\_parent child under the same retrain schedule\. It was also the sensor that caught the FSDP v1 load\_lora\_only silent\-no\-op bug during Phase 0: every copy\_parent child was starting at the base\-model reward regardless of parent snapshot, which was immediate evidence that the adapter was never reaching the rollout engine\.

#### X1 \(DARE\)\.

Drop\-and\-rescale recipe of Yu et al\. \([2024](https://arxiv.org/html/2605.16727#bib.bib31)\)\. For each of the four factors $\{A^{(1)}, A^{(2)}, B^{(1)}, B^{(2)}\}$ independently, build a Bernoulli keep mask with probability $1-p$, rescale the survivors by $1/(1-p)$ to preserve expectation, and average the two parents\. Default $p = 0.7$\. Because drop\-and\-rescale is applied independently to $A$ and $B$, the effective\-delta expectation contains cross\-parent interaction terms \($\mathbb{E}[B_{\text{child}} A_{\text{child}}^{\top}] \neq \tfrac{1}{2}\sum_{i} B^{(i)} A^{(i)\top}$\); this is a property of the original DARE formulation\.
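A minimal factor\-level sketch of this crossover is shown below; the function names are hypothetical and the all\-ones tensor is only there to make the expectation\-preserving property easy to see.

```python
import numpy as np


def dare_drop_rescale(x, p, rng):
    """Zero each element with probability p; rescale survivors by 1/(1-p)."""
    keep = rng.random(x.shape) >= p
    return np.where(keep, x / (1.0 - p), 0.0)


def x1_dare_crossover(A1, B1, A2, B2, p=0.7, rng=None):
    """Drop-and-rescale each factor independently, then average the parents."""
    if rng is None:
        rng = np.random.default_rng()
    A_child = 0.5 * (dare_drop_rescale(A1, p, rng) + dare_drop_rescale(A2, p, rng))
    B_child = 0.5 * (dare_drop_rescale(B1, p, rng) + dare_drop_rescale(B2, p, rng))
    return A_child, B_child


rng = np.random.default_rng(0)
ones = np.ones(200_000)
sparse = dare_drop_rescale(ones, p=0.7, rng=rng)
print(sparse.mean())           # close to 1.0: the per-factor mean is preserved
print((sparse != 0).mean())    # close to 0.3: roughly 30% of entries survive

A_child, B_child = x1_dare_crossover(
    np.ones((4, 8)), np.ones((8, 4)), np.ones((4, 8)), np.ones((8, 4)), rng=rng
)
```

The expectation is preserved for each factor separately; as the text notes, the product of two independently masked factors does not inherit that property.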

#### X2 \(layer\-wise crossover\)\.

For each LoRA module slot, a coin flip selects $(A, B)$ entirely from parent 1 or entirely from parent 2\. This preserves intra\-module coherence \(an $A, B$ pair is always consistent\) but breaks inter\-module coherence\. Reminiscent of TIES\-style sign\-aligned mergers \(Yadav et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib32)\) at module granularity\.

#### X3 \(SVD subspace crossover\)\.

Mix principal and secondary singular subspaces\. SVD both parents, draw $k \sim \mathrm{Uniform}\{1, \dots, r-1\}$, and build the child by concatenating columns: $U_{\text{child}} = [U^{(1)}_{:,:k} \mid U^{(2)}_{:,k:r}]$, with $\Sigma_{\text{child}}$ and $V_{\text{child}}$ similarly composed\. Reconstruct $B', A'$ from the composed factors\. The concatenated $U_{\text{child}}$ and $V_{\text{child}}$ are *not themselves orthonormal* because they mix columns from two different orthonormal bases; the result is a valid rank\-$r$ LoRA but not the SVD of any single matrix\.
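The column\-concatenation step can be sketched as below, assuming the layout `B: (d_out, r)`, `A: (r, d_in)` and a balanced square\-root refactor; function names are hypothetical and a dense SVD stands in for the efficient routine.

```python
import numpy as np


def thin_svd(B, A):
    """Rank-r SVD of delta = B @ A (dense SVD used for brevity)."""
    r = B.shape[1]
    U, S, Vt = np.linalg.svd(B @ A, full_matrices=False)
    return U[:, :r], S[:r], Vt[:r, :]


def x3_subspace_crossover(B1, A1, B2, A2, rng=None, k=None):
    """Top-k singular triplets from parent 1, remaining r-k from parent 2."""
    if rng is None:
        rng = np.random.default_rng()
    r = B1.shape[1]
    if k is None:
        k = int(rng.integers(1, r))          # uniform on {1, ..., r-1}
    U1, S1, Vt1 = thin_svd(B1, A1)
    U2, S2, Vt2 = thin_svd(B2, A2)
    U = np.concatenate([U1[:, :k], U2[:, k:]], axis=1)
    S = np.concatenate([S1[:k], S2[k:]])
    Vt = np.concatenate([Vt1[:k, :], Vt2[k:, :]], axis=0)
    root = np.sqrt(S)
    return U * root, root[:, None] * Vt, k


rng = np.random.default_rng(0)
r = 6
B1, A1 = rng.normal(size=(20, r)), rng.normal(size=(r, 15))
B2, A2 = rng.normal(size=(20, r)), rng.normal(size=(r, 15))
B_child, A_child, k = x3_subspace_crossover(B1, A1, B2, A2, rng=rng)
print(k, B_child.shape, A_child.shape)
```

The child's delta is exactly the sum of parent 1's leading $k$ rank\-one components and parent 2's trailing $r-k$ components, which is why it stays rank\-$r$ even though the stacked bases are no longer orthonormal.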

#### X4 \(extrapolative\)\.

Linear combination past parent 2 along the parent difference vector, analogous to task arithmetic \(Ilharco et al\., [2023](https://arxiv.org/html/2605.16727#bib.bib33)\) with a coefficient beyond one\. Sample $\eta \sim \mathrm{Uniform}(\eta_{\min}, \eta_{\max})$ once per call and set $A_{\text{child}} = A^{(1)} + \eta\, (A^{(2)} - A^{(1)})$, similarly for $B$\. Default $(\eta_{\min}, \eta_{\max}) = (1.0, 1.5)$, so the child always lies beyond parent 2 along the parent\-difference direction\.

#### X5 \(task\-arithmetic linear merge\)\.

Tensor\-wise convex combination: $\text{child}[k] = \alpha\, P^{(1)}[k] + (1-\alpha)\, P^{(2)}[k]$ applied to every tensor\. Default $\alpha = 0.5$\. Deterministic \(no rng draws\); this is the canonical linear merge of Ilharco et al\. \([2023](https://arxiv.org/html/2605.16727#bib.bib33)\)\.

#### X6 \(TIES merge\)\.

Sign\-aligned merge of Yadav et al\. \([2023](https://arxiv.org/html/2605.16727#bib.bib32)\)\. Per tensor: \(1\) *trim* by zeroing the lowest $\tau$ fraction of elements per parent by magnitude, \(2\) *elect* the consensus sign as the sign of the sum over trimmed parents, \(3\) *disjoint merge* by averaging only the parents whose sign agrees with the elected sign at that position; positions where no parent agrees output zero\. Default $\tau = 0.2$\. Deterministic\. For two parents this degenerates to picking the larger\-magnitude survivor when the signs disagree\.
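The three steps can be traced on a small two\-parent example. This is a simplified sketch of the trim, elect, and disjoint\-merge steps as described above, not the reference TIES implementation; function names and the input vectors are made up for illustration.

```python
import numpy as np


def trim_by_magnitude(x, tau):
    """Zero the lowest-magnitude `tau` fraction of elements of x."""
    flat = x.ravel().copy()
    n_trim = int(tau * flat.size)
    smallest = np.argsort(np.abs(flat))[:n_trim]
    flat[smallest] = 0.0
    return flat.reshape(x.shape)


def x6_ties_merge(p1, p2, tau=0.2):
    """Trim, elect a per-position sign, then average only agreeing parents."""
    trimmed = np.stack([trim_by_magnitude(p1, tau), trim_by_magnitude(p2, tau)])
    elected = np.sign(trimmed.sum(axis=0))
    agrees = (np.sign(trimmed) == elected) & (trimmed != 0)
    count = agrees.sum(axis=0)
    total = np.where(agrees, trimmed, 0.0).sum(axis=0)
    return np.where(count > 0, total / np.maximum(count, 1), 0.0)


p1 = np.array([1.0, -2.0, 3.0, 0.1, 4.0])
p2 = np.array([1.0, 3.0, -0.5, 5.0, -4.0])
merged = x6_ties_merge(p1, p2)
print(merged)   # values: 1, 3, 3, 5, 0
```

Position 1 shows the two\-parent degeneracy (signs disagree, so the larger\-magnitude entry 3.0 wins), positions 2 and 3 keep the only untrimmed survivor, and position 4 outputs zero because the trimmed sum has no sign to elect.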

#### X7 \(DELLA\)\.

Magnitude\-proportional drop with DARE\-style rescale, inspired by DELLA\-Merging \(Deep et al\., [2024](https://arxiv.org/html/2605.16727#bib.bib36)\)\. Per tensor and per parent independently: flatten and rank by $|\cdot|$ descending, assign drop probability $p(r) = \varepsilon + (1-\varepsilon)\, r/(n-1)$ so the largest element has drop probability $\varepsilon$ and the smallest has drop probability $1$, apply independent Bernoulli keeps, rescale survivors by $1/(1-p(r))$, and average the two processed parents\. Default $\varepsilon = 0.1$\. Uses rng, so different seeds produce different children\.
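A minimal sketch of the rank\-dependent drop follows. It works with the keep probability $1 - p(r) = (1-\varepsilon)(1 - r/(n-1))$, which is algebraically the same schedule but avoids dividing by a value that is only approximately zero; function names and the toy tensor are hypothetical.

```python
import numpy as np


def della_drop_rescale(x, eps, rng):
    """Magnitude-ranked Bernoulli drop with per-element 1/(1 - p) rescale."""
    flat = x.ravel()
    n = flat.size
    order = np.argsort(-np.abs(flat))       # indices, largest magnitude first
    rank = np.empty(n, dtype=np.int64)
    rank[order] = np.arange(n)              # rank 0 = largest, n-1 = smallest
    # keep_prob = 1 - p(r) with p(r) = eps + (1 - eps) * r / (n - 1)
    keep_prob = (1.0 - eps) * (1.0 - rank / max(n - 1, 1))
    keep = rng.random(n) < keep_prob        # keep_prob == 0 is never kept
    out = np.zeros(n)
    out[keep] = flat[keep] / keep_prob[keep]
    return out.reshape(x.shape)


def x7_della_crossover(p1, p2, eps=0.1, rng=None):
    """Process each parent independently, then average."""
    if rng is None:
        rng = np.random.default_rng()
    return 0.5 * (della_drop_rescale(p1, eps, rng) + della_drop_rescale(p2, eps, rng))


x = np.array([4.0, -0.1, 2.0, -3.0])
processed = della_drop_rescale(x, eps=0.1, rng=np.random.default_rng(0))
child = x7_della_crossover(x, -x, rng=np.random.default_rng(1))
print(processed)
```

One consequence of the schedule as stated is that the smallest\-magnitude element of each tensor is always dropped, while the largest survives with probability $1-\varepsilon$.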

#### X8 \(SLERP\)\.

Spherical linear interpolation on the flattened\-tensor vector\. Per tensor, flatten both parents to $a, b$ and compute $\cos\theta = \langle a, b\rangle / (\|a\| \|b\|)$ clipped to $(-1, 1)$\. The merged vector is $\text{merged} = \tfrac{\sin((1-t)\theta)}{\sin\theta}\, a + \tfrac{\sin(t\theta)}{\sin\theta}\, b$\. Falls back to linear interpolation when either norm or $\sin\theta$ is near\-zero\. Default $t = 0.5$\. Deterministic\. Useful when a linear midpoint would overshrink the result because the parents are nearly anti\-parallel\.
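The formula and its fallback can be sketched as below. The tolerance `tol` and the use of `np.clip` on the closed interval are illustrative choices; the function name is hypothetical.

```python
import numpy as np


def x8_slerp(p1, p2, t=0.5, tol=1e-6):
    """Spherical interpolation between two tensors, flattened to vectors."""
    a, b = p1.ravel(), p2.ravel()
    norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
    if norm_a < tol or norm_b < tol:
        return (1.0 - t) * p1 + t * p2            # linear fallback
    cos_theta = np.clip(np.dot(a, b) / (norm_a * norm_b), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    sin_theta = np.sin(theta)
    if abs(sin_theta) < tol:
        return (1.0 - t) * p1 + t * p2            # (anti-)parallel fallback
    w1 = np.sin((1.0 - t) * theta) / sin_theta
    w2 = np.sin(t * theta) / sin_theta
    return (w1 * a + w2 * b).reshape(p1.shape)


a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
merged = x8_slerp(a, b)
linear = 0.5 * (a + b)
print(np.linalg.norm(merged), np.linalg.norm(linear))
```

For two orthogonal unit vectors the spherical midpoint keeps unit norm, whereas the linear midpoint shrinks to norm $\sqrt{0.5} \approx 0.71$; the gap widens as the angle between the parents grows.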

#### X9 \(Fisher\-weighted merge\)\.

Data\-free Fisher\-weighted averaging in the spirit of Matena and Raffel \([2022](https://arxiv.org/html/2605.16727#bib.bib37)\): $\text{child} = (F^{(1)}\, p^{(1)} + F^{(2)}\, p^{(2)}) / (F^{(1)} + F^{(2)} + \varepsilon)$ per element, with $F^{(i)} = (p^{(i)})^{2}$ as a data\-free proxy for the diagonal Fisher\. The original formulation requires a gradient pass on a calibration set; our $p^{2}$ proxy preserves the large\-magnitude\-wins behaviour without that cost\. Deterministic\.
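The per\-element rule is short enough to state directly in code. The value `eps=1e-8` is an assumed stabiliser, since the text does not give one, and the function name is hypothetical.

```python
import numpy as np


def x9_fisher_proxy_merge(p1, p2, eps=1e-8):
    """Element-wise weighted mean with squared magnitude as the weight."""
    f1, f2 = p1 ** 2, p2 ** 2
    return (f1 * p1 + f2 * p2) / (f1 + f2 + eps)


p1 = np.array([10.0, 0.1, 0.0])
p2 = np.array([0.1, -10.0, 0.0])
merged = x9_fisher_proxy_merge(p1, p2)
print(merged)
```

Each position is pulled almost entirely toward whichever parent has the larger magnitude there (about 10 and about -10 in the first two slots), and positions where both parents are zero stay at zero thanks to the stabiliser.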

#### linear\_0\_5 \(plain\-mean control\)\.

Wrapper around X5 with $\alpha = 0.5$ hard\-coded under a distinct registry name\. Serves as the crossover counterpart to copy\_parent: any informative crossover should outperform a plain tensor\-wise midpoint average under identical retrain conditions\.

#### Which operators ship live\.

Only the first subset \{M1, M2, M3, M4, X1, X2, X3, X4\} is active inside the live population described in §[3\.4](https://arxiv.org/html/2605.16727#S3.SS4)\. The remaining nine operators \(M5, M6, copy\_parent, X5, X6, X7, X8, X9, linear\_0\_5\) are characterised in isolation via the experiments/lora\_ops/ benchmark \(see §[15](https://arxiv.org/html/2605.16727#S15)\) but did not enter the main self\-play loop\.

## 9 Sample Generated Problems over Training

Figure [8](https://arxiv.org/html/2605.16727#S9.F8) pairs one baseline\-generated and one population\-generated problem from matched training steps, drawn from the saved per\-step problem archives\. Picks are deterministic: at each step we take a median\-complexity quality\-1\.0 problem, subject to a loose line\-count bound so snippets fit the figure; at step 100 we additionally report the most trivial quality\-1\.0 baseline problem to illustrate the mode\-collapse endpoint the baseline drifts into\. Across all three rows, the baseline’s outputs shrink to increasingly vacuous programs, while the population continues producing substantive code across the three AZR task types\.

Baseline AZR, step 5 \(mixed\)

    def f(s: str) -> bool:
        n = len(s)
        for i in range(n//2):
            if s[i] != s[n-1-i]:
                return False
        return True

AST depth 9 • cyclomatic 3\.0 • 6 LOC

PopuLoRA teacher, step 5 \(infer\-input\)

    def f(arr: list[int]) -> list[int]:
        result = []
        index = 0
        while index < len(arr):
            result.append(str(arr[index]))
            index += 2**(len(arr) - index - 1)
        return result
    print(f([1, 2, 3, 4, 5, 6, 7]))

AST depth 9 • cyclomatic 2\.0 • 8 LOC

Baseline AZR, step 50 \(mixed\)

    def f(input_string: str) -> str:
        vowels = set('aeiouAEIOU')
        result = ''
        for char in input_string:
            if char not in vowels:
                result += char
        return result

AST depth 6 • cyclomatic 3\.0 • 7 LOC

PopuLoRA teacher, step 50 \(infer\-output\)

    def f(nums: list[int]) -> list[int]:
        sorted_nums = sorted(nums)
        result = []
        for num in sorted_nums:
            digit_sum = sum(int(digit) for digit in str(num))
            modified_num = num + digit_sum
            result.append(modified_num)
        return result

AST depth 9 • cyclomatic 3\.0 • 8 LOC

Baseline AZR, step 100 \(mixed\)

    def f(number: int) -> int:
        return number * 3

AST depth 5 • cyclomatic 1\.0 • 2 LOC

PopuLoRA teacher, step 100 \(infer\-function\)

    def f(data: list[dict[str, int]]) -> int:
        state_sum = 0
        for entry in data:
            if 'value' in entry:
                state_sum += entry['value']
        return state_sum

AST depth 8 • cyclomatic 3\.0 • 6 LOC

Figure 8: Generator outputs at matched training steps\. Left: baseline AZR; right: one PopuLoRA teacher\. By step 100 the baseline has collapsed to return number \* 3, while the population is still producing programs with branching and state\.

## 10 Problem\-Space Coverage

#### Structural complexity metrics\.

The four metrics tracked in §[4\.4](https://arxiv.org/html/2605.16727#S4.SS4) \(and used as the descriptor for the CVT archive below\) are computed from the Python AST of each teacher\-generated program\. AST depth is the maximum nesting depth of the parsed syntax tree\. Cyclomatic complexity is McCabe’s count of linearly independent paths through the program, computed as $1 + \#\{\texttt{if}, \texttt{elif}, \texttt{for}, \texttt{while}, \texttt{except}, \texttt{and}, \texttt{or}, \texttt{assert}\}$\. Lines of code counts non\-blank, non\-comment source lines, and variable count is the number of distinct identifiers introduced as variables\.
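The cyclomatic count can be approximated with the standard `ast` module. This is a sketch, not the paper's metric code: counting comprehension `for` clauses and treating an `and`/`or` chain of $n$ operands as $n-1$ decisions are assumptions, chosen because they reproduce the value 3\.0 reported for the step\-50 PopuLoRA sample in Figure 8.

```python
import ast

# Statement-level decision points from the formula above. `elif` parses as a
# nested ast.If, so it is covered by ast.If.
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.Assert,
                   ast.comprehension)  # comprehension `for` is an assumption


def cyclomatic_complexity(source: str) -> int:
    """1 + number of decision points found in the parsed program."""
    count = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DECISION_NODES):
            count += 1
        elif isinstance(node, ast.BoolOp):
            count += len(node.values) - 1   # each extra and/or operand
    return count


DIGIT_SUM_SRC = '''
def f(nums: list[int]) -> list[int]:
    sorted_nums = sorted(nums)
    result = []
    for num in sorted_nums:
        digit_sum = sum(int(digit) for digit in str(num))
        modified_num = num + digit_sum
        result.append(modified_num)
    return result
'''

TRIVIAL_SRC = '''
def f(number: int) -> int:
    return number * 3
'''

print(cyclomatic_complexity(DIGIT_SUM_SRC), cyclomatic_complexity(TRIVIAL_SRC))
```

On the two Figure 8 samples used here the sketch gives 3 and 1, matching the reported 3\.0 and 1\.0.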

Figure [9](https://arxiv.org/html/2605.16727#S10.F9) plots the fraction of CVT archive grid tiles filled by validated teacher\-generated problems, as training progresses\. We report this quantity as a percent: population coverage is read directly from archive/coverage\_pct, and the baseline proxy divides its running archive/total\_problems by the grid\-cell budget of 4096 tiles\. Baseline coverage plateaus early and stays flat; population coverage keeps expanding through the full 200 steps, consistent with the complexity\-growth picture in the main text \(§[4\.4](https://arxiv.org/html/2605.16727#S4.SS4)\)\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x8.png)

Figure 9: Problem\-space coverage\. CVT archive grid coverage \(percent of the 4096\-cell budget\)\. Baseline \(black\) vs\. population \(blue\)\.

## 11 Per\-Type Training Dynamics

Figure [10](https://arxiv.org/html/2605.16727#S11.F10) disaggregates the main\-text training dynamics \(Figure [3](https://arxiv.org/html/2605.16727#S4.F3)\) by AZR task type\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x9.png)

Figure 10: Per\-type breakdown of Figure [3](https://arxiv.org/html/2605.16727#S4.F3)\. Rows: code\_i, code\_o, code\_f\. Columns: solve rate, validity rate\. Baseline \(black\) vs\. population mean \(blue\) with per\-member spread\.

## 12 Solve Rate by Problem Type

Figure [11](https://arxiv.org/html/2605.16727#S12.F11) isolates the solver’s solve rate for each of the three AZR task types\. The baseline reaches near\-perfect solve rate on all three types, consistent with self\-calibration to easy problems\. The population’s solve rate oscillates on each type, with the oscillation frequency varying across types \(fastest on output prediction, slowest on induction\), matching the per\-type dynamics in Figure [10](https://arxiv.org/html/2605.16727#S11.F10)\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x10.png)

Figure 11: Solve rate by problem type\. Baseline \(black\) vs\. population \(blue\) with per\-member spread\. The baseline saturates on all three types; the population oscillates as teachers co\-adapt\.

## 13 Cross\-Evaluation over Training

Figure [12](https://arxiv.org/html/2605.16727#S13.F12) shows per\-student solve\-rate profiles against each teacher at five equispaced training snapshots\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x11.png)

Figure 12: Cross\-evaluation over training\. Parallel coordinates at five equispaced snapshots\. Each vertical axis is one teacher; each line is one student’s solve rate against that teacher\. Flat lines indicate uniform solvers; kinks indicate specialisation against particular teachers\.

## 14 Extended Downstream Evaluation

In the main paper \(Figure [2](https://arxiv.org/html/2605.16727#S4.F2)\) we report both the population mean and the per\-benchmark best for each role\. Table [3](https://arxiv.org/html/2605.16727#S14.T3) provides the numeric complement to the figure\.

Table 3: Downstream pass@1 \(%\)\. Numeric complement to Figure [2](https://arxiv.org/html/2605.16727#S4.F2)\. Mean $\pm$ std across adapters within each role\. Worst/Best are the single weakest/strongest adapter \(per\-benchmark selection on the test set\)\. Best value per column in bold\. The 4T\+4S Worst adapter \(43\.6% aggregate\) still outperforms the baseline \(42\.3%\), indicating that co\-evolution lifts the entire population rather than concentrating gains in a few members\.

Here we provide two complementary views\.

#### Per\-adapter breakdown\.

Figures [13](https://arxiv.org/html/2605.16727#S14.F13) and [14](https://arxiv.org/html/2605.16727#S14.F14) disaggregate the 4T\+4S and 8T\+8S populations into individual adapters\. The per\-adapter view shows that all population members are fairly competent across the board: the spread within each role is modest, indicating that co\-evolution produces a population of generally strong reasoners rather than narrow specialists\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x12.png)

Figure 13: Pass@1 for each of the 4 teachers and 4 students from the 4T\+4S population\. The main text \(Figure [2](https://arxiv.org/html/2605.16727#S4.F2)\) reports the population mean and per\-benchmark best; here we show all individual adapters\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x13.png)

Figure 14: Pass@1 for each of the 8 teachers and 8 students from the 8T\+8S population\.

#### Comparison with full\-finetune AZR\.

Figure [15](https://arxiv.org/html/2605.16727#S14.F15) adds the publicly released Baseline AZR checkpoint \(Zhao et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib28)\), which was produced by full fine\-tuning \(not LoRA\) and trained for 300 gradient steps, 50% more than the 200 steps used by both the Baseline AZR \(LoRA\) and the PopuLoRA population members\. The comparison is therefore not per\-adapter compute\-matched: the full\-finetune baseline has both more parameter updates and full\-rank gradient access\. Despite this advantage, the best PopuLoRA student matches or exceeds the full\-finetune baseline on the majority of benchmarks, particularly on LiveCodeBench and the competition\-level math benchmarks \(AIME, AMC\)\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x14.png) Figure 15: Downstream pass@1 including the full\-finetune Baseline AZR \(300 gradient steps, non\-LoRA\)\. Compare with Figure [2](https://arxiv.org/html/2605.16727#S4.F2), which uses only the per\-adapter compute\-matched LoRA baseline\.

## 15 Full lora\_ops Benchmark Results

The figures below come from the isolated operator\-evaluation sweep at experiments/lora\_ops/\. For each operator, we apply it to a frozen parent adapter at five snapshot steps \(5, 10, 25, 50, 100\); the child then re\-trains for 50 steps on the parent’s task and its pass\-rate curve is recorded\. The main\-paper Figure [6](https://arxiv.org/html/2605.16727#S4.F6) shows the subset of four mutation and four crossover operators that we ship in the live population; here we report the full operator catalog\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x15.png) Figure 16: Mutation\-operator retention across snapshot steps\. Rows: mutation operators M1–M6 plus the copy\_parent control\. Columns: snapshot steps \(10, 25, 50, 100\)\. The parent’s 100\-step learning curve is drawn in grey and the child’s 50\-step retraining curve in colour, with the child’s x\-axis offset by the snapshot step so both live on the same global\-step scale\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x16.png) Figure 17: Crossover\-operator retention across snapshot steps\. Same layout as the mutation figure: rows are X1–X9 plus the linear\_0\_5 plain\-average control; columns are snapshot steps \(10, 25, 50, 100\)\. Parents from the exp\_c1 task\-merging sweep are drawn in grey, child retraining in colour with the snapshot\-step offset\.

## 16 Training Diagnostics
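To make the sweep protocol concrete, the sketch below implements only the two controls named in the captions \(copy\_parent and the linear\_0\_5 plain average\) together with the snapshot\-step offset used on the x\-axis\. It is a hedged illustration under stated assumptions: an adapter is modelled as a plain dict of named matrices \(nested Python lists\), the plain average is assumed to be an element\-wise average of the stored LoRA factors, and the helper names `linear_average`, `to_global_steps`, and `sweep_plan` are hypothetical\. The M1–M6 and X1–X9 operators are not specified in this section and are not reproduced here\.

```python
import copy

SNAPSHOT_STEPS = (5, 10, 25, 50, 100)  # snapshot steps listed in the text
RETRAIN_STEPS = 50                     # child retraining budget from the text


def copy_parent(parent):
    """Control: the child is an independent copy of the parent adapter."""
    return copy.deepcopy(parent)


def linear_average(parent_a, parent_b, weight=0.5):
    """Control: element-wise weighted average of two same-shape adapters.

    Assumption: averaging is done on the stored factor matrices, so the
    child keeps the parents' shapes (and hence the same LoRA rank).
    """
    child = {}
    for name, mat_a in parent_a.items():
        mat_b = parent_b[name]
        child[name] = [
            [weight * x + (1.0 - weight) * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(mat_a, mat_b)
        ]
    return child


def to_global_steps(local_steps, snapshot_step):
    """Shift a child's local retraining steps onto the parent's step axis."""
    return [snapshot_step + step for step in local_steps]


def sweep_plan(operator_names):
    """List (operator, snapshot step, final global step) for every run."""
    return [
        (name, snap, snap + RETRAIN_STEPS)
        for name in operator_names
        for snap in SNAPSHOT_STEPS
    ]


# Toy rank-1 adapters with placeholder values.
parent_a = {"lora_A": [[1.0, 3.0]], "lora_B": [[2.0], [4.0]]}
parent_b = {"lora_A": [[3.0, 5.0]], "lora_B": [[0.0], [2.0]]}
child = linear_average(parent_a, parent_b)
print(child)
print(to_global_steps([0, 25, 50], snapshot_step=10))
print(len(sweep_plan(["copy_parent", "linear_0_5"])), "runs")
```

Note that averaging the factor matrices is not the same as averaging the products of the two adapters' factors; which of the two the linear\_0\_5 control uses is not stated here, so the factor\-wise version above is only one plausible reading\.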

Figure [18](https://arxiv.org/html/2605.16727#S16.F18) shows three standard policy\-gradient diagnostics for both arms: actor gradient norm, policy\-gradient loss, and entropy of the actor’s output distribution\. All three metrics track within the usual range for stable training; the population’s per\-member spread reflects the independent adapters training in parallel\. We omit a KL panel because we train without a KL penalty \(kl\_loss\_coef = 0, kl\_ctrl\.kl\_coef = 0\), so the KL term is identically zero for every step of both arms\.
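For readers unfamiliar with these diagnostics, the following sketch shows the textbook definitions of two of them: the entropy of a softmax distribution over logits and the global L2 norm of a set of gradients\. It assumes the usual definitions and is not taken from the authors' training code; the function names are hypothetical, and the exact reductions \(for example, how entropy is averaged over tokens\) may differ in the actual framework\.

```python
import math


def softmax_entropy(logits):
    """Entropy (in nats) of the softmax distribution over one set of logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def global_grad_norm(grads):
    """L2 norm over all gradient entries, given one flat list per tensor."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


# A uniform distribution over two tokens has entropy log(2).
print(softmax_entropy([0.0, 0.0]))
# Two single-entry gradient tensors with values 3 and 4 have global norm 5.
print(global_grad_norm([[3.0], [4.0]]))
```

A sharply peaked output distribution drives the entropy towards zero, which is why this curve is commonly read as an indicator of how much exploration the policy retains\.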

![Refer to caption](https://arxiv.org/html/2605.16727v1/x17.png) Figure 18: Training diagnostics\. Gradient norm, policy\-gradient loss, entropy\. Baseline \(black\) vs population mean \(blue\) with per\-member spread \(light blue\)\.

## 17 Response Length over Training

Figure [19](https://arxiv.org/html/2605.16727#S17.F19) tracks the mean response length \(in tokens\) over training\. The baseline’s responses shorten steadily as the agent self\-calibrates to trivial problems that require only short programs and short solutions\. The population shows the opposite trend: response length grows throughout training, reaching roughly 3× the baseline’s by the end of the run\. Longer responses have been associated with more elaborate reasoning in prior RLVR work \(Guo et al\., [2025](https://arxiv.org/html/2605.16727#bib.bib39)\); here the effect is driven by teachers generating increasingly complex problems that demand longer programs and more detailed solutions\.

![Refer to caption](https://arxiv.org/html/2605.16727v1/x18.png) Figure 19: Response length over training\. Baseline \(black\) collapses to short responses \(∼250 tokens\); population \(blue\) grows to ∼1000 tokens as problem complexity increases\.