Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games

arXiv cs.AI 06/10/26, 04:00 AM Papers
co-evolution llm adversarial-games strategy-evolution code-evolution multi-agent capture-the-flag
Summary
This paper proposes three co-evolutionary mechanisms (evaluator co-evolution, hierarchical deep evaluation, and weakness pressure) for LLM-driven code evolution in adversarial multi-agent games, achieving state-of-the-art results on the MCTF 2026 maritime capture-the-flag task.
arXiv:2606.10389v1 Announce Type: new Abstract: Recent advances in LLM-driven code evolution have enabled automated discovery by iteratively generating and improving programs. However, applying these methods to adversarial multi-agent games introduces a fundamental challenge: the evaluation landscape shifts as strategies improve, causing fixed evaluators to become unreliable and evolution to stagnate. We propose three mechanisms to address this challenge: evaluator co-evolution, which incorporates discovered champions into the opponent pool; hierarchical deep evaluation, which replaces noisy few-game scores with statistically reliable assessments; and weakness pressure, which dynamically up-weights the most difficult opponents to break through plateaus. We implement these mechanisms within FAMOU, a framework built upon the same foundation-model code-evolution paradigm as OpenEvolve and ShinkaEvolve. On the MCTF 2026 3v3 maritime capture-the-flag task, FAMOU consistently outperforms both baselines under two backbone LLMs, achieving the highest combined score (0.526) and the best generalization to unseen opponents (61.7% win rate), while ablations confirm that each mechanism contributes to performance. Notably, the LLM mutation process generates tactical structures entirely absent from the seed strategies -- including lookahead search and adaptive interception -- demonstrating that code-level evolution can produce nontrivial algorithmic innovations in adversarial settings. The FAMOU-evolved strategy further achieved 1st place in the hardware round-robin and 3rd in simulation at the AAMAS 2026 MCTF Competition, validating its real-world transferability. The optimized implementation and corresponding evaluation codes developed through our evolutionary process are available at: https://github.com/1xiangliu1/FAMOU-CoEvo
Cached at: 06/10/26, 06:14 AM
# Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games
Source: [https://arxiv.org/html/2606.10389](https://arxiv.org/html/2606.10389)
Haoran Li1, Zengle Ge2, Ziyang Zhang2, Xiaomin Yuan2, Yui Lo3, Qianhui Liu4, Bocheng An5, Dongke Rong2, Jiaqun Liu6, Annan Li2, Jianmin Wu2, Dawei Yin2, Dou Shen2,†

###### Abstract

Recent advances in LLM-driven code evolution have enabled automated discovery by iteratively generating and improving programs. However, applying these methods to adversarial multi-agent games introduces a fundamental challenge: the evaluation landscape shifts as strategies improve, causing fixed evaluators to become unreliable and evolution to stagnate. We propose three mechanisms to address this challenge: evaluator co-evolution, which incorporates discovered champions into the opponent pool; hierarchical deep evaluation, which replaces noisy few-game scores with statistically reliable assessments; and weakness pressure, which dynamically up-weights the most difficult opponents to break through plateaus.

We implement these mechanisms within FAMOU, a framework built upon the same foundation-model code-evolution paradigm as OpenEvolve and ShinkaEvolve. On the MCTF 2026 3v3 maritime capture-the-flag task, FAMOU consistently outperforms both baselines under two backbone LLMs, achieving the highest combined score (0.526) and the best generalization to unseen opponents (61.7% win rate), while ablations confirm that each mechanism contributes to performance. Notably, the LLM mutation process generates tactical structures entirely absent from the seed strategies, including lookahead search and adaptive interception, demonstrating that code-level evolution can produce nontrivial algorithmic innovations in adversarial settings. The FAMOU-evolved strategy further achieved 1st place in the hardware round-robin and 3rd in simulation at the AAMAS 2026 MCTF Competition, validating its real-world transferability. The optimized implementation and corresponding evaluation codes developed through our evolutionary process are available at: https://github.com/1xiangliu1/FAMOU-CoEvo

## Introduction

### Background and Motivation

Adversarial multi-agent games remain central in AI research (Busoniu et al. [2008](https://arxiv.org/html/2606.10389#bib.bib1); Hernandez-Leal et al. [2017](https://arxiv.org/html/2606.10389#bib.bib2)), combining non-stationary opponents, exponentially growing joint action spaces, and non-transitive dominance relations (Czarnecki and others [2020](https://arxiv.org/html/2606.10389#bib.bib3)). These challenges are especially pronounced in team-based games where agents must coordinate within their team while competing against adversaries.

We study these challenges through MCTF 2026 (Maritime Capture The Flag), a 3v3 maritime capture-the-flag competition on a 160 m × 80 m field where teams are ranked by total captures. In our early competition attempts, the reinforcement learning policies we trained failed to outperform handwritten heuristic strategies. This motivated a different approach: instead of learning policies from scratch, we use LLM-driven code-level evolution to automatically improve the heuristic code itself, preserving its structural advantages while pushing performance beyond what manual design can achieve.

### Code\-Level Evolution Approach

Inspired by FunSearch (Romera-Paredes et al. [2024](https://arxiv.org/html/2606.10389#bib.bib4)), ELM (Lehman and others [2022](https://arxiv.org/html/2606.10389#bib.bib5)), and Famou-Agent (Li et al. [2025](https://arxiv.org/html/2606.10389#bib.bib6)), we search directly over complete strategy code (500–1700 lines) using LLM-generated semantic mutations and evaluator-driven selection with co-evolutionary dynamics. Unlike end-to-end reinforcement learning, this approach preserves the interpretable structure of heuristic code while allowing the LLM to introduce new tactical logic absent from the seeds, such as lookahead search or dynamic role auctions.

### Contributions

This paper makes the following contributions:

1. A systematic framework comparison for LLM code-level evolution in adversarial games. We compare FAMOU with OpenEvolve (Sharma [2025](https://arxiv.org/html/2606.10389#bib.bib8)) and ShinkaEvolve (Lange et al. [2025](https://arxiv.org/html/2606.10389#bib.bib9)) across two backbone LLMs using standardized MCTF evaluation. FAMOU consistently outperforms both baselines.
2. Evaluator co-evolution with deep evaluation and weakness pressure. We introduce three mechanisms: evaluator co-evolution (incorporating discovered champions into the opponent pool), hierarchical deep evaluation (replacing noisy few-game scores with statistically reliable assessments), and weakness pressure (dynamically up-weighting the most difficult opponents). We assess their contributions through exploratory ablations.
3. Empirical evidence of LLM-generated tactical structures. We document LLM-generated tactical structures absent from the seed code, including H-DWA (lookahead search), A-Lock (role locking), and K-Filter (EWMA-based interception), providing evidence that LLMs can generate nontrivial algorithmic structures in adversarial games.

## Related Work

### Multi\-Agent Adversarial Games

Research on multi-agent adversarial games spans heuristic and learning-based paradigms (Busoniu et al. [2008](https://arxiv.org/html/2606.10389#bib.bib1); Hernandez-Leal et al. [2017](https://arxiv.org/html/2606.10389#bib.bib2)). Heuristic strategies (e.g., hierarchical state machines (Kalyanakrishnan and others [2007](https://arxiv.org/html/2606.10389#bib.bib10)), artificial potential fields (Khatib [1986](https://arxiv.org/html/2606.10389#bib.bib11))) offer natural advantages in determinism and interpretability. Deep RL systems such as AlphaStar (Vinyals and others [2019](https://arxiv.org/html/2606.10389#bib.bib12)), OpenAI Five (Berner and others [2019](https://arxiv.org/html/2606.10389#bib.bib13)), and DeepMind CTF (Jaderberg and others [2019](https://arxiv.org/html/2606.10389#bib.bib14)), together with multi-agent algorithms such as QMIX (Rashid and others [2018](https://arxiv.org/html/2606.10389#bib.bib15)) and MAPPO (Yu and others [2022](https://arxiv.org/html/2606.10389#bib.bib16)) and paradigms such as fictitious self-play (Heinrich et al. [2015](https://arxiv.org/html/2606.10389#bib.bib17)), PSRO (Lanctot and others [2017](https://arxiv.org/html/2606.10389#bib.bib18)), and population-based training (Jaderberg and others [2017](https://arxiv.org/html/2606.10389#bib.bib19)), have achieved superhuman performance in complex games. However, the inherent non-transitivity of adversarial games (Czarnecki and others [2020](https://arxiv.org/html/2606.10389#bib.bib3); Balduzzi and others [2019](https://arxiv.org/html/2606.10389#bib.bib20)) makes it difficult to find globally robust strategies.

### LLM\-Driven Code Generation and Program Synthesis

From Codex (Chen and others [2021](https://arxiv.org/html/2606.10389#bib.bib21)) to DeepSeek-Coder (Guo and others [2024](https://arxiv.org/html/2606.10389#bib.bib22)), the code-generation capabilities of LLMs have advanced rapidly (Li and others [2022](https://arxiv.org/html/2606.10389#bib.bib23), [2023](https://arxiv.org/html/2606.10389#bib.bib24); Roziere and others [2023](https://arxiv.org/html/2606.10389#bib.bib25)). Programming agents such as SWE-agent (Yang and others [2024](https://arxiv.org/html/2606.10389#bib.bib26)) and Voyager (Wang and others [2023](https://arxiv.org/html/2606.10389#bib.bib27)) demonstrate that LLMs can perform multi-step software-engineering and open-world exploration tasks. At the intersection of LLMs and evolutionary search, FunSearch (Romera-Paredes et al. [2024](https://arxiv.org/html/2606.10389#bib.bib4)) was the first to combine LLMs with evolutionary search and surpass human-best solutions in mathematical discovery. ELM (Lehman and others [2022](https://arxiv.org/html/2606.10389#bib.bib5)) proposed LLMs as intelligent mutation operators, ReEvo (Ye and others [2024](https://arxiv.org/html/2606.10389#bib.bib28)) introduced reflective mechanisms to enhance heuristic evolution, and Eureka (Ma and others [2023](https://arxiv.org/html/2606.10389#bib.bib29)) used LLMs to automatically design RL reward functions. Famou-Agent (Li et al. [2025](https://arxiv.org/html/2606.10389#bib.bib6)) combines LLM-based evolution with evaluation-feedback loops in scientific computing, serving as the direct foundation for the framework in this paper.

At the intersection of LLM evolution and multi-agent games, PolicyEvolve (Lv et al. [2025](https://arxiv.org/html/2606.10389#bib.bib7)) was the first to propose a programmatic strategy evolution framework for multiplayer games, achieving continuous strategy improvement through global/local strategy pools and population-based training. COvolve (Sygkounas et al. [2026](https://arxiv.org/html/2606.10389#bib.bib30)) models LLM-generated strategies and environments as a zero-sum game, improving strategy robustness through adversarial co-evolution.

### Evolutionary Search and Quality\-Diversity

Classical genetic programming (GP) (Koza [1992](https://arxiv.org/html/2606.10389#bib.bib31)) has a long history in program synthesis tasks. NEAT (Stanley and Miikkulainen [2002](https://arxiv.org/html/2606.10389#bib.bib32)) evolves neural network structures through topology augmentation, and regularized evolution (Real and others [2019](https://arxiv.org/html/2606.10389#bib.bib33)) combines tournament selection with age regularization for architecture search. The island model (Whitley et al. [1999](https://arxiv.org/html/2606.10389#bib.bib34)) effectively balances exploration and exploitation by maintaining multiple independent subpopulations with periodic elite migration. MAP-Elites (Mouret and Clune [2015](https://arxiv.org/html/2606.10389#bib.bib35)) and NSGA-II (Deb and others [2002](https://arxiv.org/html/2606.10389#bib.bib36)) further expand the dimensions of quality-diversity search.

### Co\-Evolution and Adversarial Evaluation

The central idea of co-evolution is that test cases and evaluated subjects evolve together, forming a continuously escalating “arms race” (Hillis [1990](https://arxiv.org/html/2606.10389#bib.bib37)). Rosin and Belew ([1997](https://arxiv.org/html/2606.10389#bib.bib38)) systematically studied evaluation difficulties in competitive co-evolution, including the “Red Queen effect” and cyclic dominance, and proposed mitigation methods such as competitive fitness sharing. Rating systems like Elo (Elo [1978](https://arxiv.org/html/2606.10389#bib.bib39)) and TrueSkill (Herbrich et al. [2006](https://arxiv.org/html/2606.10389#bib.bib40)) attempt to infer true strategy strength from limited match data, but the non-transitivity of adversarial games (Czarnecki and others [2020](https://arxiv.org/html/2606.10389#bib.bib3)) makes evaluation based on limited opponent pools inherently unreliable. The evaluator co-evolution proposed in this paper can be viewed as an instantiation of competitive co-evolution in the domain of LLM code evolution.

Positioning. Our work differs from prior LLM-based evolution in three respects: (1) the evolution target is a complete 500–1700-line strategy system rather than a compact function; (2) we introduce evaluator co-evolution, deep evaluation, and weakness pressure to combat opponent-pool staleness; (3) we provide controlled experiments with statistical significance testing across two backbone LLMs.

## Problem Definition and Challenges

### MCTF 2026 Competition Rules

MCTF 2026 is a 3v3 maritime capture-the-flag competition. The rules are summarized in Table [1](https://arxiv.org/html/2606.10389#Sx3.T1).

Table 1: Summary of the MCTF 2026 competition rules.

Participants submit Python strategy code for the blue team, which the platform evaluates against all opponents. Strategies are ranked by total captures, meaning that a 5–0 victory is far more valuable than a narrow 1–0 win.

### Core Challenges in Strategy Design

MCTF strategy design faces five core challenges: (1) *high-dimensional action space*: $24^{3} = 13{,}824$ joint action combinations; (2) *role assignment*: three agents must be dynamically assigned to roles such as attacker, defender, and support; (3) *offense-defense balance*: ranking by total captures requires maximizing scoring efficiency while minimizing concessions; (4) *opponent adaptability*: opponent strategies are unknown and diverse, requiring good generalization; (5) *non-transitivity*: no globally optimal strategy exists.

## Method

### Overall Framework

We use FAMOU (Framework for Automated Mutation and Optimization of Utilities) (Li et al. [2025](https://arxiv.org/html/2606.10389#bib.bib6)) to optimize executable MCTF strategy code with LLM-generated semantic mutations. Relative to Famou-Agent (Li et al. [2025](https://arxiv.org/html/2606.10389#bib.bib6)), we adapt the framework to adversarial multi-agent strategy code and add evaluator co-evolution, weakness pressure, and MCTF experiments. The general LLM mutation loop is inherited from Famou-Agent: the system maintains an archive of executable strategy programs, selects high-scoring programs as parents, prompts the LLM to produce complete modified Python files rather than patches, validates the generated files for syntax and API compatibility, evaluates valid candidates in the game simulator, and feeds the resulting performance summary back into the next mutation prompt. Figure [1](https://arxiv.org/html/2606.10389#Sx4.F1) shows the architecture, and Algorithm [1](https://arxiv.org/html/2606.10389#alg1) gives the core process. The remainder of this section details the key components: the evolution target and mutation operator (§4.2), parent selection (§4.3), and the three mechanisms that distinguish FAMOU from vanilla LLM evolution: evaluator co-evolution (§4.4), weakness pressure (§4.5), and hierarchical deep evaluation (§4.6).

![Refer to caption](https://arxiv.org/html/2606.10389v1/figures/fig_framework_v14.png)

Figure 1: FAMOU self-evolving coding-agent framework. Seed strategies undergo LLM-based semantic mutation and evolution; an evaluator screens candidates; hierarchical deep evaluation selects champions; champions are automatically added to the next evaluator’s opponent pool through co-evolution; and weakness pressure dynamically adjusts opponent weights to overcome plateaus.

Algorithm 1: Main Loop for MCTF Strategy Evolution

Require: Seed strategy $S_0$, evaluator $\mathcal{E}$, number of iterations $T$

Ensure: Best strategy $S^{*}$

1. Initialize population $P_0$ containing $S_0$ variants
2. for $t = 1$ to $T$ do
3. for each candidate $c \in P_t$ do
4. Run $c$ against each opponent in $\mathcal{E}$ (3 games/opponent); compute weighted fitness $F(c) = \sum_{i=1}^{n} w_i \cdot \text{metric}(c, o_i)$
5. end for
6. Select elite individuals based on $F(c)$; LLM performs semantic mutation on elite code $\to$ new candidates
7. if champion detected (new best) then
8. Add champion to the evaluator opponent pool (co-evolution); apply weakness pressure: identify and up-weight weakest opponent
9. Trigger deep evaluation (20 games/opponent) for confirmation
10. end if
11. end for
12. return Best strategy $S^{*}$ from deep evaluation

### Evolution Target and LLM Mutation Operator

Each individual is a 500–1700-line Python strategy file with an `Agent_0.compute_action(obs, info)` method. Mutations are performed by LLMs, specifically Gemini-2.5-Flash or DeepSeek-V4-Flash, with temperature 0.8 and a maximum token budget of 64,000. For each mutation, the LLM receives the current strategy, task/API constraints, and per-opponent evaluation feedback, then returns a complete executable strategy file preserving the required interface.
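The syntax and interface check applied to each generated file could look roughly like the sketch below. It is an illustration only: treating `Agent_0` as a class, the helper name `validate_strategy_source`, and the exact rejection messages are assumptions, not details taken from the FAMOU code.

```python
import ast

REQUIRED_CLASS = "Agent_0"          # name from the text; being a class is an assumption
REQUIRED_METHOD = "compute_action"  # name from the text


def validate_strategy_source(source: str) -> tuple[bool, str]:
    """Return (ok, reason) for a candidate strategy file given as text.

    Hypothetical helper: checks that the file parses and that
    Agent_0.compute_action ends with the parameters (obs, info).
    """
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"

    for node in tree.body:
        if isinstance(node, ast.ClassDef) and node.name == REQUIRED_CLASS:
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == REQUIRED_METHOD:
                    params = [a.arg for a in item.args.args]
                    if params[-2:] == ["obs", "info"]:
                        return True, "ok"
                    return False, "compute_action must accept (obs, info)"
            return False, "Agent_0 lacks compute_action"
    return False, "Agent_0 not found"
```

A static check of this kind only filters out files that cannot possibly run; behavioral validity still has to be established in the simulator.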

LLM mutation uses code semantics to make directed structural changes rather than random perturbations. Evolution starts from four handwritten heuristic seed strategies (Table [2](https://arxiv.org/html/2606.10389#Sx4.T2)), covering different tactical styles including potential field avoidance, auction-based assignment, and lane-based offense.

Table 2:Handwritten heuristic seed strategies used to initialize evolution\.
### Parent Selection via UCB

To balance the exploitation of high-fitness programs with the exploration of under-tested candidates, we select parent programs for mutation using an Upper Confidence Bound (UCB) bandit policy (Auer et al. [2002](https://arxiv.org/html/2606.10389#bib.bib42)):

$$\text{UCB}(p_i) \;=\; \bar{s}_i \;+\; c\,\sqrt{\frac{2\,\ln N}{n_i}} \tag{1}$$

where $\bar{s}_i$ is the mean fitness score of program $p_i$, $N$ is the total number of parent selections so far, $n_i$ is the number of times $p_i$ has been selected as a parent, and $c$ is the exploration coefficient. The first term favors high-performing programs, while the second term provides an exploration bonus that grows with under-sampling, ensuring that promising but less-explored programs are revisited.
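As a minimal sketch of Eq. (1), the selection rule can be written in a few lines of Python. The archive layout (a list of dicts), the default `c = 1.0`, and the choice to give never-selected programs an infinite score are assumptions made for illustration; the paper does not specify them.

```python
import math


def ucb_score(mean_fitness: float, n_selected: int,
              total_selections: int, c: float = 1.0) -> float:
    """Eq. (1): mean fitness plus an exploration bonus."""
    if n_selected == 0:
        # Assumption: programs never chosen as a parent are tried first.
        return math.inf
    bonus = math.sqrt(2.0 * math.log(total_selections) / n_selected)
    return mean_fitness + c * bonus


def select_parent(archive: list[dict], c: float = 1.0) -> int:
    """Return the index of the archive entry with the highest UCB score.

    Each entry is a dict with keys 'mean_fitness' and 'n_selected'
    (a hypothetical layout).
    """
    total = sum(p["n_selected"] for p in archive)
    return max(
        range(len(archive)),
        key=lambda i: ucb_score(archive[i]["mean_fitness"],
                                archive[i]["n_selected"], total, c),
    )
```

With `c = 0` the rule reduces to greedy selection on mean fitness; larger `c` shifts selections toward programs that have rarely been mutated.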

### Evaluator Co\-Evolution

Once a parent is selected and mutated, the resulting candidate must be evaluated. In adversarial games, a fixed evaluator can become stale as strategies improve, so we treat the evaluator and strategies as co-evolving populations (Hillis [1990](https://arxiv.org/html/2606.10389#bib.bib37); Rosin and Belew [1997](https://arxiv.org/html/2606.10389#bib.bib38)). We define the evaluator as follows:

Definition 1 (Evaluator). An *evaluator* $\mathcal{E} = (O, \mathbf{w}, F)$ consists of an opponent pool $O = \{o_1, \ldots, o_n\}$, weights $\mathbf{w} = [w_1, \ldots, w_n]$ ($\sum w_i = 1$), and a fitness function $F$:

$$F(c) = \sum_{i=1}^{n} w_i \cdot \text{metric}(c, o_i) \tag{2}$$

where *metric* denotes either win rate or scoring efficiency $\min(\text{avg\_score}/5.0,\; 1.0)$.

Whenever a new champion $C_t$ is produced, it becomes a high-weight “gatekeeper” in $\mathcal{E}_{t+1}$, so later candidates must outperform both original opponents and all previous champions.
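Definition 1 and the champion-insertion step can be sketched as a small data structure. The weight given to a new champion (`champion_weight = 0.3`) and the proportional rescaling of existing weights are assumptions chosen only to keep the weights summing to 1; the text says the champion is "high-weight" without giving a value.

```python
from dataclasses import dataclass, field


def scoring_efficiency(avg_score: float) -> float:
    """Scoring-efficiency metric from Definition 1: min(avg_score / 5.0, 1.0)."""
    return min(avg_score / 5.0, 1.0)


@dataclass
class Evaluator:
    opponents: list[str] = field(default_factory=list)
    weights: list[float] = field(default_factory=list)

    def fitness(self, metric_by_opponent: dict[str, float]) -> float:
        """Eq. (2): weighted sum of per-opponent metrics."""
        return sum(w * metric_by_opponent[o]
                   for o, w in zip(self.opponents, self.weights))

    def add_champion(self, name: str, champion_weight: float = 0.3) -> None:
        """Add a champion as a new opponent (hypothetical weighting scheme).

        Existing weights are scaled by (1 - champion_weight) so that the
        full weight vector still sums to 1.
        """
        scale = 1.0 - champion_weight
        self.weights = [w * scale for w in self.weights] + [champion_weight]
        self.opponents.append(name)
```

For example, starting from two equally weighted opponents, `add_champion("champion_1")` yields weights of 0.35, 0.35, and 0.30 under these assumed settings.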

### Weakness Pressure

Co\-evolution enriches the opponent pool, but does not address which opponents matter most\. When a plateau reflects a persistent weak opponent, we apply weakness pressure to redirect selective pressure:

1. Perform a comprehensive deep evaluation of the current champion to identify the opponent $o_{\text{weak}}$ with the lowest win rate.
2. Double $o_{\text{weak}}$’s weight and renormalize:

   $$w_{\text{weak}} \leftarrow 2\,w_{\text{weak}} \tag{3}$$

   $$w_j \leftarrow w_j \cdot \frac{1 - 2\,w_{\text{weak}}}{1 - w_{\text{weak}}} \quad \forall\, j \neq \text{weak} \tag{4}$$

   Here $w_{\text{weak}}$ on the right-hand sides denotes the weight before the update, so the updated weights again sum to 1.

This makes improvement against the weakest opponent the main path to higher fitness\.
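The reweighting in Eqs. (3) and (4) can be checked with a short sketch. The function name and the guard requiring the old weight to lie strictly between 0 and 0.5 are assumptions added here so that all weights stay positive; the paper does not state how larger weights are handled.

```python
def apply_weakness_pressure(weights: list[float],
                            win_rates: list[float]) -> list[float]:
    """Double the weight of the lowest-win-rate opponent and renormalize.

    weights[i] and win_rates[i] refer to the same opponent. The right-hand
    sides of Eqs. (3) and (4) use the weight *before* the update.
    """
    weak = min(range(len(weights)), key=lambda i: win_rates[i])
    w_old = weights[weak]
    if not 0.0 < w_old < 0.5:
        # Assumption: doubling is only well defined while 2 * w_old < 1.
        raise ValueError("weak opponent weight must lie in (0, 0.5)")

    scale = (1.0 - 2.0 * w_old) / (1.0 - w_old)   # Eq. (4) factor
    new_weights = [w * scale for w in weights]
    new_weights[weak] = 2.0 * w_old               # Eq. (3)
    return new_weights
```

With four opponents at weight 0.25 each, the weakest opponent moves to 0.5 and the other three drop to 1/6 each, so the total remains 1.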

### Hierarchical Deep Evaluation

Both co-evolution and weakness pressure depend on accurate fitness estimates. However, fast evaluation scores can be unreliable: in preliminary experiments, 3-game scores had Spearman $\rho = 0.11$, $p = 0.69$ with 20–40-game deep-evaluation scores. This motivates a hierarchical system:

- Fast evaluation (3 games/opponent): for coarse population ranking during evolution.
- Deep evaluation (20 games/opponent): for true score confirmation after candidate extraction.

Final decisions use deep evaluation; 3\-game scores provide only coarse population ranking\.
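One way to see why 3-game scores are so noisy is to look at the standard error of a win-rate estimate. The sketch below assumes games are independent with a fixed win probability (a simple binomial model); this is an illustrative assumption, not an analysis from the paper.

```python
import math


def win_rate_standard_error(p: float, games: int) -> float:
    """Standard error of an estimated win rate under a binomial model."""
    return math.sqrt(p * (1.0 - p) / games)


# Compare fast (3 games) and deep (20 games) evaluation for an
# evenly matched pairing (true win probability 0.5).
for games in (3, 20):
    print(games, round(win_rate_standard_error(0.5, games), 3))
# 3 0.289
# 20 0.112
```

Under this model a 3-game estimate against a single opponent is uncertain by roughly ±0.29, which is large compared with the score differences between candidates reported in the experiments.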

## Experiments and Analysis

We conduct a systematic empirical study addressing three research questions:

- RQ1: Does FAMOU with the proposed mechanisms (evaluator co-evolution, deep evaluation, weakness pressure) outperform other code-level evolution frameworks (OpenEvolve, ShinkaEvolve) for adversarial games?
- RQ2: What is the contribution of each proposed mechanism (evaluator co-evolution, deep evaluation, weakness pressure)?
- RQ3: How well do evolved strategies generalize to unseen opponents?

### Experimental Setup

#### Experiment Configurations\.

We run six three-run main experiments and four single-run ablations, each for 400 iterations from the four seed strategies in Table [2](https://arxiv.org/html/2606.10389#Sx4.T2). All three frameworks (FAMOU, OpenEvolve, ShinkaEvolve) are tested under two backbone LLMs (DeepSeek-V4-Flash and Gemini-2.5-Flash) with the same seed strategies, iteration budget, and final benchmark protocol. The experiments are organized into four groups:

- Full framework (2 experiments × 3 runs): FAMOU with all three mechanisms.
- Baseline: OpenEvolve (2 experiments × 3 runs): OpenEvolve (Sharma [2025](https://arxiv.org/html/2606.10389#bib.bib8)), a state-of-the-art LLM code evolution framework using single-population evolution with a fixed evaluator.
- Baseline: ShinkaEvolve (2 experiments × 3 runs): ShinkaEvolve (Lange et al. [2025](https://arxiv.org/html/2606.10389#bib.bib9)), an LLM code evolution framework emphasizing sample efficiency and open-ended search, employing island-model populations, novelty rejection sampling, and adaptive LLM ensemble selection.
- Ablation (4 experiments, single run): FAMOU under Gemini-2.5-Flash with one mechanism removed at a time.
  - w/o Co-evolution: remove evaluator co-evolution (fixed opponent pool)
  - w/o Deep Eval: remove hierarchical deep evaluation (use only 3-game scores)
  - w/o Weakness: remove weakness pressure (uniform opponent weights)
  - Vanilla: remove all three mechanisms (basic LLM evolution with fixed evaluator)

The key difference is that neither OpenEvolve nor ShinkaEvolve adds evolved champions to the evaluator opponent pool, applies weakness\-pressure reweighting, or uses hierarchical deep evaluation during evolution\.

#### Benchmark Evaluation Protocol\.

The main comparison is equal-iteration ($T = 400$), not equal-compute, because FAMOU adds deep-evaluation and co-evolution overhead. At every 40-iteration checkpoint, all experiments use the same benchmark:

- 10 benchmark opponents: five seen (hard, Balanced, Agile, Structured, Safe) plus five held-out unseen opponents used only for checkpoint logging and final evaluation.
- 20 games per opponent: sufficient to reduce noise while remaining computationally feasible.

The Combined Score (CS) aggregates win rate and scoring margin:

$$\text{CS} = 0.7 \times \text{WR} + 0.3 \times \min\!\Big(1.0,\; \frac{\max(0,\; \text{margin})}{5.0}\Big) \tag{5}$$

where WR denotes win rate and margin denotes the average score differential. The 0.7/0.3 weighting prioritizes reliable wins while rewarding high-margin scoring; the margin term is clipped at 5.0.
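Eq. (5) translates directly into code. The sketch below is a plain transcription; the function and argument names are chosen here for illustration.

```python
def combined_score(win_rate: float, margin: float) -> float:
    """Combined Score from Eq. (5).

    win_rate: fraction of games won, in [0, 1].
    margin:   average score differential (may be negative).
    """
    # Negative margins contribute nothing; margins above 5.0 are capped.
    margin_term = min(1.0, max(0.0, margin) / 5.0)
    return 0.7 * win_rate + 0.3 * margin_term
```

Because negative margins are floored at zero, a strategy that loses on average can still earn credit from the win-rate term, but never more than 0.7 without a positive margin.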

#### Statistical Testing\.

We use the following statistical tests to assess significance:

- Wilcoxon signed-rank test: based on per-opponent win rate differences (paired, non-parametric).
- Paired $t$-test: on per-opponent margin differences.
- Bootstrap 95% confidence intervals (10,000 resamples): for Combined Score.
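As an illustration of the bootstrap step, the sketch below computes a percentile confidence interval for a mean using only the standard library. The percentile method, the resampling unit (a list of per-opponent scores), and the fixed seed are assumptions; the paper states only the confidence level and the number of resamples.

```python
import random


def bootstrap_ci(values: list[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)          # seeded for reproducibility
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

A call such as `bootstrap_ci(per_opponent_scores)` returns the lower and upper bounds of the 95% interval, where `per_opponent_scores` is a hypothetical list of Combined Scores.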

### Main Results: FAMOU vs\. Baselines \(RQ1\)

Table [3](https://arxiv.org/html/2606.10389#Sx5.T3) reports final iteration-400 performance.

Table 3: Final performance at iteration 400 (mean ± standard deviation over 3 runs, 10 benchmark opponents, 20 games per opponent). CS = Combined Score. The best result in each column is shown in bold.

The key findings are as follows:

- FAMOU outperforms both baselines across both LLM backbones. Under both DeepSeek-V4-Flash and Gemini-2.5-Flash, FAMOU achieves higher CS, WR, and margin than ShinkaEvolve and OpenEvolve. The best configuration, FAMOU under DeepSeek-V4-Flash, reaches CS = 0.526, a relative improvement of 20.4% over the strongest baseline, ShinkaEvolve under the same backbone (CS = 0.437).
- LLM backbone matters. DeepSeek-V4-Flash consistently outperforms Gemini-2.5-Flash across all three frameworks, suggesting that stronger code-generation capability translates directly into better evolved strategies.
- FAMOU generalizes better to unseen opponents. Under DeepSeek-V4-Flash, FAMOU achieves the highest unseen win rate (0.617), while ShinkaEvolve drops sharply from 0.780 seen to 0.342 unseen, suggesting that co-evolutionary evaluation reduces overfitting to training opponents.
- FAMOU sustains productive search. Over 400 iterations, FAMOU under DeepSeek-V4-Flash discovers 10 unique champions with a 22.5% stagnation rate, compared to only 3 champions and 65% stagnation for OpenEvolve under Gemini-2.5-Flash, indicating that the co-evolutionary mechanisms maintain selective pressure against diminishing returns.

### Learning Curves

As shown in Figure [2](https://arxiv.org/html/2606.10389#Sx5.F2), FAMOU variants continue improving through iteration 400, whereas baselines largely stagnate after iteration 240–280.

![Refer to caption](https://arxiv.org/html/2606.10389v1/x1.png)

Figure 2: Combined Score learning curves over 400 iterations, evaluated every 40 iterations on the 10-opponent benchmark with 20 games per opponent. Lines show the mean over 3 runs, and shaded regions show ± one standard deviation. FAMOU variants show sustained improvement, whereas the baselines plateau early.

#### Evolution Trajectory Analysis.

Figure [3](https://arxiv.org/html/2606.10389#Sx5.F3) plots best-program scores every 10 iterations for all 6 FAMOU runs, revealing *punctuated equilibrium*: long stagnation interrupted by sudden breakthroughs. For example, under DeepSeek-V4-Flash, FAMOU Run 1 jumps from 0.377 to 0.617 at iteration 200, a +63.7% increase in a single step. Under Gemini-2.5-Flash, FAMOU Run 3 shows three consecutive breakthroughs between iterations 340–380, increasing rapidly from 0.54 to 0.77. The timing of these breakthroughs varies widely across runs (iterations 80 to 370), reflecting the inherent stochasticity of LLM-driven code generation. Several breakthroughs occur after iteration 300, indicating that 400 iterations may be insufficient for full convergence.

![Refer to caption](https://arxiv.org/html/2606.10389v1/x2.png)

Figure 3: FAMOU evolution trajectories based on the best score at 10-iteration resolution. All runs start from the same four-seed initialization; the best initial seed has CS = 0.145. Stars ($\star$) mark breakthroughs, defined as score increases greater than 0.05. Evolution exhibits punctuated equilibrium, with long plateaus interrupted by sudden jumps.
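The breakthrough criterion in the Figure 3 caption (a score increase greater than 0.05 between consecutive logged points) can be sketched as follows. The example trajectory is hypothetical; only the values 0.377 and 0.617 and the 0.05 threshold come from the text.

```python
def find_breakthroughs(best_scores: list[float],
                       threshold: float = 0.05) -> list[int]:
    """Return indices where the best score rises by more than `threshold`
    relative to the previous logged point."""
    return [
        i for i in range(1, len(best_scores))
        if best_scores[i] - best_scores[i - 1] > threshold
    ]


def relative_increase_pct(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Hypothetical trajectory logged every 10 iterations.
trajectory = [0.30, 0.30, 0.377, 0.377, 0.617]
print(find_breakthroughs(trajectory))                   # [2, 4]
print(round(relative_increase_pct(0.377, 0.617), 1))    # 63.7
```

The second printed value reproduces the +63.7% jump reported for FAMOU Run 1 under DeepSeek-V4-Flash.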

### Ablation Study \(RQ2\)

Table [4](https://arxiv.org/html/2606.10389#Sx5.T4) reports the ablation results, and Figure [4](https://arxiv.org/html/2606.10389#Sx5.F4) shows the corresponding learning curves.

Table 4: Exploratory ablation study (Gemini-2.5-Flash backbone, single run). ΔCS is relative to the FAMOU Full configuration. $p$-values are from Wilcoxon signed-rank tests on per-opponent win rates. The best CS is shown in bold. Because each ablation was run once, results should be interpreted as suggestive rather than conclusive.

![Refer to caption](https://arxiv.org/html/2606.10389v1/x3.png)

Figure 4: Ablation learning curves (Gemini-2.5-Flash, single run). Each panel isolates one mechanism by comparing FAMOU Full (red) against the ablated variant (dashed) and Vanilla baseline (gray). Removing deep evaluation causes the largest degradation.

- Removing deep evaluation produced the largest drop ($\Delta\text{CS} = -0.136$, $p = 0.002$), consistent with the unreliability of 3-game selection.
- The full configuration outperforms the vanilla baseline ($\Delta\text{CS} = +0.089$, $p = 0.020$), suggesting interaction among the three mechanisms.
- The mechanisms exhibit non-additive interactions. The sum of individual drops ($0.071 + 0.136 + 0.097 = 0.304$) exceeds the gap between the full configuration and the vanilla baseline (0.089), indicating that the effects of the mechanisms are not additive and that mechanisms may partially compensate for one another when combined.
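The non-additivity argument is a short piece of arithmetic, reproduced below using only the ΔCS magnitudes quoted in the bullets above. The variable names are illustrative.

```python
# Magnitudes of the single-mechanism CS drops quoted in the text.
individual_drops = [0.071, 0.136, 0.097]
# Gap between the full configuration and the vanilla baseline.
full_vs_vanilla_gap = 0.089

sum_of_drops = round(sum(individual_drops), 3)
print(sum_of_drops)                           # 0.304
print(sum_of_drops > full_vs_vanilla_gap)     # True
print(max(individual_drops) > full_vs_vanilla_gap)  # True
```

If the effects were additive, the sum of the individual drops would equal the full-versus-vanilla gap. The last line also shows that the largest single drop (0.136) alone exceeds that gap.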

### Cross\-Evaluation Analysis

To assess robustness beyond fixed-opponent benchmarks, we conduct a $6 \times 6$ round-robin cross-evaluation among the best strategies produced by each method (20 games per directed matchup, 1,800 total games across 3 runs; Figure [5](https://arxiv.org/html/2606.10389#Sx5.F5)). Under DeepSeek-V4-Flash, FAMOU achieves the highest average win rate (0.698), confirming the benchmark ranking. Notably, the cross-evaluation reveals that benchmark scores do not fully capture pairwise relationships: OpenEvolve under DeepSeek-V4-Flash outperforms ShinkaEvolve in direct matchups despite lower benchmark scores, suggesting different specialization profiles that a diverse opponent pool can expose.
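The game count for the round-robin follows from the number of directed matchups, as the sketch below shows. The strategy labels and the convention of excluding the diagonal when averaging a row are assumptions made for illustration.

```python
from itertools import permutations


def directed_matchups(strategies: list[str]) -> list[tuple[str, str]]:
    """All ordered (blue, red) pairs of distinct strategies."""
    return list(permutations(strategies, 2))


def total_games(n_strategies: int, games_per_matchup: int, runs: int) -> int:
    """Total games in a directed round-robin repeated over several runs."""
    return n_strategies * (n_strategies - 1) * games_per_matchup * runs


def average_win_rate(matrix: list[list[float]], row: int) -> float:
    """Mean of one row of a win-rate matrix, skipping the self-matchup."""
    values = [v for j, v in enumerate(matrix[row]) if j != row]
    return sum(values) / len(values)


# Hypothetical labels: three frameworks under two backbone LLMs.
labels = [f"{fw}-{llm}" for fw in ("FAMOU", "ShinkaEvolve", "OpenEvolve")
          for llm in ("DeepSeek", "Gemini")]
print(len(directed_matchups(labels)))   # 30
print(total_games(6, 20, 3))            # 1800
```

Six strategies give 30 directed matchups; at 20 games each over 3 runs this matches the 1,800 games stated above.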

![Refer to caption](https://arxiv.org/html/2606.10389v1/x4.png)

Figure 5: Cross-evaluation win-rate matrix ($6 \times 6$, averaged over 3 runs with 20 games per directed matchup). Rows indicate the blue player, columns indicate the red opponent, and each cell reports the row strategy’s win rate against the column strategy. FAMOU variants achieve the strongest average performance and consistently outperform the baseline variants.

### Generalization Analysis (RQ3)

Figure [6](https://arxiv.org/html/2606.10389#Sx5.F6) compares seen vs. unseen opponent performance:

- Under DeepSeek-V4-Flash, FAMOU achieves the best generalization with the smallest seen–unseen gap ($0.743\pm 0.141$ vs. $0.617\pm 0.072$), and lies closest to the diagonal.
- Under the same backbone, ShinkaEvolve has high seen performance ($0.780$) but a large generalization gap (unseen $=0.342$), indicating overfitting to training opponents.
- OpenEvolve under Gemini-2.5-Flash shows relatively low performance on both seen and unseen opponents ($0.452$ vs. $0.262$); it does not overfit because it never learns strong strategies.

The per-opponent breakdown in Figure [7](https://arxiv.org/html/2606.10389#Sx5.F7) further confirms this trend: under DeepSeek-V4-Flash, FAMOU achieves $>70\%$ win rate against eight of the ten opponents, while OpenEvolve under Gemini-2.5-Flash struggles against most unseen opponents (many $<30\%$).
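
The seen–unseen gaps discussed above follow directly from the reported means. The sketch below only restates the three configurations quoted in the text (the remaining configurations are not given numerically in this section); the helper name is illustrative.

```python
# (method, backbone) -> (mean seen win rate, mean unseen win rate), as quoted in the text.
reported = {
    ("FAMOU", "DeepSeek-V4-Flash"): (0.743, 0.617),
    ("ShinkaEvolve", "DeepSeek-V4-Flash"): (0.780, 0.342),
    ("OpenEvolve", "Gemini-2.5-Flash"): (0.452, 0.262),
}


def generalization_gap(seen: float, unseen: float) -> float:
    """Positive values mean the strategy does worse on held-out opponents."""
    return seen - unseen


gaps = {cfg: round(generalization_gap(*rates), 3) for cfg, rates in reported.items()}
for (method, backbone), gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{method:13s} {backbone:18s} gap={gap:.3f}")
# FAMOU         DeepSeek-V4-Flash  gap=0.126
# OpenEvolve    Gemini-2.5-Flash   gap=0.190
# ShinkaEvolve  DeepSeek-V4-Flash  gap=0.438
```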

![Refer to caption](https://arxiv.org/html/2606.10389v1/x5.png)
Figure 6: Generalization analysis on the 10-opponent benchmark. Each point represents one method’s mean win rate over 3 runs against the five seen training opponents (x-axis) and the five held-out unseen opponents (y-axis). Points below the diagonal indicate lower performance on unseen opponents than on seen opponents.

![Refer to caption](https://arxiv.org/html/2606.10389v1/x6.png)
Figure 7: Per-opponent win rates on the 10-opponent benchmark, averaged over 3 runs with 20 games per opponent. The left five are seen training opponents; the right five are held-out unseen opponents.

### LLM\-Invented Tactical Innovations

During evolution, the LLM generated decision architectures absent from the seeds (Table [5](https://arxiv.org/html/2606.10389#Sx5.T5)).

Table 5: Representative tactical mechanisms introduced by the LLM mutation process during evolution. None of these mechanisms existed in the seed strategies ($\times$).

For example, H-DWA adapts the Dynamic Window Approach (Fox et al. [1997](https://arxiv.org/html/2606.10389#bib.bib41)) to discrete single-step lookahead by evaluating 23 movement headings over a 1.5-second horizon, a mechanism entirely absent from the seeds.
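
To make the idea of discrete single-step lookahead concrete, here is a minimal sketch. It is not the evolved H-DWA code, which is not shown in this section: the scoring terms, the physical speed, and the danger parameters are all assumptions chosen for illustration. Only the 23 heading offsets, the 1.5-second horizon, and the bearing convention (0 degrees along +y, positive clockwise, as in the navigation helpers of Appendix A.4) come from the paper.

```python
import math

# The 23 heading offsets (degrees) behind actions 0-22.
HEADINGS = [-180, -120, -90, -60, -30, -15, -10, -5,
            -3, -2, -1, 0, 1, 2, 3, 5, 10, 15, 30,
            60, 90, 120, 180]


def project(pos, heading_deg, speed_mps, horizon_s):
    """Straight-line dead reckoning; bearing 0 points along +y, 90 along +x."""
    rad = math.radians(heading_deg)
    step = speed_mps * horizon_s
    return (pos[0] + step * math.sin(rad), pos[1] + step * math.cos(rad))


def lookahead_action(pos, heading_deg, target, enemies,
                     speed_mps=5.0,       # assumed physical speed
                     horizon_s=1.5,       # horizon stated in the paper
                     danger_radius=10.0,  # assumed, mirrors the 10 m tag range
                     danger_weight=5.0):  # assumed penalty weight
    """Score each candidate heading one step ahead and return the best action index."""
    best_idx, best_score = 0, -math.inf
    for idx, offset in enumerate(HEADINGS):
        nxt = project(pos, heading_deg + offset, speed_mps, horizon_s)
        score = -math.dist(nxt, target)  # progress toward the target
        for enemy in enemies:
            gap = math.dist(nxt, enemy)
            if gap < danger_radius:      # penalize ending up near an enemy
                score -= danger_weight * (danger_radius - gap)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


# With no enemies, the best action is the +90 degree offset (index 20),
# which points straight at a target located along +x.
print(lookahead_action((0.0, 0.0), 0.0, (100.0, 0.0), enemies=[]))  # 20
```

Placing an enemy on the direct path changes the chosen heading, which is the behavior a lookahead of this kind is meant to provide.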

## Conclusion

We extended FAMOU (Li et al. [2025](https://arxiv.org/html/2606.10389#bib.bib6)), a self-evolving coding-agent framework, to adversarial multi-agent games by co-evolving the evaluation process alongside the strategies themselves. Our experiments on 3v3 maritime capture-the-flag demonstrate that FAMOU consistently outperforms both OpenEvolve and ShinkaEvolve under two backbone LLMs, with the strongest variant achieving a 68.0% win rate across ten benchmark opponents. Beyond performance gains, the evolution process generates tactical structures, such as lookahead search (H-DWA) and EWMA-based interception (K-Filter), that are entirely absent from the seed strategies, providing evidence that LLMs can serve as directed mutation operators capable of producing nontrivial algorithmic innovations.

The practical impact of these results is further validated in the AAMAS 2026 MCTF Competition, where the FAMOU-evolved strategy placed 1st in the hardware round-robin and 3rd in simulation, demonstrating effective sim-to-real transfer.

A key insight from our work is that *evaluator design may matter more than seed quality*: the same starting strategies produce substantially different outcomes depending on whether the evaluation process adapts alongside the evolving population. In adversarial settings where the opponent landscape shifts continuously, static evaluation leads to overfitting and premature convergence, while co-evolutionary evaluation maintains selective pressure toward general robustness.

Limitations. Our experiments are conducted on a single game domain (MCTF), and ablation experiments use single runs; multi-run ablations and additional game domains are needed for stronger conclusions. Game simulation dominates runtime (each full benchmark checkpoint requires $\sim$10 hours), and over 30,000 LLM calls represent a non-negligible computational cost.

Future directions. Extending FAMOU to other adversarial tasks (e.g., RoboCup, StarCraft micromanagement) and incorporating multi-objective Pareto-frontier evaluation are important next steps. More broadly, these results suggest that the co-evolution of evaluation criteria and solution candidates, a principle well established in evolutionary computation, deserves renewed attention in the era of LLM-driven program synthesis.

## References

- P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2–3), pp. 235–256.
- D. Balduzzi et al. (2019) Open-ended learning in symmetric zero-sum games. In ICML, pp. 434–443.
- C. Berner et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv:1912.06680.
- L. Busoniu, R. Babuska, and B. De Schutter (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. C 38(2), pp. 156–172.
- M. Chen et al. (2021) Evaluating large language models trained on code. arXiv:2107.03374.
- W. M. Czarnecki et al. (2020) Real world games look like spinning tops. In NeurIPS.
- K. Deb et al. (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), pp. 182–197.
- A. E. Elo (1978) The rating of chessplayers, past and present. Arco Pub.
- D. Fox, W. Burgard, and S. Thrun (1997) The dynamic window approach to collision avoidance. IEEE Robotics & Automation Magazine 4(1), pp. 23–33.
- D. Guo et al. (2024) DeepSeek-Coder: when the large language model meets programming – the rise of code intelligence. arXiv:2401.14196.
- J. Heinrich, M. Lanctot, and D. Silver (2015) Fictitious self-play in extensive-form games. In ICML, pp. 805–813.
- R. Herbrich, T. Minka, and T. Graepel (2006) TrueSkill: a Bayesian skill rating system. In NeurIPS.
- P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote (2017) A survey of learning in multiagent environments: dealing with non-stationarity. arXiv:1707.09183.
- W. D. Hillis (1990) Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D 42(1–3), pp. 228–234.
- M. Jaderberg et al. (2017) Population based training of neural networks. arXiv:1711.09846.
- M. Jaderberg et al. (2019) Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science 364(6443), pp. 859–865.
- S. Kalyanakrishnan et al. (2007) Half field offense in RoboCup soccer: a multiagent reinforcement learning case study. In RoboCup.
- O. Khatib (1986) Real-time obstacle avoidance for manipulators and mobile robots. Int. J. Robotics Research 5(1), pp. 90–98.
- J. R. Koza (1992) Genetic programming: on the programming of computers by means of natural selection. MIT Press.
- M. Lanctot et al. (2017) A unified game-theoretic approach to multiagent reinforcement learning. In NeurIPS, pp. 4190–4203.
- R. T. Lange, Y. Imajuku, and E. Cetin (2025) ShinkaEvolve: towards open-ended and sample-efficient program evolution. arXiv:2509.19349.
- J. Lehman et al. (2022) Evolution through large models. arXiv:2206.08896.
- A. Li, C. Wu, Z. Ge, Y. H. Chong, Z. Hou, et al. (2025) The FM agent. arXiv:2510.26144.
- R. Li et al. (2023) StarCoder: may the source be with you!. arXiv:2305.06161.
- Y. Li et al. (2022) Competition-level code generation with AlphaCode. Science 378, pp. 1092–1097.
- M. Lv, H. Liu, Z. Luo, H. Zhang, and J. Ou (2025) PolicyEvolve: evolving programmatic policies by LLMs for multi-player games via population-based training. arXiv:2509.06053.
- Y. J. Ma et al. (2023) Eureka: human-level reward design via coding large language models. arXiv:2310.12931.
- J. Mouret and J. Clune (2015) Illuminating search spaces by mapping elites. arXiv:1504.04909.
- T. Rashid et al. (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In ICML, pp. 4295–4304.
- E. Real et al. (2019) Regularized evolution for image classifier architecture search. In AAAI, Vol. 33, pp. 4780–4789.
- B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. R. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi, P. Kohli, and A. Fawzi (2024) Mathematical discoveries from program search with large language models. Nature 625, pp. 468–475.
- C. D. Rosin and R. K. Belew (1997) New methods for competitive co-evolution. Evolutionary Computation 5(1), pp. 1–29.
- B. Roziere et al. (2023) Code Llama: open foundation models for code. arXiv:2308.12950.
- A. Sharma (2025) OpenEvolve: an open-source evolutionary coding agent. https://github.com/codelion/openevolve
- K. O. Stanley and R. Miikkulainen (2002) Evolving neural networks through augmenting topologies. Evolutionary Computation 10(2), pp. 99–127.
- A. Sygkounas, R. Hazra, A. Persson, P. Zuidberg Dos Martires, and A. Loutfi (2026) COvolve: adversarial co-evolution of large-language-model-generated policies and environments via two-player zero-sum game. arXiv:2603.28386.
- O. Vinyals et al. (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, pp. 350–354.
- G. Wang et al. (2023) Voyager: an open-ended embodied agent with large language models. arXiv:2305.16291.
- D. Whitley, S. Rana, and R. Heckendorn (1999) The island model genetic algorithm: on separability, population size and convergence. J. Comput. Inf. Technol. 7(1), pp. 33–47.
- J. Yang et al. (2024) SWE-agent: agent-computer interfaces enable automated software engineering. arXiv:2405.15793.
- H. Ye et al. (2024) Large language models as hyper-heuristics with reflective evolution. In NeurIPS.
- C. Yu et al. (2022) The surprising effectiveness of PPO in cooperative multi-agent games. In NeurIPS.

Appendix: Supplementary Material

## Appendix A MCTF Task Specification

### A.1 Game Environment

MCTF 2026 is a 3v3 maritime capture-the-flag game played on a 160 m $\times$ 80 m rectangular field. Table [6](https://arxiv.org/html/2606.10389#A1.T6) summarizes the environment parameters.

Table 6: MCTF 2026 environment specification.

### A.2 Agent Interface

Each strategy implements an `Agent_0` class inheriting from `BaseAgentPolicy`, with a single decision method:

    class Agent_0(BaseAgentPolicy):
        def compute_action(self, obs, info) -> int:
            ...  # body omitted; returns a discrete action index (0-23)

### A.3 Observation Space

The global state dictionary (`info[self.id]["global_state"]`) provides the following keys:

Table 7: Observation space: per-agent and global keys.

| Key | Type | Description |
|---|---|---|
| *Per-agent keys* (indexed by `(agent_id, key)`): | | |
| `pos` | $[x,y]$ | Position coordinates |
| `heading` | float | Current heading (degrees) |
| `speed` | float | Current speed |
| `has_flag` | bool | Carrying enemy flag |
| `on_side` | bool | In own territory |
| `is_tagged` | bool | Currently tagged (respawning) |
| `is_disabled` | bool | Disabled by PowerPlay |
| `tagging_cooldown` | float | Seconds until can tag again |
| *Global keys*: | | |
| `blue_flag_pos` | $[x,y]$ | Blue flag position |
| `red_flag_pos` | $[x,y]$ | Red flag position |
| `blue_team_score` | float | Blue team total score |
| `red_team_score` | float | Red team total score |

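
As a small illustration of the key layout in Table 7, the sketch below reads per-agent and global entries from a global-state dictionary. The tuple-keyed access pattern mirrors the seed strategy in Appendix D; the helper name and the snapshot values are made up for the example.

```python
PER_AGENT_KEYS = ("pos", "heading", "speed", "has_flag", "on_side",
                  "is_tagged", "is_disabled", "tagging_cooldown")


def read_agent_state(global_state, agent_id):
    """Collect one agent's entries; keys missing from the snapshot map to None."""
    return {key: global_state.get((agent_id, key)) for key in PER_AGENT_KEYS}


# Made-up snapshot: per-agent entries use (agent_id, key) tuples,
# global entries use plain string keys.
gs = {
    ("agent_0", "pos"): [12.0, 40.0],
    ("agent_0", "heading"): 90.0,
    ("agent_0", "has_flag"): False,
    "red_flag_pos": [160.0, 40.0],
    "blue_team_score": 0.0,
}

me = read_agent_state(gs, "agent_0")
print(me["pos"], me["heading"], me["speed"])  # [12.0, 40.0] 90.0 None
print(gs["red_flag_pos"])                     # [160.0, 40.0]
```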
### A.4 Action Space

The action space consists of 24 discrete actions (Table [8](https://arxiv.org/html/2606.10389#A1.T8)). Actions 0–22 move at speed 1.0 with the corresponding heading offset; action 23 stops the agent.

Table 8: Discrete action space: 23 heading actions + stop.

The following navigation helpers are provided to all strategies:

    import math

    import numpy as np

    # Heading offsets (degrees) for actions 0-22; action 23 is "stop".
    _HEADINGS = [-180, -120, -90, -60, -30, -15, -10, -5,
                 -3, -2, -1, 0, 1, 2, 3, 5, 10, 15, 30,
                 60, 90, 120, 180]


    def bearing_to_action(rel_bearing, speed=1.0):
        """Map a relative bearing (degrees) to the nearest heading action."""
        if speed == 0:
            return 23  # stop
        rb = rel_bearing
        best_idx, best_diff = 0, float('inf')
        for i, h in enumerate(_HEADINGS):
            # Absolute angular difference, wrapped into [0, 180].
            diff = abs((rb - h + 540.0) % 360.0 - 180.0)
            if diff < best_diff:
                best_diff, best_idx = diff, i
        return best_idx


    def navigate_to(target, my_pos, my_heading):
        """Return the action that steers from my_pos toward target."""
        d = np.asarray(target) - np.asarray(my_pos)
        # Absolute bearing measured from the +y axis: atan2(dx, dy).
        ab = math.degrees(math.atan2(d[0], d[1]))
        return bearing_to_action(
            (ab - my_heading + 180) % 360 - 180)

## Appendix B Evolution Configuration

### B.1 FAMOU Hyperparameters

Table [9](https://arxiv.org/html/2606.10389#A2.T9) lists the hyperparameters used for all FAMOU experiments.

Table 9: FAMOU hyperparameter configuration.

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| *Evolution* | | | |
| Max iterations | 400 | Checkpoint interval | 40 |
| Strategy | `adaptive_cluster` | Seed | 42 |
| *Island model* | | | |
| Num islands | 4 | Island size | 50 |
| Migration topology | ring | Migration interval | 15 |
| Migration size | 2 | Reset interval | 60 |
| *Co-evolution (epoch)* | | | |
| Epoch interval | 50 | Deep eval top-$k$ | 3 |
| Deep eval games | 20/opp | Weakness pressure | true |
| *LLM configuration* | | | |
| Backbone (config 1) | Gemini-2.5-Flash | Backbone (config 2) | DeepSeek-V4-Flash |
| Temperature | 0.8 | Max tokens | 64,000 |
| Timeout | 300 s | Mutation types | full, cross |
| *Fast evaluation* | | | |
| Games per opponent | 3 | Match duration | 600 s |
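
The fast and deep evaluation budgets in Table 9 suggest a two-stage selection: score every candidate cheaply, then re-evaluate only the top-$k$ with many more games. The sketch below is one plausible reading of that scheme, not FAMOU's implementation. The simulator is replaced by a toy Bernoulli model in which each candidate has a single hidden win probability, and all function names and candidate values are hypothetical.

```python
import random
from statistics import mean

FAST_GAMES, DEEP_GAMES, TOP_K = 3, 20, 3  # values from Table 9


def play(true_p, n_games, rng):
    """Toy stand-in for the simulator: fraction of n_games won."""
    return sum(rng.random() < true_p for _ in range(n_games)) / n_games


def evaluate(true_p, n_opponents, games_per_opponent, rng):
    """Mean win rate over opponents (toy model: same true_p against each)."""
    return mean(play(true_p, games_per_opponent, rng) for _ in range(n_opponents))


def hierarchical_select(candidates, n_opponents, rng):
    """Fast-evaluate all candidates, then deep-evaluate the top-k shortlist."""
    fast = {name: evaluate(p, n_opponents, FAST_GAMES, rng)
            for name, p in candidates.items()}
    shortlist = sorted(fast, key=fast.get, reverse=True)[:TOP_K]
    deep = {name: evaluate(candidates[name], n_opponents, DEEP_GAMES, rng)
            for name in shortlist}
    champion = max(deep, key=deep.get)
    return champion, shortlist, deep


rng = random.Random(42)
# Made-up hidden win probabilities for five candidate strategies.
candidates = {f"cand_{i}": p for i, p in enumerate((0.35, 0.45, 0.50, 0.55, 0.60))}
champion, shortlist, deep = hierarchical_select(candidates, n_opponents=5, rng=rng)
print("shortlist:", shortlist)
print("champion:", champion, "deep score:", round(deep[champion], 3))
```

The design point is that the expensive 20-game evaluation is spent only on the shortlist, so the champion is chosen on a much less noisy estimate without deep-evaluating every candidate.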

### B.2 Benchmark Opponent Pool

Table [10](https://arxiv.org/html/2606.10389#A2.T10) lists the 10 fixed benchmark opponents used for checkpoint evaluation throughout all experiments.

Table 10: Benchmark opponent pool (fixed across all experiments).

| ID | Type | Approx. strength | Description |
|---|---|---|---|
| *Seen during evolution (training opponents):* | | | |
| hard | Built-in | - | Platform default hard bot |
| Balanced | Heuristic | $\sim$100% | Role assignment + potential field |
| Agile | Heuristic | $\sim$95% | Distributed auction + vortex |
| Structured | Heuristic | $\sim$90% | Three-lane + zone defense |
| Safe | Heuristic | $\sim$85% | Beam search return |
| *Unseen (held-out for evaluation only):* | | | |
| Reactive | Heuristic | $\sim$0.53 | Auction + heat-driven urgency |
| Orbital | Heuristic | $\sim$0.73 | Tangential sliding + density lanes |
| Vortex | Heuristic | $\sim$0.80 | Adaptive vortex + tactical handoff |
| Forcefield | Heuristic | $\sim$0.92 | Quadratic intercept + social force |
| Elite | Heuristic | $\sim$0.96 | P-Flow + multi-factor adaptive |

## Appendix C LLM Mutation Prompts

This section presents the prompt templates used to instruct the backbone LLM during evolution.

### C.1 Task Description Prompt

The following task description is included in every mutation prompt, providing the LLM with game rules and interface specification:

    OBJECTIVE:
    Design a heuristic policy that MAXIMIZES AVERAGE
    CAPTURES PER GAME in MCTF 2026.

    *** RANKING is by TOTAL CAPTURES -- not win/loss! ***
    A 5-0 win is 5x more valuable than 1-0 win.

    KEY INSIGHT -- SCORING EFFICIENCY:
    1. MINIMIZE WASTED TIME:
       - Shortest path to enemy flag and back
       - After scoring, immediately cycle next attacker
       - Respawning wastes ~15s; avoid unnecessary tags
    2. SMART ROLE ALLOCATION:
       - 2 attackers + 1 defender is a solid baseline
       - During POWERPLAY (3v2): shift to 3 attackers
    3. EFFICIENT FLAG RUNS:
       - Go to NEAREST scoring corner (0,0) or (0,80)
       - Use lane switching to avoid defenders
    4. COORDINATION OVER RAW SPEED:
       - One agent drawing defenders while another grabs

    GAME RULES:
    FIELD: 160m x 80m. Blue x<80, Red x>80.
    Blue Flag: (0,40) | Red Flag: (160,40)
    Scoring Zones: 20m radius around (0,0) and (0,80).
    TAG: 10m in own territory -> respawn (60s cooldown)
    FLAG GRAB: within 10m of enemy flag (+0.1 pts)
    CAPTURE: flag to own corner (+1.0 pts)
    POWERPLAY: ~20% time, one enemy disabled

    INTERFACE:
    Class: Agent_0 (inherits BaseAgentPolicy)
    Method: compute_action(obs, info) -> int (0-23)
    Actions 0-22: Speed 1.0, heading offsets
    Action 23: Stop

    HARD REQUIREMENTS: pure Python + numpy only.
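
The geometric rules in the prompt can be written down as a few predicates, which is a convenient way to sanity-check a strategy's assumptions about the field. This is a sketch from the blue team's point of view; the function names are hypothetical, and treating the stated radii as inclusive bounds is an assumption, since the prompt does not say.

```python
import math

MIDLINE_X = 80.0                                # Blue x < 80, Red x > 80
RED_FLAG = (160.0, 40.0)
BLUE_SCORING_CORNERS = [(0.0, 0.0), (0.0, 80.0)]
SCORING_RADIUS = 20.0                           # metres, from the prompt
GRAB_RADIUS = 10.0                              # metres, from the prompt


def on_blue_side(pos):
    """True when the position lies in blue territory."""
    return pos[0] < MIDLINE_X


def can_grab_red_flag(pos):
    """True when a blue agent is within grab range of the red flag."""
    return math.dist(pos, RED_FLAG) <= GRAB_RADIUS  # inclusive bound is assumed


def in_blue_scoring_zone(pos):
    """True when the position lies inside either blue scoring zone."""
    return any(math.dist(pos, corner) <= SCORING_RADIUS
               for corner in BLUE_SCORING_CORNERS)


print(on_blue_side((79.0, 10.0)))           # True
print(can_grab_red_flag((152.0, 40.0)))     # True  (8 m from the flag)
print(in_blue_scoring_zone((10.0, 10.0)))   # True  (about 14.1 m from (0, 0))
print(in_blue_scoring_zone((30.0, 40.0)))   # False (50 m from either corner)
```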

### C.2 Full Rewrite Mutation Prompt

For structural changes, the LLM is asked to produce a complete rewritten strategy file. We use five prompt variants, sampled stochastically:

1. Default: “Rewrite the program to improve its performance on the specified metrics.”
2. Different algorithm: “Design a completely different algorithm approach to solve the same problem.”
3. Context-motivated: “Create a novel algorithm that draws inspiration from the provided context programs but implements a fundamentally different approach.”
4. Structural redesign: “Redesign the program with a focus on restructuring the core algorithmic components.”
5. Parametric: “Focus on tuning constants, thresholds, and parameters while keeping the overall algorithmic structure.”

All variants require the LLM to preserve the `EVOLVE-BLOCK-START`/`END` markers and maintain the same input/output interface.
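
The variant sampling described above could be sketched as follows. The instruction texts are quoted from the list; everything else is an assumption made for illustration, including uniform sampling, the prompt layout, the dictionary keys, and the comment syntax used around the markers in the example.

```python
import random

# Instruction texts quoted from the five variants; the keys are hypothetical.
PROMPT_VARIANTS = {
    "default": "Rewrite the program to improve its performance on the specified metrics.",
    "different_algorithm": "Design a completely different algorithm approach to solve the same problem.",
    "context_motivated": ("Create a novel algorithm that draws inspiration from the provided "
                          "context programs but implements a fundamentally different approach."),
    "structural_redesign": ("Redesign the program with a focus on restructuring the core "
                            "algorithmic components."),
    "parametric": ("Focus on tuning constants, thresholds, and parameters while keeping the "
                   "overall algorithmic structure."),
}


def build_mutation_prompt(task_description, parent_code, rng):
    """Pick one variant (uniform sampling is an assumption) and assemble a prompt."""
    name = rng.choice(list(PROMPT_VARIANTS))
    prompt = (f"{task_description}\n\n"
              f"INSTRUCTION:\n{PROMPT_VARIANTS[name]}\n\n"
              f"CURRENT PROGRAM:\n{parent_code}")
    return name, prompt


def preserves_markers(code):
    """Check that a rewritten program still contains both evolve-block markers."""
    return "EVOLVE-BLOCK-START" in code and "EVOLVE-BLOCK-END" in code


parent = "# EVOLVE-BLOCK-START\nclass Agent_0: ...\n# EVOLVE-BLOCK-END"
name, prompt = build_mutation_prompt("OBJECTIVE: ...", parent, random.Random(0))
print(name)
print(preserves_markers(parent))  # True
```

A marker check of this kind is a cheap guard to run on each LLM response before spending simulation time on it.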

## Appendix D Seed Strategy Code

Listing [1](https://arxiv.org/html/2606.10389#LST1) shows the complete seed strategy used to initialize all experiments. This simple heuristic implements a 2-attacker + 1-defender scheme with no evasion, no coordination, and no advanced tactics, intentionally providing room for the evolutionary algorithm to improve.

Listing 1: Complete seed strategy (`seed_simple.py`, 217 lines). Role assignment is static (first agent = defender), attackers navigate directly to the enemy flag, and the defender patrols near its own flag.

    import math

    import numpy as np

    from pyquaticus.base_policies.base_policy import BaseAgentPolicy

    _HEADINGS = [-180, -120, -90, -60, -30, -15, -10, -5,
                 -3, -2, -1, 0, 1, 2, 3, 5, 10, 15, 30,
                 60, 90, 120, 180]

    BLUE_CORNERS = [np.array([0., 0.]), np.array([0., 80.])]
    BLUE_FLAG_HOME = np.array([0.0, 40.0])
    RED_FLAG_HOME = np.array([160.0, 40.0])
    FIELD_W, FIELD_H = 160.0, 80.0
    TAG_RANGE = 10.0


    def angle180(a):
        """Wrap an angle in degrees into (-180, 180]."""
        a = a % 360.0
        return a - 360.0 if a > 180.0 else a


    def bearing_to_action(rel_bearing):
        """Index of the heading offset closest to rel_bearing."""
        rel_bearing = angle180(rel_bearing)
        best_idx, best_diff = 0, float('inf')
        for i, h in enumerate(_HEADINGS):
            d = abs(angle180(rel_bearing - h))
            if d < best_diff:
                best_diff, best_idx = d, i
        return best_idx


    def nav_to(target, my_pos, my_heading):
        """Action that steers toward target; stop (23) when within 0.5 m."""
        d = np.asarray(target) - np.asarray(my_pos)
        if np.linalg.norm(d) < 0.5:
            return 23
        abs_bear = math.degrees(
            math.atan2(d[0], d[1])) % 360.0
        rel_bear = angle180(abs_bear - my_heading)
        return bearing_to_action(rel_bear)


    def nearest_corner(pos, corners):
        return min(corners,
                   key=lambda c: np.linalg.norm(c - np.asarray(pos)))


    class Agent_0(BaseAgentPolicy):
        def __init__(self, id, env):
            super().__init__(id, env)
            self.team_ids = []
            self.enemy_ids = []
            self.is_blue = True
            self.my_corners = BLUE_CORNERS
            self.enemy_flag_home = RED_FLAG_HOME
            self.my_flag_home = BLUE_FLAG_HOME
            self.role = "attacker"
            self._initialized = False

        def _init_once(self, gs):
            if self._initialized:
                return
            idx = int(self.id.split('_')[1])
            if idx < 3:
                self.is_blue = True
                self.team_ids = [f'agent_{i}' for i in range(3)]
                self.enemy_ids = [f'agent_{i}' for i in range(3, 6)]
            else:
                self.is_blue = False
                self.team_ids = [f'agent_{i}' for i in range(3, 6)]
                self.enemy_ids = [f'agent_{i}' for i in range(3)]
            # Static roles: the first agent on the team defends.
            self.role = "defender" if self.id == self.team_ids[0] \
                else "attacker"
            self._initialized = True

        def compute_action(self, obs, info):
            gs = info.get(self.id, {}).get('global_state', {})
            if not gs:
                return 23
            self._init_once(gs)
            my_pos = np.asarray(gs.get((self.id, 'pos')), dtype=float)
            my_heading = float(gs.get((self.id, 'heading'), 0.0))
            my_has_flag = bool(gs.get((self.id, 'has_flag'), False))
            if gs.get((self.id, 'is_tagged'), False):
                return 23
            # Carrying the flag: head for the nearest scoring corner.
            if my_has_flag:
                return nav_to(nearest_corner(my_pos, self.my_corners),
                              my_pos, my_heading)
            # Chase an enemy flag carrier while on our own side.
            for eid in self.enemy_ids:
                if gs.get((eid, 'has_flag'), False):
                    epos = gs.get((eid, 'pos'))
                    if epos is not None and \
                            bool(gs.get((self.id, 'on_side'), True)):
                        return nav_to(np.asarray(epos), my_pos, my_heading)
            enemy_active = sum(1 for eid in self.enemy_ids
                               if not gs.get((eid, 'is_tagged'), False)
                               and not gs.get((eid, 'is_disabled'), False))
            # Fewer than three active enemies: every agent attacks.
            if enemy_active < 3:
                flag = gs.get('red_flag_pos' if self.is_blue
                              else 'blue_flag_pos')
                return nav_to(np.asarray(flag) if flag
                              else self.enemy_flag_home,
                              my_pos, my_heading)
            if self.role == "defender":
                return self._defend(gs, my_pos, my_heading)
            return self._attack(gs, my_pos, my_heading)

        def _attack(self, gs, my_pos, my_heading):
            flag = gs.get('red_flag_pos' if self.is_blue
                          else 'blue_flag_pos')
            return nav_to(np.asarray(flag) if flag
                          else self.enemy_flag_home,
                          my_pos, my_heading)

        def _defend(self, gs, my_pos, my_heading):
            # Find the closest untagged enemy on our half of the field.
            closest, closest_d = None, float('inf')
            for eid in self.enemy_ids:
                if gs.get((eid, 'is_tagged'), False):
                    continue
                epos = gs.get((eid, 'pos'))
                if epos is None:
                    continue
                epos = np.asarray(epos, dtype=float)
                on_our = (epos[0] < 80) if self.is_blue else (epos[0] > 80)
                if on_our:
                    d = np.linalg.norm(epos - my_pos)
                    if d < closest_d:
                        closest_d, closest = d, epos
            if closest is not None and closest_d < 50.0:
                return nav_to(closest, my_pos, my_heading)
            # Otherwise hold a patrol point 15 m in front of our flag.
            patrol = np.asarray(self.my_flag_home)
            offset = np.array([15., 0.]) if self.is_blue \
                else np.array([-15., 0.])
            patrol = np.clip(patrol + offset, [2, 2], [158, 78])
            if np.linalg.norm(patrol - my_pos) < 3.0:
                return 23
            return nav_to(patrol, my_pos, my_heading)

## Appendix E Evolved Champion Strategy (Excerpt)

Listing [2](https://arxiv.org/html/2606.10389#LST2) shows an excerpt from a late-stage evolved champion (epoch 8, iteration 258), highlighting three LLM-generated innovations absent from the seed:

1. Dynamic role assignment (`_get_dynamic_roles`): Roles are reassigned every tick based on agent proximity to strategic points, replacing the seed’s static first-agent-is-defender scheme.
2. Avoidance waypoint calculation (`calculate_avoidance_waypoint`): Flag carriers and primary attackers compute temporary waypoints to steer around enemies blocking their path, a mechanism entirely absent from the seed.
3. Defender patrol system: The defender cycles through multiple patrol points along the midline (configurable via `DEFENDER_PATROL_Y_LANES`), rather than camping at a single fixed position.

The mutation log (preserved as comments) documents the LLM’s reasoning:

Listing 2: Evolved champion excerpt (epoch 8, iteration 258). Comments at the top document the LLM’s mutation reasoning. Only key evolved components are shown; the full file is $\sim$500 lines.

    DEFENDER_PATROL_X_OFFSET = 10.0
    DEFENDER_PATROL_Y_LANES = [20.0, 40.0, 60.0]
    SUPPORT_ATTACKER_TRAIL_DIST = 30.0
    AVOID_RANGE = 30.0
    IN_THE_WAY_ANGLE_THRESHOLD = 30.0
    AVOID_STEER_ANGLE = 45.0
    AVOID_WAYPOINT_DIST = 50.0


    def calculate_avoidance_waypoint(
            my_pos, primary_target, enemy_pos):
        """Temporary waypoint to steer around enemy."""
        vec_to_enemy = enemy_pos - my_pos
        vec_to_target = primary_target - my_pos
        if np.linalg.norm(vec_to_target) < 1.0:
            return primary_target
        bear_enemy = math.degrees(
            math.atan2(vec_to_enemy[0], vec_to_enemy[1]))
        bear_target = math.degrees(
            math.atan2(vec_to_target[0], vec_to_target[1]))
        # angle180 is the wrap-to-(-180, 180] helper from Listing 1.
        angle_diff = angle180(bear_enemy - bear_target)
        if angle_diff > 0:
            new_bear = bear_target + AVOID_STEER_ANGLE
        else:
            new_bear = bear_target - AVOID_STEER_ANGLE
        wp_x = my_pos[0] + AVOID_WAYPOINT_DIST * \
            math.sin(math.radians(new_bear))
        wp_y = my_pos[1] + AVOID_WAYPOINT_DIST * \
            math.cos(math.radians(new_bear))
        return np.clip([wp_x, wp_y], [2, 2], [158, 78])


    class Agent_0(BaseAgentPolicy):
        def _get_dynamic_roles(self, gs):
            """Assign roles based on current positions."""
            # Only agents that are not tagged, disabled, or carrying the flag.
            active = []
            for tid in self.team_ids:
                if not gs.get((tid, 'is_tagged'), False) \
                        and not gs.get((tid, 'is_disabled'), False) \
                        and not gs.get((tid, 'has_flag'), False):
                    pos = gs.get((tid, 'pos'))
                    if pos is not None:
                        active.append(
                            (tid, np.asarray(pos, dtype=float)))
            roles = {}
            if not active:
                return roles
            # Closest to our own flag defends.
            active.sort(key=lambda x:
                        np.linalg.norm(x[1] - self.my_flag_home))
            roles[active[0][0]] = "defender"
            remaining = active[1:]
            if remaining:
                # Of the rest, the one closest to the enemy flag leads the attack.
                remaining.sort(key=lambda x:
                               np.linalg.norm(x[1] - self.enemy_flag_home))
                roles[remaining[0][0]] = "primary_attacker"
                if len(remaining) > 1:
                    roles[remaining[1][0]] = "support_attacker"
            return roles
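
To show what the dynamic role assignment produces, the sketch below restates the sorting logic of `_get_dynamic_roles` as a standalone function over plain tuples. The function name, the input format, and the agent positions are made up for the example; filtering out tagged, disabled, or flag-carrying agents is assumed to have happened before the call.

```python
import math


def assign_roles(positions, my_flag_home, enemy_flag_home):
    """positions: {agent_id: (x, y)} for the agents currently available."""
    roles = {}
    if not positions:
        return roles
    # The agent closest to our own flag becomes the defender.
    by_home = sorted(positions, key=lambda a: math.dist(positions[a], my_flag_home))
    roles[by_home[0]] = "defender"
    # Of the rest, the one closest to the enemy flag leads the attack.
    rest = sorted(by_home[1:], key=lambda a: math.dist(positions[a], enemy_flag_home))
    for agent, role in zip(rest, ("primary_attacker", "support_attacker")):
        roles[agent] = role
    return roles


blue_flag, red_flag = (0.0, 40.0), (160.0, 40.0)
positions = {"agent_0": (10.0, 40.0), "agent_1": (70.0, 40.0), "agent_2": (120.0, 40.0)}
print(assign_roles(positions, blue_flag, red_flag))
# {'agent_0': 'defender', 'agent_2': 'primary_attacker', 'agent_1': 'support_attacker'}

# With a single available agent (for example, after two tags) only a defender remains.
print(assign_roles({"agent_1": (70.0, 40.0)}, blue_flag, red_flag))
# {'agent_1': 'defender'}
```

Unlike the seed's fixed first-agent-is-defender rule, the roles here follow the agents' positions, so the assignment changes as agents move or get tagged.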