The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators

arXiv cs.LG Papers

Summary

This paper introduces the Red Queen Gödel Machine (RQGM), an evolutionary framework for recursive self-improvement under non-stationary utilities, where agents and evaluators co-evolve, improving performance on coding tasks, scientific writing, and Olympiad-level proof grading.


# The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators
Source: [https://arxiv.org/abs/2606.26294](https://arxiv.org/abs/2606.26294)
Authors: [Alex Iacob](https://arxiv.org/search/cs?searchtype=author&query=Iacob,+A), [Andrej Jovanović](https://arxiv.org/search/cs?searchtype=author&query=Jovanovi%C4%87,+A), [William F. Shen](https://arxiv.org/search/cs?searchtype=author&query=Shen,+W+F), [Daniel Burkhardt](https://arxiv.org/search/cs?searchtype=author&query=Burkhardt,+D), [Meghdad Kurmanji](https://arxiv.org/search/cs?searchtype=author&query=Kurmanji,+M), [Nurbek Tastan](https://arxiv.org/search/cs?searchtype=author&query=Tastan,+N), [Lorenzo Sani](https://arxiv.org/search/cs?searchtype=author&query=Sani,+L), [Niccolò Alberto Elia Venanzi](https://arxiv.org/search/cs?searchtype=author&query=Venanzi,+N+A+E), [Ambroise Odonnat](https://arxiv.org/search/cs?searchtype=author&query=Odonnat,+A), [Zeyu Cao](https://arxiv.org/search/cs?searchtype=author&query=Cao,+Z), [Bill Marino](https://arxiv.org/search/cs?searchtype=author&query=Marino,+B), [Xinchi Qiu](https://arxiv.org/search/cs?searchtype=author&query=Qiu,+X), [Nicholas D. Lane](https://arxiv.org/search/cs?searchtype=author&query=Lane,+N+D)

[View PDF](https://arxiv.org/pdf/2606.26294)

> Abstract: Self-improving agents are state-of-the-art (SOTA) on agentic coding benchmarks and have recently been extended to general domains. However, their search methods generally assume a stationary evaluation criterion: a fixed verifier, benchmark, or labeled dataset that remains valid as the agent improves. This ignores a central feature of evolution: species adapt as their environments change with them. We aim to bring the same principle to recursive self-improvement, making evaluation part of the improvement loop and opening search to evolving evaluators, adversarial objectives, and dynamic utilities that may surpass static benchmarks. We introduce the Red Queen Gödel Machine (RQGM), an evolutionary framework for recursive self-improvement under non-stationary utilities. The RQGM makes this possible through controlled utility evolution: search is organized into epochs with a fixed within-epoch evaluation criterion, while the utility can be updated at epoch boundaries, so self-improvement guarantees hold per epoch as the objective evolves across them. We begin by showing that even on verifiable coding tasks, the RQGM improves test pass rate over the prior SOTA by adding a complementary agent-as-a-judge code-review signal. This signal is cheaper and the RQGM uses 1.35x-1.72x fewer tokens. We then turn to scientific paper writing and reviewing, and Olympiad-level proof writing and grading, where the RQGM improves performance over prior self-improving agents: co-evolved writers reach 1.78x-1.86x higher acceptance rates under a diverse agent-as-a-judge panel, while co-evolved graders reach 9% higher ground-truth accuracy. In paper reviewing, the strongest baseline reviewer over-accepts AI-generated papers at up to 1.91x the human rate. The RQGM corrects this by introducing an adversarial objective that discovers reviewers equally stringent on AI and human work.
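
The epoch structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the float-valued "agent", the `make_utility` evaluator, the mutation scale, and the rule that raises the bar at each epoch boundary are all hypothetical stand-ins chosen for illustration. The only point it shows is the control flow, where the utility is frozen inside an epoch and may change only between epochs.

```python
import random
from typing import Callable, List, Tuple

Agent = float  # hypothetical stand-in: one number summarizing an agent
Utility = Callable[[Agent], float]


def make_utility(target: float) -> Utility:
    """Toy evaluator (assumption): higher score the closer an agent is to target."""
    return lambda agent: -abs(agent - target)


def run_epoch(
    population: List[Agent], utility: Utility, steps: int, rng: random.Random
) -> Tuple[List[Agent], List[float]]:
    """Search under a utility that stays fixed for the whole epoch."""
    best_scores = []
    for _ in range(steps):
        parent = max(population, key=utility)
        child = parent + rng.gauss(0.0, 0.1)  # toy mutation of the best agent
        worst = min(range(len(population)), key=lambda i: utility(population[i]))
        # Elitist replacement: a child only ever displaces a worse agent,
        # so the best score under the frozen utility cannot decrease.
        if utility(child) > utility(population[worst]):
            population[worst] = child
        best_scores.append(max(utility(a) for a in population))
    return population, best_scores


def co_evolve(num_epochs: int = 3, steps: int = 50, seed: int = 0) -> List[List[float]]:
    rng = random.Random(seed)
    population = [rng.uniform(0.0, 1.0) for _ in range(4)]
    target = 1.0
    history = []
    for _ in range(num_epochs):
        utility = make_utility(target)  # frozen for this epoch
        population, best_scores = run_epoch(population, utility, steps, rng)
        history.append(best_scores)
        # Epoch boundary: the evaluator is updated (toy rule, an assumption).
        target = max(population) + 1.0
    return history


if __name__ == "__main__":
    for epoch, scores in enumerate(co_evolve()):
        print(epoch, round(scores[0], 3), round(scores[-1], 3))
```

In this toy, the best score never drops within an epoch because the evaluator is fixed there, while scores from different epochs are not directly comparable because each epoch uses a different utility. The per-epoch guarantees in the paper are stated for its own framework and are not reproduced by this sketch.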

## Submission history

From: Alex Iacob [[view email](https://arxiv.org/show-email/d5e4744a/2606.26294)] **[v1]** Wed, 24 Jun 2026 18:38:26 UTC (1,058 KB)

Similar Articles

@Phoenixyin13: Incredible! This Red Queen Gödel Machine from NVIDIA, Cambridge University, and other teams is absolutely one of the most important AI papers I've seen recently. This time, the paper directly targets the core bottleneck of self-improving AI: previously, once the evaluator was fixed, it led to agents gaming the system or quickly stagnating...

X AI KOLs Timeline

The Red Queen Gödel Machine paper from NVIDIA, Cambridge University, and other teams solves the bottleneck of recursive self-improvement by co-evolving agents and evaluators. It surpasses existing SOTA on tasks like code and paper writing, providing an important methodology for controlled open-ended AI evolution.

Self-Evolving Deep Research via Joint Generation and Evaluation

arXiv cs.CL

Researchers from HKUST, ByteDance, and UCL propose SCORE, a co-evolutionary training framework that jointly trains an LLM as both a deep research report generator and an evaluator, using a meta-harness to dynamically adjust evaluation difficulty and prevent reward saturation. Experiments show consistent improvement in open-ended research report quality.