The Red Queen Gödel Machine paper, from NVIDIA, Cambridge University, and other teams, tackles a bottleneck in recursive self-improvement by co-evolving agents and evaluators. It reports results above the existing state of the art on tasks such as coding and paper writing, and offers a methodology for controlled open-ended AI evolution.
This paper introduces the Red Queen Gödel Machine (RQGM), an evolutionary framework for recursive self-improvement under non-stationary utilities, where agents and evaluators co-evolve, improving performance on coding tasks, scientific writing, and Olympiad-level proof grading.
An MIT team released a paper on self-evolving skills for Claude Code agents. Its Generate-Test-Verify-Co-Evolve framework achieves a 71.1% pass rate, surpassing Anthropic's skill-creator by 37 points.
A commentary emphasizing that, despite AI advances, human understanding remains crucial for safe and humane deployment. It urges users to verify AI outputs and to treat AI with respect.
Introduces the concept of synthetic counteradaptation, where humans and AI systems co-evolve by adapting to each other's strategies, illustrated through examples from Go, social interactions, and geopolitical simulations.
This paper proposes three co-evolutionary mechanisms (evaluator co-evolution, hierarchical deep evaluation, and weakness pressure) for LLM-driven code evolution in adversarial multi-agent games, achieving state-of-the-art results on the MCTF 2026 maritime capture-the-flag task.
EvoTrainer introduces an autonomous training framework that co-evolves LLM policies and training harnesses through empirical feedback, outperforming human-engineered RL baselines on mathematical reasoning, code generation, and long-horizon software engineering tasks.
Proposes CoEvo-AHD, an LLM-driven dual-population co-evolutionary framework for automated heuristic design in bi-component coupled combinatorial optimization problems. It leverages LLMs to co-evolve route and selection operators, using cooperative evaluation and joint crossover to discover complementary heuristics for problems like TTP and TPP.
HarnessForge proposes a meta-adaptive framework for evolving LLM agent systems by jointly optimizing the execution harness and reasoning policy, achieving consistent improvements on Qwen3 backbones across five benchmarks.
SCOPE is a self-play framework for open-ended tasks that co-evolves a Challenger policy and a Solver policy, gaining up to 10.4 points on benchmarks without external supervision.
SEAL proposes a closed-loop framework for jointly evolving LLM agents and their training environments, using diagnosis-guided labels to align both sides. It achieves substantial gains in multi-turn tool-use tasks with only 400 training samples, demonstrating improved robustness and out-of-distribution transfer.
SEAL is a closed-loop co-evolution framework for interactive tool-use agents that addresses Agent-Environment Misalignment by synchronizing policy and environment updates using on-policy trajectories and turn-level diagnosis.
MetaAgent-X introduces an end-to-end reinforcement learning framework that jointly optimizes the design and execution of automatic multi-agent systems, overcoming the frozen-executor ceiling and achieving up to 21.7% gains over existing baselines.
RoboEvolve is a framework that co-evolves a VLM planner and a VGM simulator for robotic manipulation. It is data-efficient, requiring only 500 unlabeled seed images, and supports robust continual learning.
This paper introduces CoCoDA, a framework that uses a co-evolving compositional Directed Acyclic Graph (DAG) to manage tool libraries for augmented agents. It enables small language models to efficiently retrieve and compose tools, allowing an 8B model to match or exceed the performance of a 32B model on reasoning benchmarks.
This paper introduces GAMBIT, a benchmark for evaluating adversarial robustness in multi-agent LLM collectives, featuring adaptive imposters and recalibration modes to address the limitations of existing shallow evaluations.
This paper introduces G-Zero, a verifier-free framework that enables autonomous large language model self-improvement through co-evolutionary training using intrinsic rewards and hint-based guidance. It aims to overcome the limitations of proxy LLM judges in open-ended tasks by deriving supervision from internal distributional dynamics.
This paper introduces TacoMAS, a framework for test-time co-evolution of agent capabilities and communication topology in LLM-based multi-agent systems. It demonstrates that jointly adapting fast capability loops and slow topology loops improves performance and stability over existing baselines.