# Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning
Source: [https://arxiv.org/html/2605.14212](https://arxiv.org/html/2605.14212)
Yaolun Zhang1,5,∗, Yujie Zhao2,∗, Nan Wang3,†, Yiran Wu4,5, Jiayu Chang2, Yizhao Chen2, Qingyun Wu4,5, Jishen Zhao2, Huazheng Wang1,5
1Oregon State University, 2UCSD, 3Amazon AGI, 4Pennsylvania State University, 5AG2AI, Inc.
{zhanyaol, huazheng.wang}@oregonstate.edu, {yuz285, yic138, jzhao}@ucsd.edu, nanww@amazon.com, {yiran.wu, qingyun.wu}@psu.edu
###### Abstract
Automatic multi-agent systems (MAS) aim to instantiate agent workflows without relying on manually designed or fixed orchestration. However, existing automatic MAS approaches remain only partially adaptive: they either perform training-free test-time search or optimize the meta-level designer while keeping downstream execution agents frozen, which creates a frozen-executor ceiling and leaves the end-to-end training of self-designing and self-executing agentic models unexplored. To address this, we introduce MetaAgent-X, an end-to-end reinforcement learning framework that jointly optimizes automatic MAS design and execution. MetaAgent-X enables script-based MAS generation, execution rollout collection, and credit assignment for both designer and executor trajectories. To support stable and scalable optimization, we propose Executor-Designer Hierarchical Rollout and Stagewise Co-evolution, which improve training stability and expose the dynamics of designer-executor co-evolution. MetaAgent-X consistently outperforms existing automatic MAS baselines, achieving gains of up to 21.7%. Comprehensive ablations show that both designer and executor improve throughout training, and that effective automatic MAS learning follows a stagewise co-evolution process. These results establish end-to-end trainable automatic MAS as a practical paradigm for building self-designing and self-executing agentic models.
∗ Equal contribution. † This work is unrelated to the author's position at Amazon.

## 1 Introduction
Multi-agent systems (MAS) have demonstrated clear advantages over single-agent approaches across a wide range of domains, including medical decision-making (Kim et al., [2024](https://arxiv.org/html/2605.14212#bib.bib132); Zhou et al., [2025](https://arxiv.org/html/2605.14212#bib.bib137)), scientific discovery (Su et al., [2024](https://arxiv.org/html/2605.14212#bib.bib133); Ghafarollahi and Buehler, [2024](https://arxiv.org/html/2605.14212#bib.bib74)), financial trading (Xiao et al., [2024](https://arxiv.org/html/2605.14212#bib.bib138)), software engineering (Yu et al., [2025](https://arxiv.org/html/2605.14212#bib.bib15); Hong et al., [2023](https://arxiv.org/html/2605.14212#bib.bib114); Chen et al., [2024](https://arxiv.org/html/2605.14212#bib.bib139)), and hardware design (Zhao et al., [2024](https://arxiv.org/html/2605.14212#bib.bib131); Ho et al., [2025](https://arxiv.org/html/2605.14212#bib.bib140)). Rather than relying on manually specified or fixed workflows, recent work has increasingly turned to meta-agents as a paradigm for automatically designing and instantiating the multi-agent system flow best suited to each task, enabling more adaptive orchestration and execution of MAS (Gao et al., [2025](https://arxiv.org/html/2605.14212#bib.bib1); Ye et al., [2025](https://arxiv.org/html/2605.14212#bib.bib5); Dang et al., [2025](https://arxiv.org/html/2605.14212#bib.bib4); Nielsen et al., [2025](https://arxiv.org/html/2605.14212#bib.bib3); Zhang et al., [2025b](https://arxiv.org/html/2605.14212#bib.bib141)).
Meanwhile, agentic reinforcement learning and self-evolving paradigms have emerged as promising pathways to transform large language models into interactive, continuously improving decision-makers (Wang et al., [2025c](https://arxiv.org/html/2605.14212#bib.bib22); Cheng et al., [2025](https://arxiv.org/html/2605.14212#bib.bib145); Li et al., [2025b](https://arxiv.org/html/2605.14212#bib.bib142); Zhao et al., [2026](https://arxiv.org/html/2605.14212#bib.bib2); Zhang et al., [2026](https://arxiv.org/html/2605.14212#bib.bib152); Xia et al., [2025](https://arxiv.org/html/2605.14212#bib.bib147); Chen et al., [2025b](https://arxiv.org/html/2605.14212#bib.bib146); Fu et al., [2025](https://arxiv.org/html/2605.14212#bib.bib148)). Although recent automatic MAS work has begun to embrace these paradigms, the transition remains incomplete. Current approaches typically restrict adaptation to training-free test-time search, or optimize only the MAS designer while freezing downstream execution agents (Ye et al., [2025](https://arxiv.org/html/2605.14212#bib.bib5); Gao et al., [2025](https://arxiv.org/html/2605.14212#bib.bib1); Dang et al., [2025](https://arxiv.org/html/2605.14212#bib.bib4); Nielsen et al., [2025](https://arxiv.org/html/2605.14212#bib.bib3); Wang et al., [2025a](https://arxiv.org/html/2605.14212#bib.bib143)). End-to-end training of self-designing and self-executing auto-MAS remains unexplored, resulting in two fundamental limitations. 1) Parameter-level disjunction. Existing methods couple the designer and executor only through prompt-level interactions at inference time, without optimization signals that update the underlying policy based on downstream execution outcomes. As a result, a frozen executor imposes a hard ceiling on the meta-designer, while the designer cannot induce specialized execution behaviors from its counterpart. 2) Vague co-evolution dynamics. How the designer and executor co-evolve under joint training, and where each role's improvement comes from, remain unclear both empirically and mechanistically.
As shown in Figure [1](https://arxiv.org/html/2605.14212#S1.F1)(A), existing automatic MAS approaches remain partially adaptive: they either search over MAS structures at test time or optimize only the designer while freezing the execution system. To overcome these limitations, we introduce MetaAgent-X, an end-to-end framework for training agentic models that can self-design and self-execute MAS. Figure [1](https://arxiv.org/html/2605.14212#S1.F1)(B) gives an overview of MetaAgent-X, where task-conditioned auto-MAS designs are instantiated, executed, grouped, and collected for role-aware policy updates. To address the first limitation, MetaAgent-X facilitates script-based MAS generation, rollout collection, and precise credit assignment for both the designer and the executor. To address the second limitation, the framework incorporates diverse evolving mechanisms, such as hierarchical rollouts and stagewise optimization, allowing us to isolate the critical decision factors that drive auto-MAS co-evolution.
Figure 1: From Partial Adaptation to End-to-End Trainable Automatic MAS. A. Comparison of three automatic MAS paradigms. B. Overview of our training framework.

Our framework consists of three novel design principles. First, MetaAgent-X supports flexible designer-executor optimization across tasks and domains, where the two components can be trained with diverse evolving mechanisms. This flexibility enables a systematic analysis of how designer-executor co-evolution emerges and how each component contributes to the final automatic MAS capability. Second, we propose Executor-Designer Hierarchical Rollout, which organizes the interaction process as a two-level tree structure to support efficient rollout generation and accurate credit assignment. Third, we propose Stagewise Co-evolution, which decouples the learning stages of the designer and executor to improve training stability and scalability. Based on these mechanisms, we conduct comprehensive experiments and ablation studies to evaluate the effectiveness of MetaAgent-X and analyze the internal dynamics of designer-executor co-evolution. Across six math and code benchmarks and two different base models, MetaAgent-X outperforms the baselines by up to 21.7%.
This paper makes the following contributions:
1. We propose MetaAgent-X, an end-to-end training framework for automatic MAS, which explicitly optimizes designer and executor agents together.
2. We introduce two mechanisms for stable and scalable meta-agent optimization: (i) Executor-Designer Hierarchical Rollout, which enables structured rollout generation and accurate credit assignment, and (ii) Stagewise Co-evolution, which supports decoupled and scalable designer-executor learning.
3. We demonstrate that MetaAgent-X achieves consistent gains across diverse math and code benchmarks, surpassing both single-agent and automatic MAS baselines by up to 21.7%.
4. We conduct comprehensive ablation studies to examine the internal mechanisms of meta-agent co-evolution. Our analysis shows that (1) both the designer and the executor are optimized throughout training across tasks and domains, and (2) such effective co-evolution follows a stagewise process in which the two components benefit from decoupled optimization.
## 2 Related work
### 2.1 Meta Agents for Automatic Multi-Agent Systems
LLM-based MAS improve complex problem solving by decomposing tasks into specialized roles, structured interactions, and coordination protocols (Qian et al., [2024](https://arxiv.org/html/2605.14212#bib.bib70); Hong et al., [2024](https://arxiv.org/html/2605.14212#bib.bib115); Wu et al., [2023](https://arxiv.org/html/2605.14212#bib.bib76)). Beyond manually designed workflows, recent work introduces a meta-agent that automatically constructs or adapts an executable MAS for each input task (Ye et al., [2025](https://arxiv.org/html/2605.14212#bib.bib5); Gao et al., [2025](https://arxiv.org/html/2605.14212#bib.bib1); Dang et al., [2025](https://arxiv.org/html/2605.14212#bib.bib4); Nielsen et al., [2025](https://arxiv.org/html/2605.14212#bib.bib3); Zhang et al., [2025b](https://arxiv.org/html/2605.14212#bib.bib141)). A meta-agent maps a query into roles, prompts, communication patterns, or execution flows, after which the instantiated system interacts with the environment to produce the final outcome.
As shown in Fig. [1](https://arxiv.org/html/2605.14212#S1.F1), existing automatic MAS methods mainly fall into two partial-adaptation regimes. Training-free adaptation searches over prompts, roles, workflows, or agent organizations at test time without updating model parameters (Zhang et al., [2025b](https://arxiv.org/html/2605.14212#bib.bib141); Dang et al., [2025](https://arxiv.org/html/2605.14212#bib.bib4)). Semi-trainable adaptation optimizes a meta-level designer or controller while keeping downstream executors fixed. Examples include MAS-GPT (Ye et al., [2025](https://arxiv.org/html/2605.14212#bib.bib5)), which generates query-adaptive MAS designs, FlowReasoner (Gao et al., [2025](https://arxiv.org/html/2605.14212#bib.bib1)), which learns query-level multi-agent reasoning flows, and orchestration-based controllers for dynamic coordination (Nielsen et al., [2025](https://arxiv.org/html/2605.14212#bib.bib3)). Similarly, MAS2 (Wang et al., [2025a](https://arxiv.org/html/2605.14212#bib.bib143)) trains the designer via reinforcement learning while still using API-based models as executors. These methods improve system design or orchestration, but do not jointly optimize executor policies.
This partial adaptation limits automatic MAS because frozen executors impose a ceiling on final performance and prevent designer-executor co-adaptation. Chain-of-Agents takes a related end-to-end direction by training an Agent Foundation Model through multi-agent distillation and agentic reinforcement learning (Li et al., [2025a](https://arxiv.org/html/2605.14212#bib.bib153)), but it largely optimizes the agent system as a unified behavior and treats MAS as a simple chain of thought without context management. In contrast, our work studies the end-to-end trainable regime, where automatic MAS evolves both how agent systems are designed and how instantiated agents execute them, making designer-executor co-evolution explicit and analyzable.
### 2.2 Agent System Self-Evolution and Multi-Agent Training
In parallel with meta-agent-based automatic MAS, agentic reinforcement learning and self-evolution have emerged as promising paradigms for improving LLM agents through interaction, environment feedback, and iterative experience collection (Wang et al., [2025c](https://arxiv.org/html/2605.14212#bib.bib22); Cheng et al., [2025](https://arxiv.org/html/2605.14212#bib.bib145); Li et al., [2025b](https://arxiv.org/html/2605.14212#bib.bib142); Zhao et al., [2026](https://arxiv.org/html/2605.14212#bib.bib2); Zhang et al., [2026](https://arxiv.org/html/2605.14212#bib.bib152); Xia et al., [2025](https://arxiv.org/html/2605.14212#bib.bib147); Chen et al., [2025b](https://arxiv.org/html/2605.14212#bib.bib146); Fu et al., [2025](https://arxiv.org/html/2605.14212#bib.bib148)). Within the multi-agent setting, recent methods such as MAPoRL (Park et al., [2025](https://arxiv.org/html/2605.14212#bib.bib7)), AT-GRPO (Zhao et al., [2026](https://arxiv.org/html/2605.14212#bib.bib2)), Dr. MAS (Feng et al., [2026](https://arxiv.org/html/2605.14212#bib.bib9)), MAE (Chen et al., [2025a](https://arxiv.org/html/2605.14212#bib.bib12)), and MARFT (Liao et al., [2025](https://arxiv.org/html/2605.14212#bib.bib11)) mainly focus on improving collaboration under fixed or predefined multi-agent structures. These methods study important problems such as multi-agent credit assignment, coordination, communication, and training stability. However, the agent organization itself is usually treated as given, rather than as a learned object that should be generated, evaluated, and improved together with execution behavior.
Our work differs from these self-evolution and agent foundation model approaches in both objective and analysis. Instead of assuming a fixed MAS structure or optimizing an agent system as an undifferentiated whole, we explicitly formulate automatic MAS learning as a designer-executor co-evolution problem. This enables us to break the frozen-executor performance ceiling while also studying the internal mechanism of automatic MAS co-evolution.
## 3 Method
### 3.1 End-to-End Online Meta-Agent RL Pipeline
Figure 2: Overview of the end-to-end online MetaAgent-X pipeline. The Designer first generates a task-specific multi-agent system, then the Executor runs the instantiated MAS in the environment. The collected trajectories and rewards are labeled by role and optimized with GRPO.

Figure [2](https://arxiv.org/html/2605.14212#S3.F2) shows our reinforcement learning pipeline. Given a task query $q$, the MetaAgent first uses a Designer policy $\pi^{\mathcal{D}}_{\vartheta_{\mathcal{D}}}$ to generate a task-specific multi-agent system, and then uses an Executor policy $\pi^{\mathcal{E}}_{\vartheta_{\mathcal{E}}}$ to run the instantiated system in an external environment. We denote the full trainable parameter set by $\vartheta = \{\vartheta_{\mathcal{D}}, \vartheta_{\mathcal{E}}\}$. This notation covers both policy sharing and policy splitting: in the shared-policy setting, $\vartheta_{\mathcal{D}} = \vartheta_{\mathcal{E}} = \theta$; in the split-policy setting, $\vartheta_{\mathcal{D}}$ and $\vartheta_{\mathcal{E}}$ are optimized as separate parameter sets. The learning problem is therefore a coupled online reinforcement learning problem:
$$d \sim \pi^{\mathcal{D}}_{\vartheta_{\mathcal{D}}}(\cdot \mid q), \qquad e \sim \pi^{\mathcal{E}}_{\vartheta_{\mathcal{E}}}(\cdot \mid q, d), \qquad R = R(q, d, e), \tag{1}$$

where $d$ denotes the generated system design, $e$ denotes the execution trajectory, and $R$ is the environment feedback returned after execution. The central challenge is that design and execution are interdependent: their performance is coupled. Thus, the training pipeline must support online system construction, batched environment execution, trajectory collection, and role-aware credit assignment within a unified RL framework.
#### Online system construction.

To support compositional system design, we build a training framework that contains predefined coordination structures, agent templates, and tool interfaces. For each query, the Designer composes these building blocks into a customized multi-agent system by generating lightweight Python scripts. These scripts specify the agent roles, interaction protocol, tool-usage pattern, and execution control flow. After a design is instantiated, the Executor runs the generated workflow in the target environment. Our framework supports batched rollout execution across multiple queries and sampled designs. For each rollout, the system records the trajectories, environment observations, tool calls, and the outcome-based rewards (detailed in Appendix [B](https://arxiv.org/html/2605.14212#A2)).
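To make the script-based interface concrete, below is a minimal sketch of the kind of workflow script a Designer might emit for a solver-critic reflection design. All names here (`call_llm`, `run_workflow`, the role prompts) are our own illustrative assumptions, not the framework's actual API.

```python
# Hypothetical sketch of a Designer-emitted workflow script. The Designer
# composes predefined building blocks into a solver-critic reflection loop
# for one query; in the real framework the agent turns are served by the
# trainable Executor policy.

def call_llm(role_prompt: str, message: str) -> str:
    """Placeholder for one agent turn served by the Executor policy."""
    raise NotImplementedError  # backed by the shared Executor LLM in practice


def run_workflow(query: str, max_rounds: int = 2) -> str:
    solver_prompt = "You are a careful math solver. Show your reasoning."
    critic_prompt = "You are a critic. Find concrete errors or reply APPROVE."

    solution = call_llm(solver_prompt, query)
    for _ in range(max_rounds):
        feedback = call_llm(critic_prompt,
                            f"Problem: {query}\nSolution: {solution}")
        if "APPROVE" in feedback:  # critic accepts the current solution
            break
        # Solver revises using the critic's localized feedback.
        solution = call_llm(solver_prompt,
                            f"Problem: {query}\nPrevious attempt: {solution}\n"
                            f"Critic feedback: {feedback}\nRevise the solution.")
    return solution
```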
#### GRPO objective.

We optimize the role policies with Group Relative Policy Optimization (GRPO). For each role $r \in \{\mathcal{D}, \mathcal{E}\}$, let $\mathcal{G}^r$ denote the corresponding GRPO group, and let $\hat{A}^r_i$ be the normalized role-specific advantage for trajectory $i$. Let $\vartheta_r$ denote the parameters used by role $r$. The clipped policy objective for role $r$ is

$$\mathcal{L}_r(\vartheta_r) = -\frac{1}{|\mathcal{G}^r|} \sum_{i \in \mathcal{G}^r} \min\left( \rho_i^r(\vartheta_r)\,\hat{A}^r_i,\; \operatorname{clip}\left(\rho_i^r(\vartheta_r),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}^r_i \right), \tag{2}$$

where

$$\rho_i^r(\vartheta_r) = \frac{\pi^r_{\vartheta_r}(o_i \mid c_i)}{\pi^r_{\vartheta_{r,\mathrm{old}}}(o_i \mid c_i)}. \tag{3}$$

Here $c_i$ is the context of trajectory $i$, $o_i$ is the generated output tokens, and $\pi^r_{\vartheta_{r,\mathrm{old}}}$ is the role-specific behavior policy used for rollout collection. The role-specific advantages $\hat{A}^{\mathcal{D}}$ and $\hat{A}^{\mathcal{E}}$ are computed using the hierarchical credit-assignment scheme in Section [3.2](https://arxiv.org/html/2605.14212#S3.SS2).
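For concreteness, here is a minimal PyTorch sketch of the clipped objective in Eqs. (2)-(3) for one role group. The inputs are assumptions about how trajectories are batched: summed output-token log-probabilities under the current and behavior policies, plus the precomputed role-specific advantages.

```python
# Minimal sketch of the clipped GRPO objective (Eqs. 2-3) for one role group.
import torch

def grpo_loss(logp_new: torch.Tensor,   # shape [G], requires grad
              logp_old: torch.Tensor,   # shape [G], detached behavior policy
              adv: torch.Tensor,        # shape [G], normalized advantages
              eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)            # rho_i in Eq. (3)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()      # Eq. (2)
```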
Further, because the Designer and Executor are optimized through coupled online feedback, we introduce a stagewise training schedule that provides a relatively stable environment for optimizing both roles. We discuss the details in Section [3.3](https://arxiv.org/html/2605.14212#S3.SS3).
### 3.2 Hierarchical Credit Assignment via Tree-Structured Rollout
A central challenge in training end-to-end automatic MAS with RL is credit assignment: when a multi-agent system succeeds or fails at a task, is the outcome attributable to the quality of the Designer's plan or the competence of the Executor's actions? Standard single-level rollout conflates these two sources of variation, producing entangled reward signals that destabilize training. We address this through a tree-structured rollout scheme that decomposes credit across roles.
#### Bi-level Tree-Structured Rollout.

For each training question $q$, we construct a two-level sampling tree. At the first level, the Designer $\pi^{\mathcal{D}}_{\vartheta_{\mathcal{D}}}$ generates $M$ independent multi-agent system designs $\{d_1, d_2, \ldots, d_M\}$, each specifying a distinct agent topology, role assignment, and coordination protocol. At the second level, for each design $d_i$, the Executor $\pi^{\mathcal{E}}_{\vartheta_{\mathcal{E}}}$ carries out $N$ independent execution rollouts $\{e_{i,1}, e_{i,2}, \ldots, e_{i,N}\}$. This yields an $M \times N$ evaluation matrix per question, where entry $(i,j)$ corresponds to design $d_i$ executed by rollout $e_{i,j}$, with outcome reward $R(e_{i,j}, d_i)$.
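The bi-level sampling can be sketched as a nested loop. The sketch below assumes caller-supplied functions standing in for the Designer sample, the Executor rollout, and the outcome verifier, and returns the $M \times N$ reward matrix.

```python
# Sketch of collecting the M x N outcome matrix for one question (Section 3.2).
import numpy as np
from typing import Callable

def collect_rollout_tree(q: str,
                         sample_design: Callable[[str], str],
                         execute_design: Callable[[str, str], str],
                         reward_fn: Callable[[str, str, str], float],
                         M: int = 4, N: int = 4) -> np.ndarray:
    designs = [sample_design(q) for _ in range(M)]    # first level: Designer
    rewards = np.zeros((M, N))
    for i, d in enumerate(designs):
        for j in range(N):
            e = execute_design(q, d)                  # second level: Executor
            rewards[i, j] = reward_fn(q, d, e)        # R(e_{i,j}, d_i)
    return rewards
```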
#### Decomposed Advantage Estimation.

The tree structure enables us to compute *separate* advantage estimates for each role via distinct grouping strategies within the GRPO framework.
**Designer advantage.** To isolate the effect of design quality from execution-level stochasticity, we aggregate over the execution level. For each design $d_i$ under question $q$, we define the design-level reward as the mean execution outcome:

$$\bar{R}^{\mathcal{D}}_i = \frac{1}{N} \sum_{j=1}^{N} R(e_{i,j}, d_i). \tag{4}$$

The advantage for design $d_i$ is then computed by comparing against all $M$ designs for the same question:

$$\hat{A}^{\mathcal{D}}_i = \frac{\bar{R}^{\mathcal{D}}_i - \mu_q^{\mathcal{D}}}{\sigma_q^{\mathcal{D}} + \epsilon}, \quad \text{where} \quad \mu_q^{\mathcal{D}} = \frac{1}{M} \sum_{k=1}^{M} \bar{R}^{\mathcal{D}}_k, \quad \sigma_q^{\mathcal{D}} = \operatorname{std}\left(\{\bar{R}^{\mathcal{D}}_k\}_{k=1}^{M}\right). \tag{5}$$

By averaging over $N$ executions, the stochasticity of individual rollouts is smoothed out, yielding a reward signal that reflects the intrinsic quality of the design itself.
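A minimal sketch of Eqs. (4)-(5), operating on the $M \times N$ reward matrix produced by the rollout sketch above:

```python
# Designer advantage (Eqs. 4-5): average each design's row of rewards, then
# normalize across the M designs of the same question.
import numpy as np

def designer_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    design_rewards = rewards.mean(axis=1)              # Eq. (4): R-bar_i
    mu = design_rewards.mean()
    sigma = design_rewards.std()
    return (design_rewards - mu) / (sigma + eps)       # Eq. (5): A-hat_i
```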
**Executor advantage.** For each execution rollout $e_{i,j}$, the Executor produces a set of agent trajectories, denoted by $\mathcal{T}_{i,j}$. We use the outcome reward of the rollout, $R(e_{i,j}, d_i)$, as the reward for all trajectories in $\mathcal{T}_{i,j}$. To compute the Executor advantage, we collect all executor trajectories for the same question into a GRPO group:

$$\mathcal{G}_q^{\mathcal{E}} = \left\{ \tau \;\middle|\; \tau \in \mathcal{T}_{i,j},\; i \in [M],\; j \in [N] \right\}. \tag{6}$$

The advantage of each trajectory is then normalized at the question level:

$$\hat{A}^{\mathcal{E}}(\tau) = \frac{R(e_{i,j}, d_i) - \mu_q^{\mathcal{E}}}{\sigma_q^{\mathcal{E}} + \epsilon}, \quad \tau \in \mathcal{T}_{i,j}, \tag{7}$$

where $\mu_q^{\mathcal{E}}$ and $\sigma_q^{\mathcal{E}}$ denote the mean and standard deviation of the rollout rewards associated with trajectories in $\mathcal{G}_q^{\mathcal{E}}$. Compared with single-level rollout normalization, question-level normalization compares executor trajectories generated under both the same and different designs, thereby providing a more stable training signal for the executor.
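A matching sketch of Eqs. (6)-(7). The per-rollout trajectory counts are an assumption of this sketch, since each execution may spawn a different number of agent trajectories:

```python
# Executor advantage (Eqs. 6-7): every agent trajectory inside rollout (i, j)
# inherits that rollout's outcome reward, and normalization is over all
# trajectories of the question's M x N rollouts. `traj_counts[i][j]` is an
# assumption of this sketch: |T_{i,j}|, the number of trajectories in (i, j).
import numpy as np

def executor_advantages(rewards: np.ndarray,
                        traj_counts: list[list[int]],
                        eps: float = 1e-8) -> np.ndarray:
    M, N = rewards.shape
    per_traj = np.array([rewards[i, j]
                         for i in range(M)
                         for j in range(N)
                         for _ in range(traj_counts[i][j])])
    mu, sigma = per_traj.mean(), per_traj.std()
    return (per_traj - mu) / (sigma + eps)             # Eq. (7) per trajectory
```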
### 3.3 Stagewise Executor-Designer Co-evolution
The hierarchical rollout in Section [3.2](https://arxiv.org/html/2605.14212#S3.SS2) provides decomposed reward signals for the Designer ($\mathcal{D}$) and Executor ($\mathcal{E}$) roles. However, since the two roles' rewards are mutually conditioned, a fundamental optimization challenge arises: *how should we update $\pi_\theta$ when $\mathcal{D}$ and $\mathcal{E}$ serve as each other's environment?*
The Designer and Executor form a tightly coupled system in which each role is the other's environment: the Executor acts within the MAS structure emitted by the Designer, while the Designer's reward is determined by the capability of the Executor. Formally, the return is a nested expectation:
$$J(\theta) = \mathbb{E}_{d \sim \pi^{\mathcal{D}}_{\vartheta_{\mathcal{D}}}}\left[ \mathbb{E}_{e \sim \pi^{\mathcal{E}}_{\vartheta_{\mathcal{E}}}(\cdot \mid d)}\left[ R(e, d) \right] \right]. \tag{8}$$

Inspired by multi-agent RL studies on non-stationarity and sequential optimization (Hernandez-Leal et al., [2019](https://arxiv.org/html/2605.14212#bib.bib149); Yu et al., [2022](https://arxiv.org/html/2605.14212#bib.bib150); Nekoei et al., [2023](https://arxiv.org/html/2605.14212#bib.bib151)), we introduce a stagewise schedule that alternates which role provides the trajectories for policy-gradient updates. At training step $t$, we select the active role by fixed-length phases of $K$ steps:

$$(\alpha_{\mathcal{D}}^{(t)},\; \alpha_{\mathcal{E}}^{(t)}) = \begin{cases} (0,\; 1), & \lfloor t/K \rfloor \bmod 2 = 0 \quad \text{(Executor stage)}, \\ (1,\; 0), & \lfloor t/K \rfloor \bmod 2 = 1 \quad \text{(Designer stage)}. \end{cases} \tag{9}$$

Only trajectories from the active role contribute to the gradient, while the shared parameters $\vartheta$ are updated continuously. This isolates each phase to one reward distribution and reduces gradient interference between role-specific objectives.
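Eq. (9) reduces to a simple step-indexed gate; a minimal sketch:

```python
# Stagewise schedule (Eq. 9): phases of K steps alternate which role's
# trajectories contribute to the gradient.
def active_role(t: int, K: int = 30) -> str:
    """Return which role is optimized at training step t."""
    return "executor" if (t // K) % 2 == 0 else "designer"

def stage_weights(t: int, K: int = 30) -> tuple[float, float]:
    """(alpha_D, alpha_E) masks applied to the designer/executor losses."""
    return (0.0, 1.0) if active_role(t, K) == "executor" else (1.0, 0.0)
```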
The two stages form a co-evolutionary loop. Executor stages improve the ability to solve tasks under the current design distribution, producing more reliable execution outcomes. Designer stages then use these lower-noise returns to learn structures that better exploit the improved Executor. Without such staging, the effective reward distribution becomes non-stationary and the two role-specific objectives can produce noisy or conflicting updates.
## 4 Experiments
### 4.1 Experimental Setup

#### Models and Compute.

We train and evaluate Qwen3 (Yang and the Qwen Team, [2025](https://arxiv.org/html/2605.14212#bib.bib68)) at the 4B and 8B parameter scales in no-thinking mode. All experiments are conducted on a single node equipped with eight H200 GPUs. Unless otherwise specified, both the maximum prompt length and maximum response length are set to 8192 tokens. In our main experiments, we use the shared-policy setting, in which the Designer and Executor use the same LLM backbone.
#### Training Procedure.

Our training proceeds in two stages: a supervised fine-tuning (SFT) cold start followed by reinforcement learning (RL) co-evolution. During the SFT stage, we initialize the policy by distilling trajectories from DeepSeek-V3.2 prompted with diverse workflow templates (further details on the cold start are provided in Appendix [A](https://arxiv.org/html/2605.14212#A1)). In the RL stage, we adopt stagewise designer-executor co-evolution with a stage length of $K = 30$. For each query, the Designer generates $M = 4$ candidate MAS designs, and each design is executed $N = 4$ times. At each stage, only the active role is updated with a learning rate of $5 \times 10^{-6}$, while gradients from the inactive role are masked.
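A minimal sketch of how the stage masks of Eq. (9) could gate the two role losses during an update step; the loss tensors are assumed to come from a GRPO computation like the sketch in Section 3.1, and the masking scheme shown is our illustration rather than the released training code.

```python
# Role-masked loss aggregation during stagewise training: rollouts from both
# roles are collected, but only the active role's loss carries a nonzero
# weight (the alpha masks of Eq. 9 with K = 30).
import torch

def masked_total_loss(designer_loss: torch.Tensor,
                      executor_loss: torch.Tensor,
                      t: int, K: int = 30) -> torch.Tensor:
    executor_stage = (t // K) % 2 == 0
    alpha_d, alpha_e = (0.0, 1.0) if executor_stage else (1.0, 0.0)
    return alpha_d * designer_loss + alpha_e * executor_loss
```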
#### Training Datasets.

For the SFT cold start, the dataset consists of 3K Designer examples and 8K Executor examples, filtered from correct DeepSeek-V3.2 generations. For the RL stage, we train on a mixture of math and code data to encourage cross-task generalization. With an RL batch size of 8, half of each batch is sampled from Polaris-Dataset-53K (An et al., [2025](https://arxiv.org/html/2605.14212#bib.bib105)), and the remaining half is sampled from the APPS introductory subset (Hendrycks et al., [2021](https://arxiv.org/html/2605.14212#bib.bib60)) and CodeContests (DeepMind, [2024](https://arxiv.org/html/2605.14212#bib.bib42)).
#### Baselines.

We compare with four groups of baselines. Single-agent baselines include direct prompting and GRPO, both using the same Qwen3-4B or 8B backbone as our method; GRPO is trained on the same math and code mixture. Search-based MAS optimization baselines include AFlow (Zhang et al., [2024](https://arxiv.org/html/2605.14212#bib.bib136)) and ADAS (Hu et al., [2025](https://arxiv.org/html/2605.14212#bib.bib154)). For AFlow, we use the official best-searched workflows for math and code. For ADAS, we use the official best-searched math agent and run the search protocol for code, since no official code agent is released. RL-based MAS optimization baselines include ScoreFlow (Wang et al., [2025b](https://arxiv.org/html/2605.14212#bib.bib155)), MaAS (Zhang et al., [2025a](https://arxiv.org/html/2605.14212#bib.bib156)), and AFM (Li et al., [2025a](https://arxiv.org/html/2605.14212#bib.bib153)). For AFM, since the officially released checkpoint most comparable in scale to our setting is AFM-Coder-7B, we evaluate this checkpoint following the official code-agent evaluation framework. All baselines follow the default settings in their original papers or released code. Details are given in Appendix [D](https://arxiv.org/html/2605.14212#A4).
#### Benchmarks.

We evaluate our models on both mathematical reasoning and code generation benchmarks. For math, we use AIME24/AIME25 (Mathematical Association of America & AoPS Community, [2024](https://arxiv.org/html/2605.14212#bib.bib62), [2025](https://arxiv.org/html/2605.14212#bib.bib63)) and OlympiadBench (He et al., [2024](https://arxiv.org/html/2605.14212#bib.bib64)). We evaluate each AIME benchmark 3 times and report the average. All math tasks are evaluated with verifier-checked numeric scoring. For code, we use three widely adopted benchmarks: APPS (Hendrycks et al., [2021](https://arxiv.org/html/2605.14212#bib.bib60)), LiveCodeBench-v6 (Jain et al., [2024](https://arxiv.org/html/2605.14212#bib.bib41)), and CodeContests (DeepMind, [2024](https://arxiv.org/html/2605.14212#bib.bib42)). Code tasks are evaluated by executing generated solutions against the official or benchmark-provided test cases.
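As an illustration of outcome-based code scoring, the sketch below runs a generated program against input/output test pairs; real harnesses add sandboxing and resource limits, and the function and parameter names are our own, not the benchmarks' official tooling.

```python
# Minimal pass/fail scoring sketch: execute a generated program against
# benchmark test cases in a subprocess and score 1.0 only if every case
# matches the expected output.
import subprocess

def score_solution(source_path: str,
                   tests: list[tuple[str, str]],
                   timeout_s: float = 5.0) -> float:
    for stdin_data, expected in tests:
        try:
            out = subprocess.run(["python", source_path],
                                 input=stdin_data, capture_output=True,
                                 text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return 0.0  # timed out on this test case
        if out.stdout.strip() != expected.strip():
            return 0.0  # wrong answer
    return 1.0
```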
Table 1: Qwen3-8B results on coding and math benchmarks. Parentheses denote the gain over the Single Agent (SA) baseline; the best result per benchmark is in bold.

| Training Paradigm | Method | LiveCodeBench | APPS | CodeContests | AIME24 | AIME25 | OlympiadBench | Avg |
|---|---|---|---|---|---|---|---|---|
| Single Agent | SA | 22.80 (+0.00) | 30.20 (+0.00) | 15.75 (+0.00) | 18.30 (+0.00) | 20.90 (+0.00) | 55.00 (+0.00) | 27.16 (+0.00) |
| Single Agent | SA + GRPO | 25.70 (+2.90) | 37.00 (+6.80) | 12.12 (-3.63) | 18.30 (+0.00) | 26.67 (+5.77) | 54.80 (-0.20) | 29.10 (+1.94) |
| Search-based Auto MAS | AFlow | 28.60 (+5.80) | 27.40 (-2.80) | 15.80 (+0.05) | 16.67 (-1.63) | 20.83 (-0.07) | 35.31 (-19.69) | 24.10 (-3.06) |
| Search-based Auto MAS | ADAS | 20.00 (-2.80) | 27.00 (-3.20) | 12.20 (-3.55) | 13.30 (-5.00) | 16.70 (-4.20) | 32.90 (-22.10) | 20.35 (-6.81) |
| RL-based Auto MAS | ScoreFlow | 25.90 (+3.10) | 26.50 (-3.70) | 13.30 (-2.45) | 28.90 (+10.60) | 20.00 (-0.90) | 51.30 (-3.70) | 27.65 (+0.49) |
| RL-based Auto MAS | MaAS | 24.29 (+1.49) | 30.00 (-0.20) | 15.15 (-0.60) | **45.80 (+27.50)** | 29.20 (+8.30) | 48.90 (-6.10) | 32.22 (+5.06) |
| RL-based Auto MAS | AFM-Coder | 29.10 (+6.30) | 28.00 (-2.20) | **21.20 (+5.45)** | 12.00 (-6.30) | 8.00 (-12.90) | 21.80 (-33.20) | 20.35 (-6.81) |
| MetaAgent-X | SFT | 36.00 (+13.20) | 32.00 (+1.80) | 13.00 (-2.75) | 33.00 (+14.70) | 20.00 (-0.90) | 59.00 (+4.00) | 32.17 (+5.01) |
| MetaAgent-X | RL | **41.00 (+18.20)** | **38.00 (+7.80)** | 17.00 (+1.25) | 40.00 (+21.70) | **33.33 (+12.10)** | **61.00 (+6.00)** | **38.33 (+11.17)** |

Table 2: Qwen3-4B results on coding and math benchmarks. Parentheses denote the gain over the Single Agent (SA) baseline; the best result per benchmark is in bold.

| Training Paradigm | Method | LiveCodeBench | APPS | CodeContests | AIME24 | AIME25 | OlympiadBench | Avg |
|---|---|---|---|---|---|---|---|---|
| Single Agent | SA | 13.80 (+0.00) | 27.40 (+0.00) | 14.80 (+0.00) | 20.00 (+0.00) | 19.10 (+0.00) | 33.20 (+0.00) | 21.38 (+0.00) |
| Single Agent | SA + GRPO | 16.70 (+2.90) | 35.60 (+8.20) | **18.60 (+3.80)** | 29.10 (+9.10) | **26.67 (+7.57)** | 47.10 (+13.90) | 28.96 (+7.58) |
| Search-based Auto MAS | AFlow | 28.00 (+14.20) | 23.20 (-4.20) | 13.33 (-1.47) | 16.67 (-3.33) | 13.33 (-5.77) | 40.59 (+7.39) | 22.52 (+1.14) |
| Search-based Auto MAS | ADAS | 16.00 (+2.20) | 28.00 (+0.60) | 12.20 (-2.60) | 10.00 (-10.00) | 23.00 (+3.90) | 32.80 (-0.40) | 20.33 (-1.05) |
| RL-based Auto MAS | ScoreFlow | 23.36 (+9.56) | 24.50 (-2.90) | 11.92 (-2.88) | 26.40 (+6.40) | 16.70 (-2.40) | 57.00 (+23.80) | 26.65 (+5.27) |
| RL-based Auto MAS | MaAS | 24.29 (+10.49) | 23.75 (-3.65) | 9.10 (-5.70) | 16.70 (-3.30) | 25.00 (+5.90) | 45.20 (+12.00) | 24.01 (+2.62) |
| MetaAgent-X | SFT | 32.00 (+18.20) | 32.00 (+4.60) | 6.00 (-8.80) | 30.00 (+10.00) | 16.70 (-2.40) | 57.00 (+23.80) | 28.95 (+7.57) |
| MetaAgent-X | RL | **36.00 (+22.20)** | **36.70 (+9.30)** | 14.20 (-0.60) | **33.33 (+13.33)** | **26.67 (+7.57)** | **58.20 (+25.00)** | **34.18 (+12.80)** |
### 4.2 Main Results
Tables [1](https://arxiv.org/html/2605.14212#S4.T1) and [2](https://arxiv.org/html/2605.14212#S4.T2) report the performance of our cold-start and RL-trained models on six math and code benchmarks. Compared with the single-agent GRPO baseline, MetaAgent-X (RL) consistently achieves stronger performance across benchmarks. By introducing agent collaboration, the RL-based auto-MAS paradigm effectively overcomes the bottlenecks of isolated generation: MetaAgent-X (RL) reaches an average accuracy of 38.33% on Qwen3-8B and 34.18% on Qwen3-4B, yielding absolute gains of +11.17% and +12.80% over the Single Agent baseline, respectively.
Search-based auto-MAS baselines generally perform poorly when instantiated with Qwen3-4B and Qwen3-8B. Methods like AFlow and ADAS frequently cause performance degradation (e.g., ADAS drops by 6.81% to an average of 20.35% on the 8B model, and AFlow reaches only 22.52% on the 4B model). This significant drop indicates that search-based methods struggle to generalize across model scales and are overly reliant on the underlying base models.
In contrast, MetaAgent-X (RL) overcomes these limitations and surpasses all evaluated baselines. From MetaAgent-X (SFT) to MetaAgent-X (RL), our model improves by 6.17% on average, advancing from a suboptimal foundation to state-of-the-art average performance and demonstrating the effectiveness of our RL pipeline. Compared to methods that solely train a meta-agent or optimize workflow selection via RL (e.g., MaAS and ScoreFlow), MetaAgent-X (RL) breaks the performance ceiling of static executors, outperforming the strong MaAS baseline by +6.11% on average (38.33% vs. 32.22%) on the 8B model. Furthermore, compared to AFM-Coder, which shows severe performance imbalance and degrades heavily on math tasks, MetaAgent-X demonstrates strong cross-task generalization. Moreover, by explicitly adopting a multi-stage training paradigm, MetaAgent-X provides clearer and more targeted training signals, enabling effective co-evolution of the agents' collaborative capabilities across diverse domains.
### 4.3 Ablation Studies
To isolate the factors driving stable and scalable meta-agent optimization, we ablate two central components of our method: Executor-Designer Hierarchical Rollout and Stagewise Co-evolution. We also study the architectural design space by comparing shared- and separate-policy training.
#### Executor-Designer Hierarchical Rollout.

Table 3: Ablation of Executor-Designer Hierarchical Rollout settings.

| Rollout | AIME24 | AIME25 |
|---|---|---|
| $M{=}4$, $N{=}4$ | 40.0% | 33.3% |
| $M{=}8$, $N{=}1$ | 33.3% | 30.0% |

We compare different hierarchical rollout configurations. Our main experiments use $M{=}4$ and $N{=}4$, where each query samples four candidate designs and executes each design four times. We additionally evaluate a flatter rollout setting with $M{=}8$ and $N{=}1$, where more designs are sampled but each design is executed only once. As shown in Table [3](https://arxiv.org/html/2605.14212#S4.T3), the hierarchical setting achieves better performance, improving AIME24 from 33.3% to 40.0% and AIME25 from 30.0% to 33.3%. This suggests that repeated executions of each sampled design provide a more reliable estimate of downstream utility, leading to more stable credit assignment.
#### Does Stagewise Co-evolution Help?

Figure 3: Training-reward dynamics ablations of the proposed stagewise co-evolution.

We compare the proposed schedule on Qwen3-8B with three variants: coupled training, executor-only training, and designer-only training. In the coupled setting, trajectories from both roles update the shared policy simultaneously. As shown in Figure [3](https://arxiv.org/html/2605.14212#S4.F3), this variant improves quickly at first but later collapses; during evaluation, the model often repeats meaningless tokens until reaching the maximum length. Designer-only training brings limited improvement, suggesting that optimizing designs alone does not reliably improve MAS performance. Executor-only training improves correctness rapidly but soon saturates, indicating a ceiling imposed by the fixed design policy.

Table 4: Stagewise ablation.

| Variant | Math | Code |
|---|---|---|
| Coupled | 36.7% | 25.2% |
| Designer-only | 38.6% | 27.5% |
| Executor-only | 39.6% | 30.7% |
| Stagewise | 44.8% | 32.0% |

In contrast, stagewise training shows a clear staircase-shaped learning curve: reward remains relatively stable during designer phases and rises sharply after switching to executor phases. Table 4 further shows that stagewise training achieves the best performance on both math and code benchmarks. These results suggest that stagewise Designer-Executor optimization provides a more stable and effective training path.
#### Shared Policy vs. Separate Policy.

Table 5: Ablation of shared vs. separate policy.

| Variant | AIME24 | AIME25 |
|---|---|---|
| Shared | 40.0% | 33.3% |
| Separate | 33.3% | 26.7% |

We compare two policy parameterizations: a shared policy for both Designer and Executor, with role-specific prompts specifying their behaviors, and separate role-specific policies. As shown in Table [5](https://arxiv.org/html/2605.14212#S4.T5), the shared policy consistently outperforms separate policies on both AIME24 and AIME25. This suggests that Designer and Executor learning are not independent subtasks, but coupled components of the same meta-agent optimization problem. Sharing representations allows the training signal from one role to serve as an inductive bias for the other, improving generalization and data efficiency while reducing overfitting to role-specific trajectories.
### 4.4 Analysis
We provide a more detailed analysis of the experimental results in Appendix [E](https://arxiv.org/html/2605.14212#A5), including stage-length sensitivity, per-query design diversity, and end-to-end case studies. Here we summarize two main observations: RL changes both the *structures emitted by the Designer* and the *quality of Executor behavior*.
#### Per-task structure selection.

Table 6: Structure share selected by the RL designer.

| Benchmark | Single | Reflection | Ensemble |
|---|---|---|---|
| AIME 2024 | 18.9% | 70.0% | 11.1% |
| AIME 2025 | 15.6% | 73.3% | 11.1% |
| OlympiadBench | 46.4% | 44.8% | 8.8% |
| CodeContests | 26.7% | 62.4% | 10.9% |
| LiveCodeBench | 43.5% | 52.6% | 3.8% |
| APPS | 55.2% | 43.8% | 1.0% |

Table [6](https://arxiv.org/html/2605.14212#S4.T6) reports the top three structures most frequently generated by the RL-trained designer. The Single structure uses one agent to solve the problem directly. The Reflection structure uses one agent to generate an initial solution and another agent to provide refinements. The Ensemble structure runs agents with different roles in parallel and uses a judge agent to select or synthesize the final answer. Structure selection is clearly task-dependent. On harder math benchmarks such as AIME, the designer selects Reflection for more than 70% of problems, indicating a preference for iterative verification on challenging reasoning tasks. On relatively easier tasks such as OlympiadBench and APPS, it routes a larger fraction of problems to the Single structure. Ensemble is mainly selected for competition-style math and code tasks. These results suggest that MetaAgent-X learns to adapt the agent structure to task characteristics.
Table 7: SFT-to-RL case comparison: RL improves both MAS design and executor repair behavior.

| Example | SFT Model | RL Model |
|---|---|---|
| Math (better design) | ✗ Ensemble judge. All solvers share the wrong circle-packing model; the judge reports a contradiction but cannot repair it. | ✓ Solver-critic reflection. The critic localizes the geometry error; the solver switches to similar triangles. |
| Code (better execution) | ✗ Same reflection structure, but the executor keeps double-counting divisors after sample outputs are 2× too large. | ✓ Same reflection structure, but the executor uses tests to restore the one-count-per-divisor invariant. |
#### Which role brings the improvement?

To disentangle whether the designer or the executor is the primary driver of this success, we analyzed AIME25. Half of the improvements stem from the executor successfully solving the problem under the *same* structural pattern assigned by SFT, demonstrating clear execution-side capability gains; the remaining 50% occur when the designer switches to a more effective pattern. We include two qualitative examples to illustrate how RL changes model behavior after cold start. The math example highlights better *design* (choosing a repairable MAS structure), while the code example highlights better *execution* under the same solver-tester structure. Details are in Appendix [E](https://arxiv.org/html/2605.14212#A5).
## 5 Discussion

We introduced MetaAgent-X, the first end-to-end reinforcement learning framework that jointly optimizes the designer and executor of an automatic multi-agent system through hierarchical rollouts and stagewise co-evolution. Across six math and code benchmarks and two model scales, MetaAgent-X consistently surpasses both human-designed and existing automatic MAS baselines by up to 21.7%, while exposing the internal dynamics through which designer and executor mutually improve. Moreover, MetaAgent-X suggests a path toward foundation models with native multi-agent capabilities, where MAS becomes an internal mechanism for reasoning and context management rather than an external, human-designed harness. However, our experiments are constrained by computational resources, so we do not perform an exhaustive scaling study over larger backbone models or longer training budgets. Future work can examine how the proposed trainable automatic MAS framework scales with model size, task diversity, and rollout budget.
## References
- POLARIS: a post-training recipe for scaling reinforcement learning on advanced reasoning models. [Link](https://hkunlp.github.io/blog/2025/Polaris).
- D. Chen, S. Lin, M. Zeng, D. Zan, J. Wang, A. Cheshkov, J. Sun, H. Yu, G. Dong, A. Aliev, J. Wang, X. Cheng, G. Liang, Y. Ma, P. Bian, T. Xie, and Q. Wang (2024). CodeR: issue resolving with multi-agent and task graphs. arXiv:2406.01304.
- Y. Chen, Y. Wang, S. Zhu, H. Yu, T. Feng, M. Zhang, M. Patwary, and J. You (2025a). Multi-agent evolve: LLM self-improve through co-evolution. arXiv:2510.23595.
- Z. Chen, Z. Zhao, K. Zhang, B. Liu, Q. Qi, Y. Wu, T. Kalluri, S. Cao, Y. Xiong, H. Tong, H. Yao, H. Li, J. Zhu, X. Li, D. Song, B. Li, J. Weston, and D. Huynh (2025b). Scaling agent learning via experience synthesis. arXiv:2511.03773.
- M. Cheng, J. Ouyang, S. Yu, R. Yan, Y. Luo, Z. Liu, D. Wang, Q. Liu, and E. Chen (2025). Agent R1: training powerful LLM agents with end-to-end reinforcement learning. arXiv:2511.14460.
- Y. Dang, C. Qian, X. Luo, J. Fan, Z. Xie, R. Shi, W. Chen, C. Yang, X. Che, Y. Tian, X. Xiong, L. Han, Z. Liu, and M. Sun (2025). Multi-agent collaboration via evolving orchestration. arXiv:2505.19591.
- DeepMind (2024). CodeContests. GitHub repository: https://github.com/google-deepmind/code_contests (archived Dec 6, 2024).
- L. Feng, L. Zheng, S. He, F. Zhang, and B. An (2026). Dr. MAS: stable reinforcement learning for multi-agent LLM systems. arXiv:2602.08847.
- D. Fu, C. Yang, L. Wen, X. Yang, Y. Shen, Y. Wang, and B. Shi (2025). EvolveR: self-evolving LLM agents through an experience-driven lifecycle. arXiv:2510.16079.
- H. Gao, Y. Liu, Y. He, L. Dou, C. Du, Z. Deng, B. Hooi, M. Lin, and T. Pang (2025). FlowReasoner: reinforcing query-level meta-agents. arXiv:2504.15257.
- A. Ghafarollahi and M. J. Buehler (2024). SciAgents: automating scientific discovery through multi-agent intelligent graph reasoning. arXiv:2409.05556.
- C. He, R. Luo, Y. Bai, S. Hu, et al. (2024). OlympiadBench: a challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In ACL. arXiv:2402.14008.
- D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Zou, D. Song, and J. Steinhardt (2021). Measuring coding challenge competence with APPS. arXiv:2105.09938.
- P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote (2019). A survey of learning in multiagent environments: dealing with non-stationarity. arXiv:1707.09183.
- C. Ho, J. Gong, Y. Bai, C. Deng, H. Ren, and B. Khailany (2025). Marco: configurable graph-based task solving and multi-AI agents framework for hardware design. arXiv:2504.01962.
- S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber (2023). MetaGPT: meta programming for a multi-agent collaborative framework. arXiv:2308.00352.
- S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, et al. (2024). MetaGPT: meta programming for a multi-agent collaborative framework.
- S. Hu, C. Lu, and J. Clune (2025). Automated design of agentic systems. arXiv:2408.08435.
- N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2024). LiveCodeBench: holistic and contamination-free evaluation of large language models for code. arXiv:2403.07974.
- Y. Kim, C. Park, H. Jeong, Y. S. Chan, X. Xu, D. McDuff, H. Lee, M. Ghassemi, C. Breazeal, and H. W. Park (2024). MDAgents: an adaptive collaboration of LLMs for medical decision-making. arXiv:2404.15155.
- W. Li, J. Lin, Z. Jiang, J. Cao, X. Liu, J. Zhang, Z. Huang, Q. Chen, W. Sun, Q. Wang, H. Lu, T. Qin, C. Zhu, Y. Yao, S. Fan, X. Li, T. Wang, P. Liu, K. Zhu, H. Zhu, D. Shi, P. Wang, Y. Guan, X. Tang, M. Liu, Y. E. Jiang, J. Yang, J. Liu, G. Zhang, and W. Zhou (2025a). Chain-of-agents: end-to-end agent foundation models via multi-agent distillation and agentic RL. arXiv:2508.13167.
- Z. Li, H. Zhang, S. Han, S. Liu, J. Xie, Y. Zhang, Y. Choi, J. Zou, and P. Lu (2025b). In-the-flow agentic system optimization for effective planning and tool use. arXiv:2510.05592.
- J. Liao, M. Wen, J. Wang, and W. Zhang (2025). MARFT: multi-agent reinforcement fine-tuning. arXiv:2504.16129.
- Mathematical Association of America & AoPS Community (2024). AIME 2024 problems (AoPS wiki). https://artofproblemsolving.com/wiki/index.php/2024_AIME_I & https://artofproblemsolving.com/wiki/index.php/2024_AIME_II_Problems. Accessed 2025-09-11.
- Mathematical Association of America & AoPS Community (2025). AIME 2025 problems (AoPS wiki). https://artofproblemsolving.com/wiki/index.php/2025_AIME_I_Problems & https://artofproblemsolving.com/wiki/index.php/2025_AIME_II_Problems. Accessed 2025-09-11.
- H. Nekoei, A. Badrinaaraayanan, A. Sinha, M. Amini, J. Rajendran, A. Mahajan, and S. Chandar (2023). Dealing with non-stationarity in decentralized cooperative multi-agent deep reinforcement learning via multi-timescale learning. arXiv:2302.02792.
- S. Nielsen, E. Cetin, P. Schwendeman, Q. Sun, J. Xu, and Y. Tang (2025). Learning to orchestrate agents in natural language with the conductor. arXiv:2512.04388.
- C. Park, S. Han, X. Guo, A. Ozdaglar, K. Zhang, and J. Kim (2025). MAPoRL: multi-agent post-co-training for collaborative large language models with reinforcement learning. arXiv:2502.18439.
- C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, C. Yang, W. Chen, Y. Su, X. Cong, J. Xu, D. Li, Z. Liu, and M. Sun (2024). ChatDev: communicative agents for software development. In ACL 2024. arXiv:2307.07924.
- H. Su, R. Chen, S. Tang, Z. Yin, X. Zheng, J. Li, B. Qi, Q. Wu, H. Li, W. Ouyang, P. Torr, B. Zhou, and N. Dong (2024). Many heads are better than one: improved scientific idea generation by an LLM-based multi-agent system. arXiv:2410.09403.
- K. Wang, G. Zhang, M. Ye, X. Deng, D. Wang, X. Hu, J. Guo, Y. Liu, and Y. Guo (2025a). MAS2: self-generative, self-configuring, self-rectifying multi-agent systems. arXiv:2509.24323.
- Y. Wang, L. Yang, G. Li, M. Wang, and B. Aragam (2025b). ScoreFlow: mastering LLM agent workflows via score-based preference optimization. arXiv:2502.04306.
- Z. Wang, K. Wang, Q. Wang, P. Zhang, L. Li, Z. Yang, X. Jin, K. Yu, M. N. Nguyen, L. Liu, et al. (2025c). RAGEN: understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv:2504.20073.
- Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. Awadallah, R. W. White, D. Burger, and C. Wang (2023). AutoGen: enabling next-gen LLM applications via multi-agent conversation. arXiv:2308.08155.
- P. Xia, K. Zeng, J. Liu, C. Qin, F. Wu, Y. Zhou, C. Xiong, and H. Yao (2025). Agent0: unleashing self-evolving agents from zero data via tool-integrated reasoning. arXiv:2511.16043.
- Y. Xiao, E. Sun, D. Luo, and W. Wang (2024). TradingAgents: multi-agents LLM financial trading framework. arXiv:2412.20138.
- A. Yang and the Qwen Team (2025). Qwen3 technical report. arXiv:2505.09388.
- R. Ye, S. Tang, R. Ge, Y. Du, Z. Yin, S. Chen, and J. Shao (2025). MAS-GPT: training LLMs to build LLM-based multi-agent systems. arXiv:2503.03686.
- C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu (2022). The surprising effectiveness of PPO in cooperative multi-agent games. arXiv:2103.01955.
- Z. Yu, H. Zhang, Y. Zhao, H. Huang, M. Yao, K. Ding, and J. Zhao (2025). OrcaLoca: an LLM agent framework for software issue localization. arXiv:2502.00350.
- G. Zhang, L. Niu, J. Fang, K. Wang, L. Bai, and X. Wang (2025a). Multi-agent architecture search via agentic supernet. arXiv:2502.04180.
- J. Zhang, J. Xiang, Z. Yu, F. Teng, X. Chen, J. Chen, M. Zhuge, X. Cheng, S. Hong, J. Wang, B. Zheng, B. Liu, Y. Luo, and C. Wu (2024). AFlow: automating agentic workflow generation. arXiv:2410.10762.
- Y. Zhang, X. Liu, and C. Xiao (2025b). MetaAgent: automatically constructing multi-agent systems based on finite state machines. arXiv:2507.22606.
- Y. Zhang, R. Wang, J. Wang, Y. Tang, X. Zheng, H. Duan, H. Lu, H. Deng, and L. Lu (2026). EVA: efficient reinforcement learning for end-to-end video agent. arXiv:2603.22918.
- Y. Zhao, L. Hu, Y. Wang, M. Hou, H. Zhang, K. Ding, and J. Zhao (2026). Stronger-MAS: multi-agent reinforcement learning for collaborative LLMs. arXiv:2510.11062.
- Y. Zhao, H. Zhang, H. Huang, Z. Yu, and J. Zhao (2024). MAGE: a multi-agent engine for automated RTL code generation. arXiv:2412.07822.
- Y. Zhou, L. Song, and J. Shen (2025). MAM: modular multi-agent framework for multi-modal medical diagnosis via role-specialized collaboration. arXiv:2506.19835.
## Appendix A Cold Start Details
We cold-start the policy by distilling both sides of the generated multi-agent system: the *Designer*, which writes an executable workflow, and the *Executors*, which solve the problem inside the generated workflow. The Designer is prompted with a bank of workflow templates implemented in our codebase. The template bank contains single-agent, ensemble-voting, solver-critic reflection, and solver-tester patterns, among others. Each template specifies both the workflow topology and the role-level prompts; for example, ensemble templates instantiate strategy-diverse solvers and a judge, while reflection templates instantiate a solver and a critic/verifier loop.
For each training question, we sample in-context examples from this template bank and ask DeepSeek-V3.2 to synthesize a complete workflow program. The data-generation pipeline samples multiple workflow designs per question and logs both the Designer conversation and all Executor conversations produced when the workflow is run. We then retain trajectories whose final answer is judged correct, yielding 3K Designer examples and 8K Executor examples for supervised cold start. This gives the model an initial ability to map a problem to an appropriate multi-agent program and to act as the specialized agents inside that program.
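A minimal sketch of this rejection-filtering loop (all helper names here are hypothetical; the actual pipeline is not shown in this appendix):

```python
def build_cold_start_data(questions, sample_workflow, run_workflow, is_correct,
                          n_designs=4):
    """Rejection-filter designer/executor trajectories for SFT.

    Hypothetical sketch: `sample_workflow(q)` queries the teacher model
    with in-context template examples and returns a workflow program;
    `run_workflow` executes it and returns the Designer conversation,
    all Executor conversations, and the final answer.
    """
    designer_sft, executor_sft = [], []
    for q in questions:
        for _ in range(n_designs):
            design_conv, exec_convs, answer = run_workflow(sample_workflow(q), q)
            if is_correct(q, answer):  # keep only trajectories that solve q
                designer_sft.append(design_conv)
                executor_sft.extend(exec_convs)
    return designer_sft, executor_sft
```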
Before training our smaller policy, we also evaluate whether the same workflow prompting strategy helps a strong proprietary model. As shown in Table 8, prompting DeepSeek to solve through a generated multi-agent workflow improves AIME 2024 accuracy from 63.3% to 66.7%, which suggests that the workflow interface is not merely a crutch for weaker models: even when the underlying model is already strong, explicit role decomposition and verification can recover additional correct solutions. This motivates using DeepSeek-V3.2 as the teacher for cold-starting both the Designer and Executor behaviors before reinforcement learning.
| Method | Accuracy |
| --- | --- |
| Direct DeepSeek | 63.3% |
| DeepSeek prompted with MAS workflow | 66.7% |

Table 8: Prompting-only comparison on AIME 2024 using DeepSeek as the underlying model. The MAS prompt asks DeepSeek to first synthesize and run a multi-agent workflow rather than directly answer with a single response.
## Appendix B Reward Design
The outcome-based reward $R(e_{i,j}, d_i)$ is composed of two terms:

$$R(e_{i,j}, d_i) = R_{\text{correct}}(e_{i,j}) + \lambda \cdot R_{\text{format}}(e_{i,j}), \tag{10}$$

where $R_{\text{correct}}$ evaluates the functional correctness of the final solution via environment feedback, and $R_{\text{format}}$ incentivizes structured agent behavior. We set $\lambda = 0.4$.
The correctness reward $R_{\text{correct}} \in \{0, 1\}$ is a strict binary signal determined by the specific domain environment:
- **Math verification:** The final parsed answer is evaluated against the ground-truth solution. To account for algebraically equivalent expressions, we utilize a symbolic math engine (e.g., SymPy) to robustly verify the correctness of the final mathematical output; a minimal verification sketch follows this list.
- **Code execution:** The final generated program is compiled and executed against the dataset's hidden unit tests. The reward $R_{\text{correct}} = 1$ is assigned if and only if the code passes all unit tests without exceeding the environment's execution time or memory constraints.
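As a concrete illustration of the math check, a minimal SymPy-based equivalence test (a hypothetical helper; the paper's verifier is not released in this appendix) could look like:

```python
import sympy as sp

def answers_equivalent(predicted: str, ground_truth: str) -> bool:
    """Return True if two math expressions are symbolically equal.

    Sketch of a SymPy-based verifier: parse both expressions and check
    whether their difference simplifies to zero, so algebraically
    equivalent forms are accepted.
    """
    try:
        diff = sp.simplify(sp.sympify(predicted) - sp.sympify(ground_truth))
        return diff == 0
    except (sp.SympifyError, TypeError):
        return False  # unparseable answers count as incorrect

# e.g. "1/115" and "2/230" are algebraically equivalent
assert answers_equivalent("1/115", "2/230")
```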
The format reward $R_{\text{format}}$ acts as a regularizer and consists of two components (a combined reward sketch follows the list):
1. **Solution formatting.** The final agent must produce its answer within a standardized output format, ensuring that the solution is reliably parseable for automated evaluation.
2. **Delivery formatting.** Inter-agent messages must be strictly enclosed within `<delivery>…</delivery>` tags. This constraint serves a dual purpose: it establishes a structured, easily parsable communication protocol, and, crucially, it incentivizes agents to *distill* relevant information into concise deliverables rather than forwarding their entire reasoning trace. Without this constraint, agents tend to broadcast full outputs, inflating the context window without improving coordination quality.
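To make the composition in Eq. (10) concrete, here is a minimal sketch of the full reward. It assumes the boxed-answer convention shown in the case studies and an equal split between the two format components (the paper does not specify their relative weights); `messages` and `final_answer` are hypothetical inputs:

```python
import re

LAMBDA = 0.4  # weight on the format term, as in Eq. (10)

def format_reward(messages: list[str], final_answer: str) -> float:
    """Score structured behavior: boxed final answer + <delivery> tags.

    Hypothetical split of the two format components; we average them
    here since the paper does not state their relative weights.
    """
    solution_ok = 1.0 if re.search(r"\\boxed\{.+?\}", final_answer) else 0.0
    delivery_ok = 1.0 if all(
        re.fullmatch(r"(?s)\s*<delivery>.*</delivery>\s*", m) for m in messages
    ) else 0.0
    return 0.5 * (solution_ok + delivery_ok)

def reward(correct: bool, messages: list[str], final_answer: str) -> float:
    """Outcome-based reward R = R_correct + lambda * R_format."""
    return float(correct) + LAMBDA * format_reward(messages, final_answer)
```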
## Appendix C Result Analysis
#### Sensitivity Analysis on Stage Length.

Given that stagewise optimization is essential for stable designer–executor co-evolution, we further ablate how frequently the active role should be switched. Figure 4 compares three alternation intervals: 1-step, 10-step, and 30-step switching. Alternating the active role at every step leads to highly unstable training: neither role accumulates sufficient role-consistent gradient signal before being interrupted, and the training run collapses after approximately 150 steps. Increasing the interval to 10 or 30 steps substantially improves stability. Among them, the 30-step schedule achieves the highest final reward and exhibits the clearest upward trend. We therefore adopt 30-step alternation as the default setting in all main experiments.
Figure 4: Sensitivity analysis on the stage length for designer–executor alternation. One-step alternation is unstable and collapses during training, while longer stages provide more stable role-specific optimization. The 30-step schedule achieves the best final reward and is used as the default setting in our main experiments.
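Concretely, the alternation schedule reduces to a mapping from global training step to active role. A minimal sketch, assuming the designer stage comes first (the paper fixes only the interval, not the order):

```python
def active_role(step: int, stage_len: int = 30) -> str:
    """Map a global training step to the role being optimized.

    With stage_len=30 (the default adopted in the main experiments),
    steps 0-29 train one role, steps 30-59 the other, and so on.
    stage_len=1 reproduces the unstable per-step alternation from the
    ablation. Which role goes first is an assumption of this sketch.
    """
    return "designer" if (step // stage_len) % 2 == 0 else "executor"
```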
#### Per-query design diversity.

Beyond structural variety, the designer also synthesizes task-specific agents and roles. Across all the workflows the designer emits in evaluation, it instantiates a vocabulary of 54 distinct role names; system prompts are even more diverse: 77.5% of agents receive a byte-unique, task-specific prompt. Each workflow is therefore a freshly synthesized program tailored to the problem at hand, not an instantiation of a pre-defined template; the structural taxonomy in Table 6 is a coarse skeleton, while the body of every workflow is per-query content.
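A minimal measurement sketch for these two statistics (hypothetical data layout: each workflow is a list of `(role_name, system_prompt)` pairs):

```python
import hashlib
from collections import Counter

def diversity_stats(workflows):
    """Role-name vocabulary size and share of byte-unique prompts."""
    roles = Counter()
    prompt_hashes = Counter()
    total = 0
    for wf in workflows:
        for role, prompt in wf:
            roles[role] += 1
            prompt_hashes[hashlib.sha256(prompt.encode()).hexdigest()] += 1
            total += 1
    # A prompt is "byte-unique" if its exact bytes occur exactly once.
    unique_share = (
        sum(1 for c in prompt_hashes.values() if c == 1) / total if total else 0.0
    )
    return len(roles), unique_share  # the paper reports 54 roles, 77.5%
```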
#### Which role brings the improvement?
RL changes *both* (i) which structure the designer selects and (ii) how well the executor performs under it. The designer's routing distribution shifts in the same direction across math benchmarks (Table 10): ensemble+judge loses roughly 30 percentage points on every benchmark, and that share is redirected mostly to reflection, with a smaller push toward single. To disentangle the two effects, we examine the same 30 AIME 2025 problems solved by the SFT and RL models. Of the problems RL solves but SFT does not, 50% use the *same* pattern as SFT, indicating an executor-side improvement; the other 50% are produced by a designer flip to a different (and in those cases simpler) pattern, indicating that the SFT→RL gain also benefits from designer-side changes.
Table 9: Top-10 role names the designer emits across 2,574 workflows: CodeSolver, AlgebraicSolver, MathSolver, BruteForceSolver, UnitTestAgent, CombinatorialSolver, MathCritic, EdgeCaseSolver, MathJudge, OptimalSolver.
| Benchmark | single (SFT) | single (RL) | refl. (SFT) | refl. (RL) | ens. (SFT) | ens. (RL) |
| --- | --- | --- | --- | --- | --- | --- |
| AIME 2024 | 16.7 | 18.9 | 38.9 | 70.0 | 44.4 | 11.1 |
| AIME 2025 | 7.8 | 15.6 | 41.1 | 73.3 | 51.1 | 11.1 |
| OlympiadBench | 42.3 | 46.4 | 21.1 | 44.8 | 36.6 | 8.8 |

Table 10: Pattern share (%) on math benchmarks: SFT cold-start vs. RL.
## Appendix D Baseline Details
We describe the baseline implementations used in our experiments. Unless otherwise stated, all baselines use the same execution backbone as our method, instantiated with Qwen3 4B or Qwen3 8B according to the corresponding experimental setting. All reported results are evaluated on the same final test split and metric as our method. For baselines that require search or training, we follow the default protocol of the original paper or released code, while matching the rollout budget of our method whenever the method exposes the corresponding parameter.
### D.1 Search-Based MAS Optimization Baselines
#### AFlow.

AFlow [Zhang et al., 2024] searches over code-represented agentic workflows with MCTS. We follow the official paper and code settings. Specifically, AFlow uses `sample=4`, `initial_round=1`, `max_rounds=20`, `validation_rounds=5`, and early stopping enabled. The `sample=4` setting matches the number of candidate MAS designs generated by our Designer for each query. During workflow evaluation, we execute each candidate workflow $N=4$ times when stochastic execution is supported, matching our execution budget. We keep the original domain-specific operator sets: `Custom`, `ScEnsemble`, and `Programmer` for math, and `Custom`, `CustomCodeGenerate`, `ScEnsemble`, and `Test` for code. For each task domain, we use the best searched workflow reported or released by AFlow and evaluate it on our held-out test split.
#### ADAS.

ADAS [Hu et al., 2025] performs Meta Agent Search, where a meta agent writes executable Python `forward` functions and adds evaluated agents to an archive. We follow the official split and search protocol. For MGSM and related reasoning domains, the official implementation shuffles all examples with seed 0, uses 128 examples for search validation, and uses the next 800 examples for final testing. For GPQA Diamond, it uses 32 validation examples and the remaining 166 examples for testing. For ARC, it uses 20 validation tasks and 60 test tasks, with five repeated evaluations to reduce stochastic variance.

The default MGSM search uses `n_generation=30`, `n_repeat=1`, `max_workers=48`, and at most three debugging attempts for invalid generated code. To align with our training budget, when we run ADAS search on a new benchmark without an official task-specific searched agent, each generation evaluates $M=4$ newly proposed candidate agents when supported by the search implementation, and each candidate agent is executed $N=4$ times for reward estimation (see the sketch below). The candidate agent executor follows the same Qwen3 4B or Qwen3 8B backbone as our method. The reflection call uses temperature 0.8. The initial archive follows the released implementation and contains self-consistency with Chain-of-Thought, Self-Refine, LLM Debate, Step-Back Abstraction, Quality Diversity, and Role Assignment. For math, we use the best searched MGSM agent reported by ADAS, Dynamic Role Playing Architecture, and keep its original role routing and answer aggregation. For code, the official ADAS repository does not release a task-specific searched code agent; therefore, we run Meta Agent Search on the corresponding optimization split with the same default ADAS search budget and select the best validation agent for held-out test evaluation.
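The budget-matching rule shared by the search baselines is simple to state in code; a minimal sketch with a hypothetical `run_once` rollout function:

```python
def estimate_candidate_rewards(candidates, run_once, n_rollouts=4):
    """Estimate each candidate agent's reward by averaging N executions.

    Sketch of the M x N evaluation budget used to align baselines with
    our method: with M=4 candidates per step and N=4 stochastic
    rollouts per candidate, this matches our 4x4 rollout budget.
    `run_once(candidate)` is assumed to return a scalar reward.
    """
    return [
        sum(run_once(c) for _ in range(n_rollouts)) / n_rollouts
        for c in candidates
    ]
```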
### D.2 Semi-Learning-Based MAS Optimization Baselines
#### ScoreFlow.

ScoreFlow [Wang et al., 2025b] trains a workflow generator with Score-DPO. We follow the released training and inference pipeline, while aligning the candidate and execution budget with our setting. For each optimization step, ScoreFlow generates $M=4$ candidate workflows per query and executes each candidate workflow $N=4$ times to estimate its score. The resulting workflow scores are used to construct preference pairs for Score-DPO. The held-out test split is used only for final inference. For inference, we use the trained checkpoint selected by the original validation protocol.

We use the same stage length and learning rate as our method whenever ScoreFlow updates trainable parameters. Specifically, optimization is organized into stages of length $K=30$, and the workflow generator is updated with learning rate $5\times 10^{-6}$. Only the trainable generator parameters are updated; executor parameters are frozen. Score-DPO is implemented with LoRA, with rank 8, alpha 16, dropout 0.01, target modules `q_proj` and `v_proj`, and no bias. The generator uses temperature 0.2, top-$p$ 0.95, and maximum generation length 1000. The executor uses temperature 0.0. The vLLM setting uses bfloat16, GPU memory utilization 0.9, and maximum model length 10000. Each optimization round trains for 1 epoch with per-device train batch size 1 and per-device evaluation batch size 1. Logging is performed every 10 steps, and LoRA weights are merged into the generator checkpoint after each epoch.
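For reference, the stated adapter configuration maps one-to-one onto a standard LoRA setup; a minimal sketch assuming the Hugging Face `peft` library:

```python
from peft import LoraConfig

# LoRA hyperparameters as stated above: rank 8, alpha 16, dropout 0.01,
# adapters on the query/value projections only, and no bias terms.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.01,
    target_modules=["q_proj", "v_proj"],
    bias="none",
)
```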
#### MaAS.

MaAS [Zhang et al., 2025a] trains an agentic supernet and samples query-dependent architectures from it. We follow the official training and evaluation protocol. Each benchmark is split into train and test sets with a 1:4 ratio. The training split is used to optimize the controller distribution and agentic operators, while the held-out test split is used only for final evaluation. We match the architecture sampling and training budget to our method where possible. For each query, MaAS samples $M=4$ candidate architectures from the agentic supernet and executes each sampled architecture $N=4$ times for reward estimation. We use a stage length of $K=30$ and update the trainable controller and architecture parameters with learning rate $5\times 10^{-6}$. The execution backbone remains fixed. We keep the official architectural defaults: maximum supernet depth $L=4$, sampling times $K=4$ in the original MaAS notation, early-exit threshold 0.3, and cost penalty coefficient selected from $\{10^{-3}, 5\times 10^{-3}, 10^{-2}\}$. In the released command, optimization is run with `sample=4`; the same command is then rerun with `is_test=True` for held-out evaluation. We use the best validation setting selected by the original protocol and report its held-out test performance.
#### Agent Foundation Model.

Agent Foundation Model (AFM) [Li et al., 2025a] is an end-to-end agent model trained under the Chain-of-Agents paradigm. Instead of explicitly instantiating an external multi-agent workflow at test time, AFM internalizes multi-agent collaboration into a single model through multi-agent distillation and agentic reinforcement learning. This makes AFM different from our domain-adaptive automatic MAS setting: AFM is a released agent foundation model trained with its own data and backbone, while our method learns to construct and execute task-specific MAS using the Qwen3 4B or Qwen3 8B backbone. Nevertheless, AFM is a relevant baseline because it represents a strong end-to-end alternative to explicit MAS optimization.

We evaluate the officially released size-comparable code-agent checkpoint, `AFM-CodeAgent-7B-rl`. The released AFM model card also lists larger `AFM-CodeAgent-32B` checkpoints, but we use the 7B checkpoint to keep the comparison closer to our 8B experimental scale. We follow the official AFM code-agent evaluation framework, using the official evaluation script with the released default parameters. The maximum prompt length is set to 4096 tokens, and the maximum response length is set to 28672 tokens. The rollout uses `n=8` samples, `val_kwargs.temperature=0.6`, and multi-turn tool use with at most 12 turns.
## Appendix E Case Studies
### E.1 RL Model Examples
We present three end-to-end trajectories of our system, illustrating how the *designer* chooses a team structure for each question and how the selected executors collaborate to reach the final answer. The cases span both math and code domains and showcase three distinct team structures: single agent, ensemble + judge, and reflection with a separate critic. The designer and executor boxes contain *verbatim* model output. Only chain-of-thought stretches that do not affect the exposition are elided, marked by the literal symbol "…".
Case 1: Probability via single-agent reasoning

Domain: Math (AIME 2024 #2)

Question: Jen enters a lottery by picking 4 distinct numbers from $S=\{1,2,\dots,10\}$. 4 numbers are randomly chosen from $S$. She wins a prize if at least two of her numbers were among the 4 chosen, and the *grand prize* if all four match. The probability of winning the grand prize given that she won a prize is $\tfrac{m}{n}$ in lowest terms. Find $m+n$.
**Designer (raw output)**

Problem Type: MATH

Problem Analysis: This is a probability problem involving conditional probability.

- Jen picks 4 distinct numbers from $S=\{1,2,\dots,10\}$.
- 4 numbers are randomly chosen from $S$ (presumably without replacement, since they are distinct).
- She wins a prize if at least 2 of her numbers match the 4 chosen numbers.
- She wins the grand prize if all 4 of her numbers match the 4 chosen numbers.
- We are asked: given that she won a prize, what is the probability she won the grand prize?
- The answer is $\frac{m}{n}$ in lowest terms, and we need $m+n$.

Mathematically: let $A$ = event she wins the grand prize (all 4 match), and $B$ = event she wins a prize (at least 2 match). We need $P(A\mid B)=\frac{P(A)}{P(B)}$. We can compute:

- Total ways to choose 4 numbers from 10: $\binom{10}{4}$.
- If Jen's numbers are fixed (say $\{a,b,c,d\}$), then $P(A)$ = probability the 4 chosen numbers are exactly Jen's 4 numbers: $\frac{1}{\binom{10}{4}}$. $P(B)$ = probability at least 2 of the 4 chosen numbers are in Jen's set, which can be computed as $1-P(\text{0 or 1 match})$.

We can also think: given that she won a prize, the 4 chosen numbers are a 4-element subset of $S$ that intersects her set in at least 2 elements, and we want the probability that the intersection is exactly 4 (i.e., the chosen set equals her set). This is a finite combinatorial problem, not requiring calculus or advanced probability theory.

Workflow Pattern: math_single_agent

Reason: The problem is straightforward combinatorial probability. A single expert mathematician can solve it step-by-step without needing multiple perspectives or iterative refinement.

Agent Design: One MathSolver agent with a system prompt that emphasizes:

- Understanding the problem (conditional probability, combinatorial counting).
- Computing $P(A)$ and $P(B)$ using combinatorial reasoning.
- Simplifying the fraction and finding $m+n$.
- Outputting the final answer in the required boxed format.

Workflow Flow: The single agent reads the problem, reasons through it, and outputs the final answer.

```python
# Math single-agent solver for the lottery probability problem
from workflow import AgentNode, Workflow, ToolRegistry

# No tools needed - pure reasoning
tool_registry = ToolRegistry()

# Create a math solver agent
math_agent = AgentNode(
    name="MathSolver",
    system_prompt=(
        "You are an expert mathematician who solves problems through careful reasoning.\n\n"
        "APPROACH:\n"
        "1. Read the problem carefully and identify what is being asked\n"
        "2. Break down the problem into smaller steps\n"
        "3. Apply relevant mathematical concepts and formulas\n"
        "4. Show your work step by step\n"
        "5. Verify your answer makes sense\n\n"
        "For this specific problem:\n"
        "- Jen picks 4 distinct numbers from S = {1,2,...,10}.\n"
        "- 4 numbers are randomly chosen from S (without replacement).\n"
        "- She wins a prize if at least 2 of her numbers match the 4 chosen numbers.\n"
        "- She wins the grand prize if all 4 of her numbers match the 4 chosen numbers.\n"
        "- We need: P(grand prize | prize) = P(grand prize) / P(prize).\n"
        "- Compute using combinatorial counting.\n\n"
        "OUTPUT FORMAT:\n"
        "- Show your complete reasoning process\n"
        "- Put your final numerical answer in \\boxed{} format\n"
        "- Example: The answer is \\boxed{42}\n\n"
        "IMPORTANT: Your final answer MUST be in \\boxed{answer} format."
    ),
    tool_registry=tool_registry,
    max_turns=1,
)

# Create workflow
workflow = Workflow(name="math_single_solver")
workflow.add_node(math_agent)

# Define the question
question = (
    "Jen enters a lottery by picking 4 distinct numbers from S = {1,2,3,...,9,10}. "
    "4 numbers are randomly chosen from S. She wins a prize if at least two of her numbers "
    "were 2 of the randomly chosen numbers, and wins the grand prize if all four of her numbers "
    "were the randomly chosen numbers. The probability of her winning the grand prize given that "
    "she won a prize is m/n where m and n are relatively prime positive integers. Find m+n."
)

# Run workflow
print("================================================")
print("FINAL ANSWER:")
result = workflow.run(question)
print(result.content)
print("================================================")
```
**Executor: MathSolver**

Problem Type: MATH

Problem Analysis: This is a probability problem involving combinatorial counting. We need to find $P(\text{grand prize}\mid\text{prize}) = P(\text{grand prize})/P(\text{prize})$, where the grand prize means all 4 numbers match and a prize means at least 2 numbers match. We compute using combinatorial counting from the set $S=\{1,2,\dots,10\}$, then express the result as $m/n$ in lowest terms and find $m+n$.

Approach:

1. Compute the total number of ways to choose 4 numbers from 10: $C(10,4)$.
2. Compute $P(\text{grand prize})$: the number of ways to match all 4 numbers.
3. Compute $P(\text{prize})$: the number of ways to match exactly 2, 3, or 4 numbers.
4. Use the conditional probability formula $P(\text{grand prize}\mid\text{prize}) = P(\text{grand prize})/P(\text{prize})$.
5. Simplify to lowest terms and find $m+n$.

Solution, step by step:

1. Total number of ways to choose 4 numbers from 10: $C(10,4) = 10!/(4!\cdot 6!) = (10\times 9\times 8\times 7)/(4\times 3\times 2\times 1) = 210$.
2. $P(\text{grand prize})$: there is only 1 way to choose exactly Jen's 4 numbers, so $P(\text{grand prize}) = 1/210$.
3. $P(\text{prize})$: at least 2 matches means exactly 2, 3, or 4 matches. Exactly 4 matches: 1 way (already counted). Exactly 3 matches: choose 3 of Jen's numbers and 1 from the remaining 6, $C(4,3)\cdot C(6,1) = 4\cdot 6 = 24$. Exactly 2 matches: choose 2 of Jen's numbers and 2 from the remaining 6, $C(4,2)\cdot C(6,2) = 6\cdot 15 = 90$. Total ways for a prize $= 1+24+90 = 115$, so $P(\text{prize}) = 115/210$.
4. $P(\text{grand prize}\mid\text{prize}) = (1/210)/(115/210) = 1/115$.
5. $1/115$ is already in lowest terms ($1$ and $115$ are coprime), so $m=1$, $n=115$, and $m+n = 1+115 = 116$.

Approach: Compute probabilities using combinatorial counting of ways to match numbers.

Answer: $\boxed{116}$ Confidence: High
Final Answer: $\boxed{116}$; ground truth: 116 ✓
Case 2: Disagreement resolved by an ensemble + judge

Domain: Math (AIME 2024 #6)

Question: Find the number of ways to place a digit in each cell of a $2\times 3$ grid so that the sum of the two numbers formed by reading left-to-right is 999 and the sum of the three numbers formed by reading top-to-bottom is 99. (Example: top row 0, 0, 8 and bottom row 9, 9, 1, since $8+991=999$ and $9+9+81=99$.)
**Designer (raw output)**

Problem Type: MATH

Problem Analysis: We have a $2\times 3$ grid of digits (0–9).

- The two numbers formed by reading left to right (top row and bottom row) must sum to 999.
- The three numbers formed by reading top to bottom (first, second, and third columns) must sum to 99.

We need to count the number of valid digit placements. Let the grid be $a, b, c$ in the top row and $d, e, f$ in the bottom row. Then:

1. Top row number: $100a+10b+c$
2. Bottom row number: $100d+10e+f$
3. Condition: $(100a+10b+c)+(100d+10e+f)=999$
4. First column number: $10a+d$
5. Second column number: $10b+e$
6. Third column number: $10c+f$
7. Condition: $(10a+d)+(10b+e)+(10c+f)=99$

We need to count integer solutions with $0\leq a,b,c,d,e,f\leq 9$. This is a combinatorial counting problem with linear constraints.

Workflow Pattern: math_ensemble_judge

Reason: The problem involves constraints that can be approached algebraically, combinatorially, or by case analysis. Multiple perspectives can help verify correctness, and a judge can select the most reliable answer.

Agent Design:

- AlgebraicSolver: focus on solving the system of equations.
- GeometricSolver: think in terms of digit constraints and possible ranges.
- CombinatorialSolver: use counting arguments or case analysis.
- MathJudge: compare the three solutions and pick the best.

Workflow Flow:

1. Each solver independently solves the problem.
2. The judge receives all three solutions.
3. The judge evaluates correctness and selects one answer.
4. The judge outputs the final answer in \boxed{} format.

```python
from workflow import AgentNode, Workflow, ToolRegistry
from workflow.nodes import EnsembleNode

# No tools needed - pure reasoning
tool_registry = ToolRegistry()

# Create three math solvers with different approaches
solver1 = AgentNode(
    name="AlgebraicSolver",
    system_prompt=(
        "You are a mathematician who prefers algebraic approaches.\n"
        "Solve the problem using algebraic manipulation and equations.\n"
        "Show your reasoning step by step.\n"
        "Put your final answer in \\boxed{} format.\n\n"
        "Problem: Find the number of ways to place a digit in each cell of a 2x3 grid so that "
        "the sum of the two numbers formed by reading left to right is 999, "
        "and the sum of the three numbers formed by reading top to bottom is 99.\n"
        "Let the grid be:\n"
        "\\[\\begin{array}{|c|c|c|} \\hline a & b & c \\\\ \\hline d & e & f \\\\ \\hline \\end{array}\\]\n"
        "Then:\n"
        "1. Top row number: 100a + 10b + c\n"
        "2. Bottom row number: 100d + 10e + f\n"
        "3. Condition: (100a + 10b + c) + (100d + 10e + f) = 999\n"
        "4. First column number: 10a + d\n"
        "5. Second column number: 10b + e\n"
        "6. Third column number: 10c + f\n"
        "7. Condition: (10a + d) + (10b + e) + (10c + f) = 99\n"
        "Count integer solutions with 0 <= a,b,c,d,e,f <= 9."
    ),
    tool_registry=tool_registry,
    max_turns=1,
)

solver2 = AgentNode(
    name="GeometricSolver",
    system_prompt=(
        "You are a mathematician who thinks geometrically and visually.\n"
        "Solve the problem using geometric intuition or visual reasoning.\n"
        # ... (same problem statement as solver1) ...
    ),
    tool_registry=tool_registry,
    max_turns=1,
)

solver3 = AgentNode(
    name="CombinatorialSolver",
    system_prompt=(
        "You are a mathematician skilled in counting and combinatorics.\n"
        "Solve the problem using combinatorial arguments or case analysis.\n"
        # ... (same problem statement as solver1) ...
    ),
    tool_registry=tool_registry,
    max_turns=1,
)

# Create judge agent to select the best answer
judge_agent = AgentNode(
    name="MathJudge",
    system_prompt=(
        "You are a senior mathematician acting as a judge.\n\n"
        "You will receive solutions from three different solvers.\n"
        "Your task is to:\n"
        "1. Review each solution carefully\n"
        "2. Check the reasoning and calculations\n"
        "3. Identify which solution is correct (or most likely correct)\n"
        "4. Select ONE answer as the final answer\n\n"
        "OUTPUT FORMAT:\n"
        "- Briefly explain which solver's answer you chose and why\n"
        "- Put the selected final answer in \\boxed{} format\n\n"
        "IMPORTANT: You must select ONE answer and output it in \\boxed{answer} format.\n\n"
        # ... (same problem statement as solver1) ...
    ),
    tool_registry=tool_registry,
    max_turns=1,
)

# Create ensemble with judge selection
ensemble = EnsembleNode(
    name="MathEnsemble",
    agents=[solver1, solver2, solver3],
    strategy="consensus",
    consensus_agent=judge_agent,
)

# Create workflow
workflow = Workflow(name="math_ensemble_judge")
workflow.add_node(ensemble)
```

(In the actual file, the three solver `system_prompt`s contain the same problem statement verbatim; collapsed here as "… same problem statement …" to save space.)
**Executor: Solver 1 – AlgebraicSolver ✗**

Problem Type: MATH

Approach: We have:

(1) $100a+10b+c+100d+10e+f=999$
(2) $10a+d+10b+e+10c+f=99$

We can treat these as two equations in six variables, but we can also think of them as constraints on the digits. Let's denote: Equation (1): $100(a+d)+10(b+e)+(c+f)=999$. Equation (2): $10(a+c)+(d+e)+(b+f)=99$.

…the solver continues from this rewriting of (2)…

So $X=a+d=9$, $Y=b+e=9$, $Z=c+f=9$. We also have the second equation: $10(a+c)+(d+e)+(b+f)=99$. We can check if this is automatically satisfied given $a+d=9$, $b+e=9$, $c+f=9$. Let's verify: from $a+d=9$, $d=9-a$; from $b+e=9$, $e=9-b$; from $c+f=9$, $f=9-c$. Then $a+c$ is just $a+c$, $d+e=(9-a)+(9-b)=18-a-b$, and $b+f=b+(9-c)=9+b-c$. So

$$10(a+c)+(18-a-b)+(9+b-c)\;=\;9a+9c+27.$$

We want this to equal 99: $9a+9c=72$, so $a+c=8$. Possible $(a,c)$: $(0,8),(1,7),\dots,(8,0)$, giving 9 possibilities. For each $b$ from 0 to 9, $e=9-b$, giving 10 possibilities. Thus the total is $9\times 10=90$.

Answer: $\boxed{90}$ Confidence: High

Annotation: this solver miscopied equation (2) as $10(a+c)+(d+e)+(b+f)=99$ instead of the actual $10(a+b+c)+(d+e+f)=99$; the $b$ coefficient is lost. This propagates to the wrong count of 90.
**Executor: Solver 2 – GeometricSolver ✓**

Problem Type: MATH

Geometric/Visual Reasoning Approach: Let $A=a+d$, $B=b+e$, $C=c+f$. Then (1) becomes $100A+10B+C=999$. Since $A,B,C\in[0,18]$, we get $A=9$, $B=9$, $C=9$, hence

$$a+d=9,\quad b+e=9,\quad c+f=9.$$

Now from (2): $10(a+b+c)+(d+e+f)=99$. But $d+e+f=(9-a)+(9-b)+(9-c)=27-(a+b+c)$. Substitute:

$$10(a+b+c)+27-(a+b+c)=99\;\Longrightarrow\;9(a+b+c)=72\;\Longrightarrow\;a+b+c=8.$$

That is a stars-and-bars count of nonnegative integer solutions to $a+b+c=8$ with $0\leq a,b,c\leq 9$. Since $8\leq 9$, the upper bound is not restrictive, so the number is $\binom{8+3-1}{3-1}=\binom{10}{2}=45$.

Answer: $\boxed{45}$ Confidence: High
**Executor: Solver 3 – CombinatorialSolver (no clean answer)**

Problem Type: MATH

Problem Analysis: We have a $2\times 3$ grid with digits $a,b,c$ in the top row and $d,e,f$ in the bottom row, each from 0 to 9. Conditions:

(1) $(100a+10b+c)+(100d+10e+f)=999$
(2) $(10a+d)+(10b+e)+(10c+f)=99$

We need to count integer solutions $(a,b,c,d,e,f)$ with $0\leq a,b,c,d,e,f\leq 9$. This is a combinatorial counting problem with constraints. We can approach it by interpreting the constraints as digit-wise equations (mod 10 and carry-over), using case analysis on possible carries, or solving the system algebraically and counting integer solutions.

Approach: we will treat the two conditions as digit-wise constraints with possible carries. […]

…long case analysis on $c+f\in\{9,19\}$, $b+e\in\{9,19\}$, $a+d\in\{9,19\}$ derives mutually inconsistent constraints…

So $S_2=9$, $S_3=9$. But $S_2=b+e=9$, $S_3=c+f=9$. Then $a+b+c=9$ and $a+b+c=18$. So we must have a carry from the tens place? Let's think: we have $10S_2+S_3=99$. So $S_2=9$, $S_3=9$. But $S_2=b+e=9$, $S_3=c+f=9$. Then $a+b+c=9$ and $a+b+c=\dots$

…the model loops on this passage until the token budget is exhausted; no \boxed{} answer is emitted…
**Judge**

Key Insight: We can derive constraints from the equations, define $X=a+d$, $Y=b+e$, $Z=c+f$, solve $100X+10Y+Z=999$ to get $X=9$, $Y=9$, $Z=9$, then use the column equation $10(a+b+c)+(d+e+f)=99$ together with $d+e+f=27-(a+b+c)$ to get $a+b+c=8$. Counting nonnegative integer solutions to $a+b+c=8$ with the unrestrictive bound $a,b,c\leq 9$: $\binom{8+3-1}{3-1}=\binom{10}{2}=45$.

Reason for selection: Solution 2 (GeometricSolver) correctly derived the constraints and counted the solutions. Solution 1 (AlgebraicSolver) arrives at 90, but that is incorrect because it misinterprets the second condition. Solution 3 (CombinatorialSolver) has a flawed reasoning chain with repeated contradictions.

Answer: $\boxed{45}$ Confidence: High
Final Answer: $\boxed{45}$; ground truth: 45 ✓. The judge correctly overrides Solver 1's mistake and the unfinished Solver 3.
Case 3: Array eversion via reflection (solver + critic)

Domain: Code (CodeContests #46)

Question: Array eversion. You are given an array $a$ of length $n$. An *eversion* takes the last element $x=a_n$, partitions $a$ into a left part (elements $\leq x$, in order) and a right part (elements $>x$, in order), and replaces $a$ with their concatenation. Output the minimum number of eversions $k$ such that the array stops changing after $k$ eversions. *Example:* $[2,4,1,5,3]\to[2,1,3,4,5]$, then unchanged: $k=1$. Constraints: $1\leq n\leq 2\cdot 10^5$, $1\leq a_i\leq 10^9$.
**Designer (raw output)**

Problem Type: CODE

Problem Analysis: This problem asks us to find the minimum number of "eversions" needed for an array to stop changing. An eversion operation:

1. Takes the last element x = a_n
2. Partitions the array into two stable parts: elements $\leq x$ (left) and elements $>x$ (right)
3. Concatenates left + right

The process stops when an eversion doesn't change the array. We need to compute the number of eversions until this happens.

Key observations:

- Each eversion moves the last element to the end of the left partition (all elements $\leq$ it)
- The array eventually becomes sorted in non-decreasing order
- The process stops when the array is already sorted (since then the last element is the maximum, and all elements are $\leq$ it)
- We need to count how many eversions it takes to reach a sorted array

Approach: We can simulate the process, but for $n$ up to $2\times 10^5$ we need an efficient algorithm. Let's think about what happens:

- The last element in the current array is moved to the end of the left partition
- This is equivalent to: find the last element that is $\leq$ the current last element, and move it to the end
- Actually, the eversion operation is: take the last element x, move all elements $\leq x$ to the front (preserving order), then append the remaining elements

We can think of it as: after each eversion, the array becomes [all elements $\leq x$] + [all elements $>x$] where x is the original last element. We need to find how many such operations until the array is sorted.

Workflow Pattern: code_solver_unittest

This problem has subtle behaviour (stable partition, counting eversions). A single solver might miss edge cases. The reflection pattern with a UnitTestAgent will ensure thorough testing.

Agent Design:

1. CodeSolver: implements the solution, tests with examples
2. UnitTestAgent: writes comprehensive tests including edge cases

Workflow Flow:

1. CodeSolver writes initial solution
2. UnitTestAgent tests it thoroughly
3. CodeSolver refines based on feedback
4. UnitTestAgent verifies again
5. Final solution output

```python
from workflow import AgentNode, Workflow, ToolRegistry
from workflow.nodes import ReflectionNode
import subprocess

# Set up code interpreter tool
tool_registry = ToolRegistry()

def execute_code(code: str) -> str:
    """Execute Python code and return stdout/stderr."""
    try:
        result = subprocess.run(
            ["python", "-c", code],
            capture_output=True, text=True, timeout=30
        )
        output = ""
        if result.stdout:
            output += f"stdout:\n{result.stdout}\n"
        if result.stderr:
            output += f"stderr:\n{result.stderr}\n"
        return output or "(no output)"
    except subprocess.TimeoutExpired:
        return "Error: Code execution timed out (30s limit)"

tool_registry.register(
    name="execute_code",
    func=execute_code,
    description="Execute Python code and return stdout/stderr",
    parameters={
        "type": "object",
        "properties": {
            "code": {"type": "string", "description": "Python code to execute"}
        },
        "required": ["code"]
    }
)

# Solver agent writes the solution
solver_agent = AgentNode(
    name="CodeSolver",
    system_prompt=(
        "You are an expert programmer. Solve the coding problem step by step.\n\n"
        "PROBLEM: Given an array a of length n, perform eversions until the array stops changing. "
        "An eversion: take last element x, partition array into elements <= x (left) and > x (right), "
        "concatenate left + right. Find minimum k such that after k eversions the array stops changing.\n\n"
        "You have access to an `execute_code` tool. Use it to test your solution:\n"
        '<tool_call>{"name": "execute_code", "arguments": {"code": "your_python_code_here"}}</tool_call>\n\n'
        "WORKFLOW:\n"
        "1. Understand the problem and constraints\n"
        "2. Write your solution\n"
        "3. Test it with the provided examples using execute_code\n"
        "4. If you receive feedback from a tester, fix the identified issues\n"
        "5. Wrap your final solution in <solution>...</solution> tags\n"
    ),
    tool_registry=tool_registry,
    max_turns=5,
)

# Unit test agent writes and runs comprehensive tests
critic_agent = AgentNode(
    name="UnitTestAgent",
    system_prompt=(
        "You are an expert software tester. Your job is to verify a proposed solution "
        "by writing and running comprehensive unit tests.\n\n"
        "ORIGINAL PROBLEM: Given an array a of length n, perform eversions until the array stops changing. "
        "...\n\n"
        "WORKFLOW:\n"
        "1. Read the ORIGINAL PROBLEM and the proposed solution\n"
        "2. Write comprehensive test cases covering all examples + edge cases\n"
        "3. Run the tests using execute_code\n"
        "4. Report results clearly\n\n"
        "OUTPUT FORMAT:\n"
        "- List each test case and its result (PASS/FAIL)\n"
        "- If any test fails, explain what went wrong\n"
        "- Final verdict: CORRECT or INCORRECT\n"
    ),
    tool_registry=tool_registry,
    max_turns=5,
)

# Solver -> Tester -> Solver refines -> Tester -> final
reflection = ReflectionNode(
    name="CodeSolverTestReflection",
    agent=solver_agent,
    critic_agent=critic_agent,
    num_iterations=2,
)

workflow = Workflow(name="code_solver_unittest")
workflow.add_node(reflection)
```
**Executor: CodeSolver – iteration 1**

I need to understand the problem better. Let me think about the eversion operation:

1. Take the last element $x$
2. Partition the array into elements $\leq x$ (left) and $>x$ (right), maintaining original order
3. Concatenate left + right

We need to find the minimum $k$ such that after $k$ eversions the array stops changing.

Let me test with the example. First example: $[2,4,1,5,3]$. Last element is 3. Partition: elements $\leq 3$ are $[2,1,3]$, elements $>3$ are $[4,5]$. Result: $[2,1,3,4,5]$.

…similar walk-through of the second and third examples…

Wait, let me think: when we partition, the last element $x$ is in the left partition (since it's $\leq x$). So after partitioning, $x$ will be at the end of the left partition. Then we concatenate left + right. So the last element of the array after eversion is the last element of the left partition, which is $x$. So the last element remains the same! That means the array will stop changing when the last element is the maximum element in the array.

*Let me write a function to simulate the eversion and see what's happening.*

Tool call: `execute_code` with a brute-force simulator.
```text
stdout:
First example:  Original: [2, 4, 1, 5, 3]; k: 1
Second example: Original: [5, 3, 2, 4, 1]; k: 2
Third example:  Original: [1, 1, 1, 1];    k: 0
```
**Judge (UnitTestAgent)**

I'll analyze the response to the problem. Let me first understand the problem and then critique the proposed solution.

Understanding the Problem: We need to find the minimum number of eversions (partition operations) until the array stops changing. An eversion takes the last element $x$, partitions the array into elements $\leq x$ (left) and $>x$ (right), then concatenates left + right.

Key Insight from the Response: The response correctly identifies that the last element changes in each eversion. Specifically, after an eversion, the new last element becomes the last element of the right partition (elements $>x$). This means we need to find how many steps it takes for the last element to become the maximum element in the array.

Critique of the Response:

1. Logical flaw: the response has a critical error in its reasoning. It claims that "if current $<M$, then $M$ is the maximum element $>$ current", but this is not true. In the second example, when current $=1$, $M=5$, but the maximum element $>1$ is 4, not 5. So the next current should be 4, not $M$.
2. Incomplete algorithm: the response doesn't provide a complete algorithm. It only describes the problem and the insight but doesn't give a working solution.
3. Missing edge cases: the response doesn't consider edge cases like duplicate elements, already sorted arrays, or arrays where all elements are equal.
4. No code implementation: the response doesn't provide any code implementation, making it impossible to test or use.
5. Confusion in terminology: the response uses "max of elements $>$ current" but doesn't explain how to compute this efficiently.

Let me write a complete solution and test it:

…`execute_code` tool call with a candidate `min_eversions` function and tests on the three sample inputs…
**Executor: CodeSolver – final solution**

I'll provide the final solution with a clear explanation.

Approach: The problem requires finding the minimum number of eversions until the array stops changing. The key insight is that after each eversion, the new last element is the last element of the right partition (elements $>x$). This means we need to find how many steps it takes for the last element to become the maximum element in the array.

Algorithm:

1. Find the maximum element $M$ in the array.
2. If the last element is already $M$, return 0.
3. Otherwise, find the number of steps where each step moves to the last element in the array that is greater than the current element.
4. This can be done efficiently by precomputing the maximum elements in decreasing order from the end of the array.

Solution:

```python
def solve():
    import sys

    input = sys.stdin.read
    data = input().split()

    t = int(data[0])
    idx = 1
    results = []

    for _ in range(t):
        n = int(data[idx]); idx += 1
        a = list(map(int, data[idx:idx + n])); idx += n

        # Find the maximum element
        M = max(a)

        # If the last element is already the maximum, no eversions needed
        if a[-1] == M:
            results.append(0)
            continue

        # Precompute the suffix-maxima in decreasing order
        max_elements = []
        for i in range(n - 1, -1, -1):
            if not max_elements or a[i] > max_elements[-1]:
                max_elements.append(a[i])

        # Walk the chain until current == M
        current = a[-1]
        k = 0
        while current != M:
            next_val = -1
            for val in max_elements:
                if val > current:
                    next_val = val
                    break
            if next_val == -1:
                break
            current = next_val
            k += 1

        results.append(k)

    sys.stdout.write("\n".join(map(str, results)))

if __name__ == "__main__":
    solve()
```

Confidence: High.
Final Answer. Tests passed: 1/1. Submitted solution accepted (✓). The critic caught the missing implementation in iteration 1 and the solver returned a clean final program in iteration 2.
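The transcript above reports only the judged verdict. As an additional sanity check (ours, not part of the original trajectory), the chain-walk logic can be cross-validated against a direct simulation of the eversion operation on random small arrays. The names `eversion`, `brute_force`, and `fast_answer` are hypothetical; `fast_answer` re-expresses the per-case logic of the submitted program as a standalone function.

```python
import random

def eversion(a):
    # One eversion: take the last element x, then stably partition the
    # array into elements <= x followed by elements > x.
    x = a[-1]
    return [e for e in a if e <= x] + [e for e in a if e > x]

def brute_force(a):
    # Apply eversions until the array stops changing; count the changing steps.
    k = 0
    while True:
        b = eversion(a)
        if b == a:
            return k
        a, k = b, k + 1

def fast_answer(a):
    # Standalone version of the submitted per-case logic: collect the suffix
    # maxima scanning from the end (an increasing list), then hop upward
    # through them until the global maximum is reached.
    M = max(a)
    suffix_maxima = []
    for v in reversed(a):
        if not suffix_maxima or v > suffix_maxima[-1]:
            suffix_maxima.append(v)
    current, k = a[-1], 0
    while current != M:
        current = next(v for v in suffix_maxima if v > current)
        k += 1
    return k

# Randomized cross-check on small arrays.
for _ in range(1000):
    arr = [random.randint(1, 10) for _ in range(random.randint(1, 12))]
    assert brute_force(arr) == fast_answer(arr), arr
```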
### E.2 A Case Study Where Reflection Repairs the Solver
We compare the same held-out AIME 2024 example under the SFT cold-start model and the RL checkpoint. This case is more diagnostic than a simple final-answer comparison: the RL trajectory first reaches an impossible geometric constraint, the critic identifies the faulty distance model, and the refined solver replaces it with a valid similar-triangles equation. In contrast, the SFT workflow uses an ensemble_judge structure, but the judge repeats the same invalid angle-packing assumption and no valid final answer is extracted.
Question. Eight circles of radius 34 are sequentially tangent, and two of the circles are tangent to AB and BC of triangle ABC, respectively. 2024 circles of radius 1 can be arranged in the same manner. The inradius of triangle ABC can be expressed as m/n, where m and n are relatively prime positive integers. Find m + n.
Final Answer. Ground truth: 197.
Case SFT: Ensemble + judge adds breadth but does not repair the model
Domain: Math / circle packing
Designer (raw output). Model: SFT model. Generated workflow pattern: ensemble_judge. Executor agents: AlgebraicSolver, GeometricSolver, CombinatorialSolver, MathJudge. Evaluation result: ✗ no valid final answer extracted.

Generated workflow excerpt (verbatim):

```python
from workflow import AgentNode, Workflow, ToolRegistry
from workflow.nodes import EnsembleNode

tool_registry = ToolRegistry()

solver1 = AgentNode(
    name="AlgebraicSolver",
    system_prompt=(
        "You are a mathematician who prefers algebraic approaches.\n"
        "Solve the problem using algebraic manipulation and equations.\n"
        "Consider the geometry: circles tangent to two sides of an angle and to each other.\n"
        "Derive relationships between the triangle's inradius, the circle radii, and the number of circles.\n"
        "Show your reasoning step by step.\n"
        "Put your final answer in \\boxed{} format."
    ),
    tool_registry=tool_registry,
    max_turns=1
)

solver2 = AgentNode(
    name="GeometricSolver",
    system_prompt=(
        "You are a mathematician who thinks geometrically and visually.\n"
        "Solve the problem using geometric intuition.\n"
        "Imagine the circles packed in the angle at vertex B of triangle ABC.\n"
        "Use properties of tangents, similar triangles, and circle arrangements.\n"
        "Show your reasoning step by step.\n"
        "Put your final answer in \\boxed{} format."
    ),
    tool_registry=tool_registry,
    max_turns=1
)

solver3 = AgentNode(
    name="CombinatorialSolver",
    system_prompt=(
        "You are a mathematician skilled in counting and patterns.\n"
        "Solve the problem by analyzing the pattern of circle arrangement.\n"
        "Note that the same pattern works for radii 34 and 1, suggesting a scaling factor.\n"
        "Relate the number of circles to the inradius through the geometry of the packing.\n"
        "Show your reasoning step by step.\n"
        "Put your final answer in \\boxed{} format."
    ),
    tool_registry=tool_registry,
    max_turns=1
)

judge_agent = AgentNode(
    name="MathJudge",
    system_prompt=(
        "You are a senior mathematician acting as a judge.\n\n"
        "You will receive solutions from three different solvers.\n"
        "Your task is to:\n"
        "1. Review each solution carefully\n"
        "2. Check the reasoning and calculations\n"
        "3. Identify which solution is correct (or most likely correct)\n"
        "4. Select ONE answer as the final answer\n\n"
        "IMPORTANT: You must select ONE answer and output it in \\boxed{answer} format."
    ),
    tool_registry=tool_registry,
    max_turns=1
)

ensemble = EnsembleNode(
    name="MathEnsemble",
    agents=[solver1, solver2, solver3],
    strategy="consensus",
    consensus_agent=judge_agent
)

workflow = Workflow(name="math_ensemble_judge")
workflow.add_node(ensemble)
```
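For readers unfamiliar with the script-based workflow format, the sketch below illustrates what an EnsembleNode with strategy="consensus" plausibly does at run time: fan the question out to every solver independently, then hand all candidate solutions to the consensus agent, whose prompt forces a single boxed verdict. This is a minimal reading of the node's semantics for illustration only; the callable-based interface and the prompt assembly are our assumptions, not the framework's actual implementation.

```python
from typing import Callable, List

def run_consensus_ensemble(
    question: str,
    solvers: List[Callable[[str], str]],
    consensus_agent: Callable[[str], str],
) -> str:
    # Fan-out: each solver answers the same question independently.
    candidates = [solver(question) for solver in solvers]

    # Fan-in: the consensus agent reviews all candidates and, per its system
    # prompt, must select ONE answer and emit it in \boxed{...} format.
    judge_input = question + "\n\n" + "\n\n".join(
        f"Solution {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    return consensus_agent(judge_input)
```

Note that nothing in this structure routes the judge's doubts back to the solvers, which is exactly the failure mode the case study exhibits below.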
Executor: SFT GeometricSolver output.

**Problem Type:** MATH

**Problem Analysis:** This is a geometry problem involving circle packing in an angle. We have triangle ABC with vertex B. At vertex B, we pack circles in the angle formed by sides AB and BC.

The key insight is that the packing pattern in an angle is determined by the angle itself. The number of circles that can fit in an angle depends on the angle's measure. Specifically, for circles of radius r packed in an angle theta, the number N of circles that can fit is approximately N = floor(pi/(2r) * tan(theta/2)), or something similar, but here we have exact counts given, so we can set up an equation.

For R = 34, N = 8. For R = 1, N = 2024. Thus:

pi/(2*34) * tan(theta/2) = 8
pi/(2*1) * tan(theta/2) = 2024

Divide the second equation by the first:

(pi/(2*1) * tan(theta/2)) / (pi/(2*34) * tan(theta/2)) = 2024/8
34 = 253

So 34 = 253? That's not true. Wait, that suggests my formula is wrong.

[the solver continues trying variants of the same angle-packing model]

This is too complex. Let me try a different approach: the packing in an angle is similar to circles in a wedge. The number of circles that can fit in an angle theta with radius r is given by N = floor(pi/(2r) * tan(theta/2)). But we have exact counts, so we can write: for r = 34, N = 8; for r = 1, N = 2024. Dividing again gives 34 = 253, impossible. So my formula is wrong.
Judge: SFT judge output (verbatim excerpt).

**Problem Type:** MATH

**Problem Analysis:** This is a geometry problem involving circle packing in an angle. We have:
- Triangle ABC with inradius r (to be found)
- Two configurations:
  - 8 circles of radius 34 arranged sequentially tangent to each other and tangent to sides AB and BC respectively
  - 2024 circles of radius 1 arranged in the same pattern

Key insight: The arrangement likely consists of circles packed along the angle bisector of angle ABC, each tangent to both sides AB and BC, with adjacent circles tangent to each other.

[the judge copies the incorrect branch rather than rejecting it]

Now apply to both cases:
1. R = 34, N = 8: r = 34 * (8 - sin(theta/2)) * cos(theta/2)
2. R = 1, N = 2024: r = 1 * (2024 - sin(theta/2)) * cos(theta/2)

Since r is the same, we can set them equal:

34 * (8 - sin(theta/2)) * cos(theta/2) = (2024 - sin(theta/2)) * cos(theta/2)

Assuming cos(theta/2) != 0, we can divide:

34 * (8 - sin(theta/2)) = 2024 - sin(theta/2)
272 - 34 sin(theta/2) = 2024 - sin(theta/2)
272 - 2024 = 34 sin(theta/2) - sin(theta/2)
-1752 = 33 sin(theta/2)
sin(theta/2) = -1752/33 = -53.0909...

This is impossible because sin(theta/2) must be positive (0 < theta/2 < 90 degrees). So my model is still wrong.

Failure mode. The SFT workflow samples multiple solver roles, but none of them provides a mechanism for forcing revision after the contradiction is found. The judge repeats the invalid model instead of rejecting it, so the transcript ends without a boxed valid answer.
Case RL: Reflection turns the contradiction into a corrected derivation
Domain: Math / circle packing
Designer (raw output). Model: RL checkpoint. Generated workflow pattern: reflection. Executor agents: MathSolver + MathCritic. Evaluation result: ✓ extracted answer 197.

Generated workflow excerpt (verbatim):

```python
from workflow import AgentNode, Workflow, ToolRegistry
from workflow.nodes import ReflectionNode

tool_registry = ToolRegistry()

solver_agent = AgentNode(
    name="MathSolver",
    system_prompt=(
        "You are an expert mathematician specializing in geometry and circle packing.\n\n"
        "APPROACH:\n"
        "1. Understand the configuration: triangle ABC with inradius r.\n"
        "2. There are two sequences of circles: 8 circles of radius 34, and 2024 circles of radius 1.\n"
        "3. Each sequence is tangent to two sides of the triangle and to each other.\n"
        "4. The centers of the circles in each sequence lie on a line parallel to the third side of the triangle.\n"
        "5. The distance between centers of consecutive circles is 2*radius.\n"
        "6. The inradius r is related to the distance from the incenter to the side, and the distance from the incenter to the line of centers.\n"
        "7. Set up equations using the given numbers 8, 34, 2024, 1.\n"
        "8. Solve for r, express as m/n, then compute m+n.\n\n"
        "If you receive feedback from a critic, carefully consider their points and:\n"
        "- Fix any errors in geometry or algebra.\n"
        "- Clarify any ambiguous steps.\n"
        "- Provide a corrected solution.\n\n"
        "OUTPUT: Show your complete reasoning step by step, then put your final answer in \\boxed{} format."
    ),
    tool_registry=tool_registry,
    max_turns=1
)

critic_agent = AgentNode(
    name="MathCritic",
    system_prompt=(
        "You are a meticulous math reviewer who checks geometry solutions for errors.\n\n"
        "YOUR TASK:\n"
        "1. Verify the geometric configuration: circles tangent to two sides of a triangle and to each other.\n"
        "2. Check the assumption that centers are collinear and equally spaced.\n"
        "3. Verify the relationship between the inradius, the distance from the incenter to the line of centers, and the number of circles.\n"
        "4. Check the algebraic manipulation when solving for r.\n"
        "5. Ensure the final expression for r is in lowest terms m/n, and m+n is computed correctly.\n\n"
        "Be thorough but constructive in your criticism."
    ),
    tool_registry=tool_registry,
    max_turns=1
)

reflection = ReflectionNode(
    name="SolverCriticReflection",
    agent=solver_agent,
    critic_agent=critic_agent,
    num_iterations=2
)

workflow = Workflow(name="math_solver_critic")
workflow.add_node(reflection)
```
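By contrast, a ReflectionNode with num_iterations=2 plausibly executes a solver-critic loop in which the critique is injected into the solver's next attempt. Again, this is an assumed sketch of the semantics with hypothetical callables, not the framework's actual implementation:

```python
from typing import Callable

def run_reflection(
    question: str,
    solver: Callable[[str], str],
    critic: Callable[[str], str],
    num_iterations: int = 2,
) -> str:
    # Initial attempt.
    answer = solver(question)
    # Each further iteration routes a critique back into the solver.
    for _ in range(num_iterations - 1):
        critique = critic(
            f"Problem:\n{question}\n\nProposed solution:\n{answer}"
        )
        answer = solver(
            f"Problem:\n{question}\n\nYour previous solution:\n{answer}\n\n"
            f"Critic feedback:\n{critique}\n\nProvide a corrected solution."
        )
    return answer
```

The feedback edge is what lets the contradiction found in the first attempt below trigger a model change rather than a restatement.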
Executor: RL MathSolver first attempt.

**Problem Type:** MATH

**Problem Analysis:** This is a geometry problem involving circle packing in a triangle. We have:
- Triangle ABC with inradius r
- Two sequences of circles:
  1. 8 circles of radius 34, each tangent to two sides of the triangle and to each other
  2. 2024 circles of radius 1, arranged similarly
- Each sequence's centers lie on a line parallel to the third side
- The distance between centers of consecutive circles in a sequence is 2*radius

**Approach:**
1. Let the triangle have sides a, b, c with inradius r
2. For the sequence of circles tangent to sides AB and BC, the line of centers is parallel to side AC
3. The distance from the incenter to side AC is r
4. The distance from the incenter to the line of centers of the 34-radius circles is r - 34 (or r - 1 for the 1-radius circles)
5. The distance between centers of consecutive circles in a sequence is 2*radius

[the first attempt derives equations from the wrong distance relation]

But we also know that this length is (n-1) * 2r1. So:

L * (r - d)/r = (n-1) * 2r1

But d = r - r1, so:

L * (r - (r - r1))/r = (n-1) * 2r1
L * r1/r = (n-1) * 2r1
L/r = 2(n-1)
L = 2r(n-1)

So L = 2r(n-1). But this must be the same for both sequences, so:

2r(8-1) = 2r(2024-1)
14r = 4046r
14 = 4046

This is impossible, so my assumption that the line of centers is at distance r - r1 from AC is wrong.
Judge: RL MathCritic feedback (verbatim excerpt).

**Critique of the Response:**

**1. Errors and Inaccuracies:**
- **Misunderstanding of Geometry Configuration:** The problem states "eight circles of radius 34 are sequentially tangent, and two of the circles are tangent to AB and BC of triangle ABC, respectively." This suggests there are two sequences of circles: one tangent to AB and BC (and the other two sides?), and another sequence of 2024 circles of radius 1 arranged similarly.
- **Incorrect Distance Assumption:** The response incorrectly assumes the line of centers is at distance r - r1 from side AC, but this leads to a contradiction. The correct distance from the line of centers to side AC should be derived from the geometry of the triangle and the circle tangency conditions.
- **Algebraic Error:** The derivation leads to an impossible equation (14 = 4046), indicating a fundamental flaw in the geometric model.

**4. Specific Improvements:**
- **Clarify the Geometry:** There are two sequences of circles: one tangent to AB and BC, and another tangent to AB and AC (or BC and AC). Each sequence has its own line of centers parallel to the third side.
- **Correct Distance Calculation:** The distance from the line of centers to the third side should be derived using similar triangles or coordinate geometry.
- **Set Up Correct Equations:** For each sequence, relate the number of circles, their radii, and the inradius using the geometry of the triangle.
- **Solve for Inradius:** Use the two sequences to set up two equations and solve for the inradius.
Executor: RL MathSolver refined output after reflection.

**Improved Solution:**

Let's denote:
- For the 34-radius circles: n1 = 8, r1 = 34
- For the 1-radius circles: n2 = 2024, r2 = 1

Consider the sequence of circles tangent to sides AB and BC. Their centers lie on a line parallel to side AC. Let L be the length of side AC.

The length of the segment parallel to AC at distance d from AC is L * (r - d)/r. This length should equal (n - 1) * 2r1, because the distance between consecutive centers is 2r1 and there are (n - 1) gaps between n circles. So we have:

L * (r - r1)/r = (n - 1) * 2r1

For the first sequence (r1 = 34, n = 8):
L * (r - 34)/r = 7 * 2 * 34 = 476

For the second sequence (r1 = 1, n = 2024):
L * (r - 1)/r = 2023 * 2 * 1 = 4046

Divide the second equation by the first:
(r - 1)/(r - 34) = 4046/476

Simplify 4046/476: both are divisible by 2, giving 2023/238; since 238 = 2 * 7 * 17 and 2023 = 7 * 17^2, this reduces to 17/2. So:

(r - 1)/(r - 34) = 17/2
2(r - 1) = 17(r - 34)
2r - 2 = 17r - 578
576 = 15r
r = 576/15 = 192/5

So the inradius is r = 192/5, already in lowest terms (192 and 5 are coprime). Thus m = 192, n = 5, and m + n = 197.

**Answer:** \boxed{197}

**Confidence:** High
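The final algebra above is easy to machine-check. The snippet below is added for verification and is not part of the trajectory; it replays the ratio simplification and the linear solve with exact rational arithmetic:

```python
from fractions import Fraction

# Ratio of the two tangency equations: (r - 1)/(r - 34) = 4046/476 = 17/2.
assert Fraction(4046, 476) == Fraction(17, 2)

# Solve 2(r - 1) = 17(r - 34)  =>  15r = 576  =>  r = 192/5.
r = Fraction(576, 15)
assert r == Fraction(192, 5)

# m/n is already in lowest terms, so m + n = 192 + 5 = 197.
assert r.numerator + r.denominator == 197
```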
Migration signal. The useful behavior is not just that RL selects a different team size. The reflection workflow routes the contradiction back through a critic, and the solver's final pass changes the mathematical model. The repair is visible in the exact equations: the failed first attempt collapses to 14 = 4046, while the refined attempt uses L(r - ρ)/r = 2ρ(n - 1), giving r = 192/5 and m + n = 197.