When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

arXiv cs.AI 05/26/26, 04:00 AM Papers
multi-agent reinforcement-learning llm-workflows policy-sharing tradeoffs empirical-study
Summary
This paper studies when end-to-end reinforcement learning training improves multi-agent LLM workflows, comparing shared-policy and isolated-policy training across different workflows, tasks, and model scales, revealing conditional tradeoffs.
arXiv:2605.24202v1 Announce Type: new Abstract: Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.
Cached at: 05/26/26, 09:06 AM
# When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs
Source: [https://arxiv.org/html/2605.24202](https://arxiv.org/html/2605.24202)
Yifan Zeng¹, Yiran Wu², Yaolun Zhang¹, Wentian Zhao³, Kun Wan³, Qingyun Wu²,⁴, Huazheng Wang¹,⁴

¹Oregon State University, ²Pennsylvania State University, ³Adobe Inc., ⁴AG2AI, Inc.

{zengyif, zhanyaol, huazheng.wang}@oregonstate.edu, {yiran.wu, qingyun.wu}@psu.edu, {wezhao, kuwan}@adobe.com

###### Abstract

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy (SP) training, where all roles update one policy, with Isolated-Policy (IP) training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. IP tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while SP training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under IP, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under SP, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.

## 1 Introduction

Multi-agent LLM workflows have become a common inference-time way to extract more capability than a single forward pass can deliver, often by decomposing generation, evaluation, aggregation, or delegation across multiple model calls (Yang et al., [2026](https://arxiv.org/html/2605.24202#bib.bib19); Wu et al., [2023](https://arxiv.org/html/2605.24202#bib.bib25); Madaan et al., [2023](https://arxiv.org/html/2605.24202#bib.bib23); Wang et al., [2023](https://arxiv.org/html/2605.24202#bib.bib24)). In parallel, reinforcement learning with verifiable rewards (RLVR) has become a standard way to improve the reasoning capabilities of large language models (LLMs), with notable success on mathematics and code (Shao et al., [2024](https://arxiv.org/html/2605.24202#bib.bib8); Guo et al., [2025](https://arxiv.org/html/2605.24202#bib.bib32); Yu et al., [2025](https://arxiv.org/html/2605.24202#bib.bib26); Zhao et al., [2025a](https://arxiv.org/html/2605.24202#bib.bib5)). More recently, RLVR has been applied to train multi-step LLM agents, where task success can be evaluated by an outcome signal at the end of an episode (Jin et al., [2025](https://arxiv.org/html/2605.24202#bib.bib2); Wei et al., [2025](https://arxiv.org/html/2605.24202#bib.bib3); Wang et al., [2025](https://arxiv.org/html/2605.24202#bib.bib4)). This progression naturally extends to multi-agent systems, where the workflow consists of multiple interacting components and the final verifiable outcome can provide a reward signal for optimizing the system end to end.

Recent work has begun extending Group Relative Policy Optimization (GRPO; Shao et al., [2024](https://arxiv.org/html/2605.24202#bib.bib8)) to multi-agent settings (Zhao et al., [2025b](https://arxiv.org/html/2605.24202#bib.bib1); Liu et al., [2025](https://arxiv.org/html/2605.24202#bib.bib9); Hong et al., [2025](https://arxiv.org/html/2605.24202#bib.bib10); Chen et al., [2025](https://arxiv.org/html/2605.24202#bib.bib11); Feng et al., [2026](https://arxiv.org/html/2605.24202#bib.bib13)). These efforts show that RL can be applied to specific multi-agent workflows, such as debate, search, or hierarchical tool use, but they do not provide a systematic picture of when multi-agent RL improves performance or why training succeeds or fails. This paper therefore asks a broader question: *when does reinforcement learning improve a multi-agent workflow, and what training dynamics explain the outcome?*

To answer this question, we run a systematic study across three workflows (Eval-Opt, Voting, and Orch-Workers), three model scales (0.6B, 1.7B, 4B), two tasks (math and code), and two policy-sharing strategies (a shared policy used by every role, and an isolated policy per role). Every cell is compared against a base-model control and a single-agent RL control at matched scale and task, so we can separate the gain attributable to multi-agent training from the gain that single-agent RL alone would already produce. Figure [1](https://arxiv.org/html/2605.24202#S1.F1) shows the workflow topology and policy routing under study.

![Refer to caption](https://arxiv.org/html/2605.24202v1/x1.png)

Figure 1: Workflow topology and policy routing. Three workflows (Eval-Opt, Voting, Orch-Workers), each trained under Isolated-Policy (one $\pi_{\text{role}}$ per role) or Shared-Policy (one $\pi_{\text{shared}}$ for all roles), with a shared outcome-reward GRPO loop.

From this design, we gain three empirical insights about when multi-agent RL training improves over the base workflow. (1) Multi-agent RL training usually improves over base models, but the choice of policy-sharing strategy trades ceiling against floor: Isolated-Policy routes more often reach higher peak accuracy yet are more prone to late-training degradation, while Shared-Policy routes are conservative on the upside and still subject to late-training drift. (2) Whether and how much joint training helps depends jointly on workflow and task, not on the policy-sharing axis alone: the same policy-sharing choice has different consequences across Eval-Opt, Voting, and Orch-Workers workflows, and across math and code. (3) SP training redistributes degradation rather than eliminating it, and some SP failure patterns hide from token-level training metrics and surface only at the episode-correctness layer (§[3](https://arxiv.org/html/2605.24202#S3), §[4](https://arxiv.org/html/2605.24202#S4)). We then explain the strongest of these insights through two role-level gradient mechanisms induced by workflow topology and policy routing: gradient\_amplification under Isolated-Policy, and sp\_role\_capture under Shared-Policy (§[5](https://arxiv.org/html/2605.24202#S5)).

## 2 Related Work

##### Multi\-agent RL training of LLM workflows\.

A growing body of work extends reinforcement learning, especially GRPO (Shao et al., [2024](https://arxiv.org/html/2605.24202#bib.bib8)), to multi-agent LLM systems. One line of work develops GRPO-style training objectives for different multi-agent workflow structures. AT-GRPO (Zhao et al., [2025b](https://arxiv.org/html/2605.24202#bib.bib1)) introduces agent- and turn-wise grouping with tree-structured sampling, MAGRPO (Liu et al., [2025](https://arxiv.org/html/2605.24202#bib.bib9)) formulates LLM collaboration as a Dec-POMDP with centralized group-relative advantages and decentralized execution, M-GRPO (Hong et al., [2025](https://arxiv.org/html/2605.24202#bib.bib10)) addresses hierarchical multi-agent systems through trajectory alignment for variable-length sub-agent invocations, and MHGPO (Chen et al., [2025](https://arxiv.org/html/2605.24202#bib.bib11)) designs heterogeneous group advantages for multi-agent search pipelines. A second line studies policy adaptation and reinforcement fine-tuning at the system level, including MPDF (Yang and Thomason, [2025](https://arxiv.org/html/2605.24202#bib.bib14)), which learns adaptive meta-policies for agent deliberation through rank-based RL, and MARFT (Liao et al., [2025](https://arxiv.org/html/2605.24202#bib.bib12)), which provides a conceptual framework for multi-agent reinforcement fine-tuning with LoRA adapters. Another recent direction analyzes optimization stability in multi-agent RL; for example, Dr. MAS (Feng et al., [2026](https://arxiv.org/html/2605.24202#bib.bib13)) identifies gradient-norm imbalance caused by global advantage normalization across agents and proposes agent-wise normalization. Overall, these methods advance multi-agent RL by introducing algorithmic components tailored to specific workflow patterns or optimization issues.
Our work is complementary: rather than proposing a new training algorithm, we conduct a controlled cross-workflow, cross-scale, cross-task empirical study with base-model and single-agent-RL controls, and explain the strongest observed patterns through role-level gradient dynamics induced by workflow topology and policy routing.

##### Diversity collapse and role drift in LLM RL\.

Several recent threads address related symptoms in single-policy and multi-agent LLM RL. Long et al. ([2025](https://arxiv.org/html/2605.24202#bib.bib36)) treat diversity collapse under reinforcement learning with verifiable reward as a divergence-choice problem and argue for replacing forward KL with alternatives that better preserve sample diversity. Wang et al. ([2026](https://arxiv.org/html/2605.24202#bib.bib37)) detect and repair role drift in multi-agent collaboration through lightweight protocol-level monitors. Zhang et al. ([2025](https://arxiv.org/html/2605.24202#bib.bib38)) address the converse pathology of agent under-contribution in multi-agent reasoning, where one agent dominates and others degenerate into passive roles. Our work is structural: we identify which workflow topologies and policy-routing choices produce role drift in the first place, and explain the strongest observed patterns through gradient mechanisms (gradient\_amplification under Isolated-Policy, sp\_role\_capture under Shared-Policy).

## 3 Experimental Setup

Our experiments are organized as a controlled grid over four dimensions: task, model scale, policy-routing strategy, and workflow (Figure [1](https://arxiv.org/html/2605.24202#S1.F1)). The task dimension covers mathematical reasoning and code-generation settings; the model dimension varies the size of the underlying base model; the policy-routing dimension compares shared and isolated policies; and the workflow dimension evaluates three different topologies with distinct role structures and communication patterns. This grid allows us to separate effects that arise from the task domain, the model scale, the way policies are shared across roles, and the topology of the agent system itself. Each workflow–task–scale–policy configuration is trained once with a fixed seed, and we include both a base-model evaluation with no training and a single-agent reinforcement-learning control trained on the same task and scale. These controls allow the later analysis to distinguish general RL-induced drift from drift that is specific to multi-agent interaction. Training uses GRPO (Shao et al., [2024](https://arxiv.org/html/2605.24202#bib.bib8)) without an explicit KL penalty against a reference policy: each training step draws $n=8$ workflow rollouts per problem and computes the relative advantage within the resulting group of rollouts.
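The group-relative advantage described above can be sketched as follows. This is a minimal illustration that assumes the commonly used GRPO form, in which each rollout's outcome reward is standardized by the mean and standard deviation of its group; the function name, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are assumptions, not details confirmed by the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize outcome rewards within one group of rollouts for the same problem.

    Assumption: mean/std normalization with a small eps; the paper only states
    that the advantage is computed relative to the group of n = 8 rollouts.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One problem, n = 8 workflow rollouts, binary verifiable reward (illustrative values).
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Rollouts that solve the problem receive a positive advantage and the rest a negative one; when every rollout in the group gets the same reward, all advantages are zero and the problem contributes no gradient signal at that step.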

Dataset. We use DAPO-Math-17K (Yu et al., [2025](https://arxiv.org/html/2605.24202#bib.bib26)), which we refer to as Math, and DeepCoder (Luo et al., [2025](https://arxiv.org/html/2605.24202#bib.bib29)), which we refer to as Code. Math evaluates multi-step mathematical reasoning, while Code evaluates code-oriented problem solving.

Models. Across both tasks, the underlying model family is Qwen3 (Yang et al., [2025](https://arxiv.org/html/2605.24202#bib.bib31)), instantiated at three scales: Qwen3-0.6B, Qwen3-1.7B, and Qwen3-4B. Using the same model family across scales keeps the architecture fixed while allowing us to test whether role specialization, policy drift, and workflow-level behavior change as model capacity increases.

### 3.1 Multi-Agent Workflows

We study three multi-agent workflows: Eval-Opt, Voting, and Orch-Workers. *Eval-Opt* is a two-role workflow consisting of a generator and an evaluator. The generator produces an initial answer, while the evaluator judges the answer, provides a verdict and critique, and may induce revision. This workflow tests whether separating solution generation from evaluation creates useful role specialization or instead causes one role to dominate the optimization dynamics.

*Voting* is a two-role workflow with three generators and one aggregator. The three generators independently produce candidate answers, and the aggregator selects or combines among them. This workflow introduces same-role multiplicity: the three generator slots play the same functional role but may produce different outputs for the same prompt. We therefore track not only the aggregator’s final decision but also the distribution of generator answers, the aggregator’s selection behavior, and the pass@$k$ oracle over the three generator outputs.
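The per-episode quantities tracked for Voting can be sketched as below. This is an illustrative reading only: the function names, the exact-match comparison against a reference answer, and the dictionary keys are assumptions, since the paper does not specify how answers are normalized or verified.

```python
def oracle_pass_at_k(candidates, reference):
    """True if any of the k generator answers matches the reference answer."""
    return any(c == reference for c in candidates)

def voting_episode_stats(candidates, selected, reference):
    """Summarize one Voting episode: oracle, final correctness, answer diversity."""
    return {
        "oracle": oracle_pass_at_k(candidates, reference),
        "final_correct": selected == reference,
        "distinct_answers": len(set(candidates)),
    }

# Hypothetical episode: two generators are right, the aggregator picks the wrong one.
stats = voting_episode_stats(["4", "5", "4"], selected="5", reference="4")
print(stats)
```

Comparing `oracle` with `final_correct` separates generator-side failures (no candidate is correct) from aggregator-side failures (a correct candidate existed but was not selected).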

*Orch-Workers* is a three-role workflow consisting of an orchestrator, three workers, and a synthesizer. The orchestrator proposes a plan or decomposition, the workers generate candidate solutions or partial responses, and the synthesizer produces the final answer. Like Voting, this workflow includes same-role multiplicity, but the repeated role appears in a more explicitly hierarchical pipeline. This makes Orch-Workers useful for studying whether coordination roles and worker roles drift differently under the same outcome-level training signal.

### 3.2 Policy-Sharing Strategy

We compare two policy-routing strategies: Shared-Policy (SP) and Isolated-Policy (IP). Under SP, all roles in a workflow use a single shared policy, so gradients from generators, evaluators, aggregators, orchestrators, workers, and synthesizers all update the same parameters. This setting tests whether one policy can support multiple conversational and functional roles without role interference. Under IP, each distinct role is assigned its own role-specific policy adapter. Same-role multiplicity does not create separate adapters: for example, the three generators in Voting share one generator adapter, and the three workers in Orch-Workers share one worker adapter. Thus, IP isolates policies by role type rather than by individual agent instance. This design lets us compare role-specialized learning against fully shared learning while keeping the treatment of repeated same-role agents consistent across workflows.
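The routing rule above can be made concrete with a small sketch that maps each agent slot to the adapter it updates. The slot lists follow the workflow descriptions in §3.1; the dictionary, the `route` function, and the adapter names are hypothetical and are not taken from the authors' code.

```python
# Agent slots per workflow, as described in Section 3.1.
WORKFLOW_SLOTS = {
    "eval_opt": ["generator", "evaluator"],
    "voting": ["generator", "generator", "generator", "aggregator"],
    "orch_workers": ["orchestrator", "worker", "worker", "worker", "synthesizer"],
}

def route(workflow, strategy):
    """Return the adapter name used by each slot of the workflow."""
    slots = WORKFLOW_SLOTS[workflow]
    if strategy == "SP":
        # Shared-Policy: every slot updates the same parameters.
        return ["shared"] * len(slots)
    if strategy == "IP":
        # Isolated-Policy: adapters are keyed by role type, not by agent instance.
        return list(slots)
    raise ValueError(f"unknown strategy: {strategy}")

print(route("voting", "IP"))        # three generator slots share one adapter
print(route("orch_workers", "SP"))  # all five slots share one adapter
```

Under IP, Voting therefore trains two adapters and Orch-Workers three, while the repeated generator and worker slots each feed a single adapter.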

### 3.3 Analysis Protocol

The analysis follows the structure of the paper. §[4](https://arxiv.org/html/2605.24202#S4) compares outcomes across the workflow–task–scale–policy grid. §[5](https://arxiv.org/html/2605.24202#S5) uses role-level and trajectory-level evidence to explain the strongest empirical patterns, especially the higher-ceiling-and-lower-floor behavior of Isolated-Policy training and the failure redistribution observed under Shared-Policy training. §[A.6](https://arxiv.org/html/2605.24202#A1.SS6) translates these findings into design implications and monitoring recommendations.

## 4 Results

Table 1: Validation accuracy (%) across the task $\times$ scale $\times$ workflow matrix, with a single-agent RL (SA-RL) baseline. Each cell reports the peak validation accuracy; IP and SP residuals are MA-RL minus SA-RL in percentage points. SA-RL is workflow-agnostic and reported once per (task, scale) block. Code abbreviates DeepCoder.

We present the empirical landscape of multi-agent RL training across the workflow, scale, task, and policy-sharing matrix. §[4.1](https://arxiv.org/html/2605.24202#S4.SS1) establishes that multi-agent RL usually improves over the base model. §[4.2](https://arxiv.org/html/2605.24202#S4.SS2) contrasts Isolated-Policy and Shared-Policy routes on validation accuracy and on training-side instability amplitude. §[4.3](https://arxiv.org/html/2605.24202#S4.SS3) reads the workflow and task interactions. §[4.4](https://arxiv.org/html/2605.24202#S4.SS4) closes the empirical claim by showing that SP training changes which failure patterns appear rather than removing them.

![Refer to caption](https://arxiv.org/html/2605.24202v1/x2.png)

Figure 2: Per-cell validation accuracy delta against the base model, across the workflow $\times$ scale $\times$ task matrix and colored by policy (IP/SP).

![Refer to caption](https://arxiv.org/html/2605.24202v1/x3.png)

Figure 3: IP vs SP validation accuracy on matched cells. Markers above the diagonal favor IP. Marker shape encodes workflow, fill color encodes task (Math vs Code), and marker size encodes scale.

### 4.1 Multi-Agent RL Usually Improves Over Base Models

Across the workflow, scale, and task matrix, the headline pattern is positive: multi-agent RL training reaches higher validation accuracy than the corresponding base model in the large majority of cells, and on most cells the multi-agent accuracy matches or exceeds the single-agent RL baseline at the same scale and task. Table [1](https://arxiv.org/html/2605.24202#S4.T1) reports per-cell base accuracy, the single-agent RL accuracy, the IP and SP multi-agent accuracies, and the IP and SP residuals against the single-agent baseline; Fig. [2](https://arxiv.org/html/2605.24202#S4.F2) shows the same landscape as a per-cell delta against the base model, grouped by workflow, task, and scale.

The improvements are not uniform. On a minority of cells the multi-agent accuracy slips below the single-agent baseline at matched scale and task; the cells where this happens are visible directly in Table [1](https://arxiv.org/html/2605.24202#S4.T1) as negative residuals. Multi-agent RL is therefore usually a net win over the base model, but the share of that win attributable to multi-agent training (rather than to RL on the underlying single-agent policy) varies across cells, and the variation has structure that the next subsections expose. Table [3](https://arxiv.org/html/2605.24202#A1.T3) in Appendix §[A.2](https://arxiv.org/html/2605.24202#A1.SS2) reports the matched single-agent RL policy run on each multi-agent workflow, isolating the workflow contribution from the multi-agent training contribution.
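The two comparisons used in this section, the delta against the base model (Fig. 2) and the residual against single-agent RL (Table 1), can be sketched for a single cell as follows. The function name and the accuracy values are hypothetical placeholders chosen for illustration; they are not entries from Table 1.

```python
def cell_summary(base, sa_rl, ip, sp):
    """Per-cell deltas in percentage points from peak validation accuracies (%)."""
    return {
        "ip_delta_vs_base": ip - base,   # gain over the untrained base workflow
        "sp_delta_vs_base": sp - base,
        "ip_residual": ip - sa_rl,       # MA-RL minus SA-RL
        "sp_residual": sp - sa_rl,
    }

# Hypothetical cell: both routes beat the base model, but SP trails single-agent RL.
summary = cell_summary(base=40.0, sa_rl=48.0, ip=52.5, sp=47.0)
print(summary)
```

A positive delta with a negative residual marks a cell where training helps overall but the multi-agent component does not add to what single-agent RL already provides.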

### 4.2 Isolated Policies Have a Higher Ceiling but Lower Floor

Comparing Isolated-Policy and Shared-Policy at matched workflow, scale, and task reveals a consistent shape: on the cells where multi-agent training helps, IP routes more often reach the higher accuracy. The matched comparison is summarized in Fig. [3](https://arxiv.org/html/2605.24202#S4.F3), which plots IP accuracy against SP accuracy with a diagonal reference; the points cluster above the diagonal across most matched cells. This is the higher-ceiling half of the claim.

The lower\-floor half is more subtle\. We support it with two complementary readings: a within\-\(scale, task\) training\-dynamics reading restricted to one cell group where termination conditions are uniform across rows, and a cross\-experiment training\-side amplitude comparison\.

##### Within-cell training dynamics at 1.7B $\times$ Math.

At fixed (scale, task) = 1.7B $\times$ Math, training-side dynamics are plotted for each multi-agent workflow’s Isolated-Policy and Shared-Policy runs over each workflow’s full training run. Figure [4](https://arxiv.org/html/2605.24202#S4.F4) shows training success rate (top row) and per-checkpoint validation accuracy (bottom row), one column per workflow. Across all three multi-agent workflows, IP rises faster early and reaches a higher peak; at later training steps IP falls back toward or below SP while SP plateaus at a lower but stable validation accuracy. The within-cell shape is the within-(1.7B, Math) signature of the lower-floor claim; the same comparison across (scale, task) groups would mix runs that terminate under different training-budget conditions.

![Refer to caption](https://arxiv.org/html/2605.24202v1/x4.png)

Figure 4: Training-side dynamics at 1.7B $\times$ Math across the three multi-agent workflows (Eval-Opt, Voting, Orch-Workers). Top row: training success rate. Bottom row: per-checkpoint validation accuracy.

##### Cross-experiment amplitude: training-side instability is larger under IP.

Within-cell training dynamics at 1.7B $\times$ Math (Fig. [4](https://arxiv.org/html/2605.24202#S4.F4)) only extend across a single cell group. A run-length-tolerant statistic that does extend across the matched matrix is the maximum amplitude of a training-side instability metric over the full training run. Fig. [5](https://arxiv.org/html/2605.24202#S4.F5) reports three such statistics per matched cell and per policy: the maximum token-level policy ratio ($\chi^{2}$), the maximum gradient norm, and the entropy collapse depth (initial entropy minus the minimum entropy on the rollout actor’s per-token entropy series). Across the matched matrix, Isolated-Policy shows systematically larger gradient-norm amplitude than Shared-Policy, and larger policy-ratio amplitude on most cells; the exceptions are taken up in §[4.4](https://arxiv.org/html/2605.24202#S4.SS4).
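Given per-step training logs, the three amplitude statistics can be computed as in the sketch below. The function name, argument names, and logged values are assumptions for illustration; only the definitions (maximum over the run, and initial minus minimum for entropy) come from the text.

```python
def instability_amplitudes(policy_ratio, grad_norm, entropy):
    """Run-level amplitude statistics from per-step training series.

    Each argument is a list with one logged value per training step.
    """
    return {
        "max_policy_ratio": max(policy_ratio),
        "max_grad_norm": max(grad_norm),
        # Entropy collapse depth: initial entropy minus the minimum over the run.
        "entropy_collapse_depth": entropy[0] - min(entropy),
    }

# Hypothetical logs for one run.
stats = instability_amplitudes(
    policy_ratio=[0.01, 0.04, 0.02],
    grad_norm=[0.8, 1.5, 1.1],
    entropy=[0.5, 0.25, 0.375],
)
print(stats)
```

Because each statistic is an extremum over the whole run, it stays comparable between runs of different lengths, which is what makes it usable across the matched matrix.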

![Refer to caption](https://arxiv.org/html/2605.24202v1/x5.png)

Figure 5: Training-side instability amplitude on matched IP-vs-SP cells. Three statistics per cell and policy: $\max$ token-level policy ratio ($\chi^{2}$), $\max$ adapter gradient norm, and entropy collapse depth (initial $-$ minimum on the rollout actor’s per-token entropy). Each statistic is the maximum (or initial-minus-min for entropy collapse) over the full training run, a run-length-tolerant cross-experiment statistic. IP carries larger gradient-norm amplitude than SP across matched cells, and larger policy-ratio amplitude on most cells; the exceptions on policy-ratio amplitude are detailed in §[4.4](https://arxiv.org/html/2605.24202#S4.SS4). Entropy collapse depth is small across both policies and less discriminating. Workflow codes: E = Eval-Opt, V = Voting, O = Orch-Workers.

The pattern is therefore that Isolated-Policy and Shared-Policy carry different risk profiles. IP can climb to a higher accuracy ceiling (Fig. [3](https://arxiv.org/html/2605.24202#S4.F3)), but within a single (scale, task) cell group the IP advantage flips or shrinks at later training steps (Fig. [4](https://arxiv.org/html/2605.24202#S4.F4)), and the same training trajectory drives larger training-side amplitude across the matched matrix (Fig. [5](https://arxiv.org/html/2605.24202#S4.F5)). SP plateaus earlier and lower, but its training-side amplitude stays smaller across most cells. §[4.4](https://arxiv.org/html/2605.24202#S4.SS4) clarifies that the SP plateau’s lower training-side amplitude does not translate to a stably higher validation trajectory at later training steps; some SP cells drift down through patterns that token-level training metrics do not register.

### 4.3 Improvements Depend on Workflow and Task

Reading Fig. [2](https://arxiv.org/html/2605.24202#S4.F2) along the workflow and task axes shows that the magnitude and even the direction of multi-agent improvement depend jointly on what the workflow is doing and on the task. Eval-Opt, Voting, and Orch-Workers are not interchangeable: at matched scale and task, the per-cell residuals against the single-agent baseline differ across workflows, and the ranking of workflows is not constant across tasks or scales. Math and Code also behave differently: at matched workflow, the same policy route can yield a clear gain on one task and slip below the single-agent baseline on the other.

### 4.4 Shared-Policy Plateaus Can Still Drift Down at Later Steps

Shared-Policy’s smaller training-side amplitude does not entail a stably higher validation trajectory at later training steps. Two patterns clarify how. First, the training-amplitude advantage is not uniform: some SP cells (Eval-Opt SP $\times$ Code, Eval-Opt SP $\times$ 1.7B Math) carry larger token-level policy-ratio amplitude than their IP counterparts. Second, even on cells where SP’s token-level amplitude stays small, the validation trajectory can drift down at later training steps through patterns that token-level metrics do not register.

![Refer to caption](https://arxiv.org/html/2605.24202v1/x6.png)

Figure 6: Training dynamics on matched IP-vs-SP cliff cells. (a) Voting-IP-1.7B-Math: per-role generator diagnostics rise while the aggregator stays flat. (b) Orch-Workers-SP-4B-Math: global training-side diagnostics escalate ahead of a terminal validation cliff (per-role decomposition is unavailable under Shared-Policy, since a single shared adapter logs only global actor metrics). (c) Matched 1.7B Math Orch-Workers IP-versus-SP validation overlay. Metrics in (a) and (b) are normalized to their first logged value.

Two cells make the point at 4B Math, where the headroom is largest and the routes’ behaviors separate most sharply. At Voting-SP-4B-Math, validation accuracy plateaus and then drifts down within the run, and the trajectory-level signature lives on the aggregator slot: the role designed to emit a terse selection stamp migrates toward verbose justification at the validation-cliff step, while voter-side token-level metrics such as inter-voter Jaccard and per-token entropy remain near their first logged values. At Orch-Workers-SP-4B-Math, the within-experiment validation series falls off a late-training cliff. §[5.2](https://arxiv.org/html/2605.24202#S5.SS2) takes up the role-level dynamics behind these patterns.

## 5 Role-Level Gradient Dynamics

§[4](https://arxiv.org/html/2605.24202#S4) reports two patterns: an IP shape with a higher ceiling and lower floor (§[4.2](https://arxiv.org/html/2605.24202#S4.SS2)), and an SP shape whose plateaus can drift down at later steps (§[4.4](https://arxiv.org/html/2605.24202#S4.SS4)). Both share a common cause: role-level gradient pressure on a single role’s parameters, with the policy-sharing choice selecting the surface shape. §[5.1](https://arxiv.org/html/2605.24202#S5.SS1) takes up the IP shape; §[5.2](https://arxiv.org/html/2605.24202#S5.SS2) takes up the SP shape.

Both mechanisms can leave parallel same\-role slots drawing from a narrowed policy, and the shared inference\-time consequence on those slots is the same across both shapes\. When the narrowed mode lands on a wrong answer, the parallel slots are more likely to emit the same wrong answer than under the base model, sharpening slot\-level agreement on the wrong\-answer subpopulation\.
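One way to quantify this slot-level agreement is sketched below. The definition used here (among episodes in which every parallel slot is wrong, the fraction in which all slots emit the same wrong answer) is an assumption made for illustration, as are the function name and the example data; the paper's own trajectory-level metrics are in its appendix tables.

```python
def wrong_answer_agreement(episodes):
    """Fraction of all-wrong episodes whose parallel slots emit one identical answer.

    `episodes` is a list of (slot_answers, reference) pairs. Returns None when
    no episode has every slot wrong.
    """
    all_wrong = [
        answers for answers, reference in episodes
        if all(a != reference for a in answers)
    ]
    if not all_wrong:
        return None
    unanimous = sum(1 for answers in all_wrong if len(set(answers)) == 1)
    return unanimous / len(all_wrong)

# Hypothetical episodes with three parallel slots each.
episodes = [
    (["3", "3", "3"], "4"),  # all wrong, identical
    (["3", "5", "3"], "4"),  # all wrong, mixed
    (["4", "4", "3"], "4"),  # not all wrong, excluded
]
print(wrong_answer_agreement(episodes))
```

Tracking this value against the base model's value on the same prompts would indicate whether the parallel slots have narrowed toward a shared wrong mode.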

The two mechanisms have different manifestation geometries. The IP mechanism in §[5.1](https://arxiv.org/html/2605.24202#S5.SS1) has a single gradient source whose surface is selected by the workflow; the SP mechanism in §[5.2](https://arxiv.org/html/2605.24202#S5.SS2) has two distinct sources of gradient asymmetry, each with task- and workflow-conditional surfaces.

### 5.1 Same-Role Parallelism Amplifies Role-Level Gradients Under Isolated-Policy

When a workflow contains $N>1$ same-role agents that see the same or related prompts and receive the same outcome reward, and those agents are routed through a single per-role isolated policy (Voting IP, Orch-Workers IP), the role’s parameters receive $N$ gradient samples per training step whose advantage signs co-vary. Because the $N$ samples share task identity and a sign of advantage, the role’s gradient updates are coherent across the $N$ copies. The role’s effective per-step update is amplified relative to a single-agent baseline at the same scale and task, and the role drifts faster than the base policy can absorb. We label this mechanism gradient\_amplification. Two manifestations appear in our matrix, distinguished by the workflow that creates the same-role parallelism.
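The effect of coherence can be illustrated with a toy calculation: $N$ updates that point in the same direction move the parameters a distance that grows linearly in $N$, while $N$ mutually orthogonal updates of the same size move them only as $\sqrt{N}$. The vectors below are made-up three-dimensional stand-ins for per-slot updates, not measurements from the paper, and the function name is hypothetical.

```python
import math

def net_displacement(updates):
    """Euclidean norm of the sum of a list of equally sized parameter updates."""
    total = [sum(components) for components in zip(*updates)]
    return math.sqrt(sum(x * x for x in total))

# N = 3 same-role slots, each contributing an update of norm 5.
aligned = [[3.0, 4.0, 0.0]] * 3              # coherent: same direction
orthogonal = [[5.0, 0.0, 0.0],
              [0.0, 5.0, 0.0],
              [0.0, 0.0, 5.0]]               # incoherent: mutually orthogonal

print(net_displacement(aligned))     # 15.0, i.e. N times the single-update norm
print(net_displacement(orthogonal))  # about 8.66, i.e. sqrt(N) times the norm
```

Each individual update has the same norm in both cases, so the difference appears only in the accumulated movement of the role's parameters, not in any single-step magnitude.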

Voting workflow. The parallelism falls on the generator role. Voting-IP-1.7B-Math decomposes cleanly along role identity: the generator’s $\chi^{2}$ and training perplexity climb sharply over training while the aggregator’s stay near their first logged values, and validation accuracy descends shortly after the generator’s $\chi^{2}$ peak (Fig. [6](https://arxiv.org/html/2605.24202#S4.F6) panel (a)).

Orch-Workers workflow. The parallelism falls on the worker role. At fixed workflow, task, and scale, Orch-Workers-IP-1.7B-Math peaks higher than Orch-Workers-SP-1.7B-Math and falls farther, while the SP run peaks lower and stays near its peak through the terminal step (Fig. [6](https://arxiv.org/html/2605.24202#S4.F6) panel (c)). The IP curve rises and descends; the SP curve rises and plateaus. §[5.2](https://arxiv.org/html/2605.24202#S5.SS2) takes up the SP shape.

Per-role peak-over-first-step ratios for both manifestations are tabulated in Table [2](https://arxiv.org/html/2605.24202#S5.T2); trajectory-level signatures are tabulated in Table [4](https://arxiv.org/html/2605.24202#A1.T4), with the full per-cell breakdown in §[A.3](https://arxiv.org/html/2605.24202#A1.SS3).

Table 2: Per-role training-dynamics signatures on matched IP-vs-SP cells. Each row is one role under IP or the shared policy under SP; the three ratio columns are peak-over-first-step ratios for token-level $\chi^{2}$, training perplexity, and adapter gradient norm, and the rightmost column is the step at which the $\chi^{2}$ ratio peaks. SP cells aggregate across roles into a single shared-policy row; multi-slot roles (3 voters, 3 workers) are logged as a single per-role aggregate. The grad-norm column reports the gradient magnitude of a single optimizer step, before those steps accumulate across an iteration; gradient\_amplification adds more same-role updates per iteration, each in a similar direction, rather than enlarging any single update, so the mechanism shows up in the $\chi^{2}$ and perplexity columns rather than in grad-norm.
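A peak-over-first-step ratio of this kind can be computed from a logged metric series as in the following sketch. The function name and the example series are hypothetical, and the returned index counts logged entries, which matches training steps only if the metric is logged at every step.

```python
def peak_over_first(series):
    """Return (peak / first logged value, index of the peak) for a metric series."""
    first = series[0]
    peak_index = max(range(len(series)), key=lambda i: series[i])
    return series[peak_index] / first, peak_index

# Hypothetical per-role chi-squared series for one role.
ratio, step = peak_over_first([2.0, 3.0, 8.0, 4.0])
print(ratio, step)  # 4.0 2
```

A ratio near 1 indicates a role whose diagnostics stayed close to their starting values, while a large ratio flags a role that drifted during training.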
### 5\.2 Shared\-Policy Training Redistributes Role\-Level Gradients

Sharing parameters across roles redirects the gradient pressures that drive role\-level drift toward a different shared\-policy direction\. The unifying pattern across our Shared\-Policy cells is *shared\-policy capture by the dominant role*: when one role contributes more, or more distinctive, per\-step gradient mass than the others, the shared policy shifts toward the dominant role’s distribution, and the captured role’s slot starts producing dominant\-role\-typical outputs at evaluation time\. We label this mechanism sp\_role\_capture\. Two manifestations appear in our matrix, distinguished by the source of the per\-step gradient asymmetry\.

Token\-distribution asymmetry\. One role emits longer or more distinctive token sequences than the others and so contributes the dominant share of per\-step gradient mass; the captured slot drifts toward the dominant role’s idiom\. Eval\-Opt\-SP\-0\.6B\-Code surfaces as code\-like emission in the evaluator slot, where the base model’s code prior compounds the asymmetry\. Eval\-Opt\-SP\-1\.7B\-Math surfaces as long\-form re\-solve in the evaluator slot, where the Math reward landscape tolerates length inflation as the surface\. Voting\-SP\-4B\-Math surfaces as aggregator\-slot drift: the aggregator’s terse\-stamp slot is captured by the generators’ long\-form idiom, with aggregator emitted length climbing sharply at the validation\-cliff step while voter\-side training metrics remain near their first logged values\.

Per\-episode frequency asymmetry\. One role occupies $N > 1$ episode slots against fewer slots from other roles\. Orch\-Workers\-SP\-4B\-Math surfaces as worker\-shape capture under the 3 workers, 1 orchestrator, 1 synthesizer episode structure, manifesting as global training\-side amplitude escalation alongside a terminal\-step descent of validation accuracy \(Fig\. [6](https://arxiv.org/html/2605.24202#S4.F6), panel \(b\)\)\.
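
Both asymmetries can be illustrated with a rough token-count proxy for each role's share of per-step gradient mass on a shared adapter. This is a sketch under a strong assumption: that a role's contribution scales with (episode slots) × (mean emitted tokens). The true share also depends on loss aggregation and advantage values, which the proxy ignores. The function name and the token counts are hypothetical.

```python
from typing import Dict


def role_mass_share(slots: Dict[str, int],
                    mean_tokens: Dict[str, float]) -> Dict[str, float]:
    """Proxy share of per-step gradient mass per role: slots * mean tokens, normalized."""
    mass = {role: slots[role] * mean_tokens[role] for role in slots}
    total = sum(mass.values())
    if total <= 0:
        raise ValueError("total token mass must be positive")
    return {role: m / total for role, m in mass.items()}


# Per-episode frequency asymmetry: equal lengths, but 3 of 5 slots are workers.
frequency_shares = role_mass_share(
    slots={"worker": 3, "orchestrator": 1, "synthesizer": 1},
    mean_tokens={"worker": 100.0, "orchestrator": 100.0, "synthesizer": 100.0},
)

# Token-distribution asymmetry: one slot each, but one role emits 3x the tokens.
token_shares = role_mass_share(
    slots={"generator": 1, "evaluator": 1},
    mean_tokens={"generator": 300.0, "evaluator": 100.0},
)
```

In the first case the worker role holds a 0.6 share purely through slot count; in the second the longer-emitting role holds 0.75 purely through length.
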

Trajectory\-level signatures for the token\-distribution cells are tabulated in Table [5](https://arxiv.org/html/2605.24202#A1.T5) \(Panels A1, A2, A3\); the per\-episode\-frequency cell is tabulated in Table [6](https://arxiv.org/html/2605.24202#A1.T6)\. The full per\-cell breakdown is in §[A\.4](https://arxiv.org/html/2605.24202#A1.SS4)\.

## 6 Conclusion

We asked when multi\-agent RL training improves LLM workflows over their base models, and what governs the stability of the training trajectories that produce those improvements\. The empirical answer is two\-sided\. Multi\-agent RL training usually does improve workflow performance over the base model, but the trajectories themselves are unstable in workflow\-, scale\-, task\-, and policy\-sharing\-dependent ways\. Isolated\-Policy often reaches higher peaks, yet suffers terminal degradation cliffs late in training\. Shared\-Policy redistributes the underlying drift into shared\-policy capture by the dominant role, surfacing as code emission in the evaluator slot under Eval\-Opt Code, length inflation in the evaluator slot under Eval\-Opt Math, aggregator\-slot drift under Voting Math, and worker\-shape capture under Orch\-Workers Math\.

The strongest of these patterns are explained by role\-level gradient dynamics created jointly by workflow topology and policy routing: same\-role gradient amplification under Isolated\-Policy \(gradient\_amplification\), and shared\-policy capture by the dominant role under Shared\-Policy \(sp\_role\_capture\)\. SP and IP route training pressure through different channels, each with its own characteristic failure surface\. The empirical landscape, together with the gradient mechanisms behind it, reframes SP training from a default safety knob into an auditable design choice that practitioners should select with workflow topology, role multiplicity, and task fit explicitly in view\. In practice, this means selecting the policy\-sharing strategy at the workflow level and monitoring per\-role drift signatures rather than aggregate accuracy alone\. The bounds on these claims, including the LoRA substrate, the outcome\-reward setting, and the single\-seed\-per\-cell design, are stated in §[A\.1](https://arxiv.org/html/2605.24202#A1.SS1)\.

## References

- End\-to\-end optimization of LLM\-driven multi\-agent search systems via heterogeneous\-group\-based reinforcement learning\. arXiv preprint arXiv:2506\.02718\.
- L\. Feng, L\. Zheng, S\. He, F\. Zhang, and B\. An \(2026\) Dr\. MAS: stable reinforcement learning for multi\-agent LLM systems\. arXiv preprint arXiv:2602\.08847\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, et al\. \(2025\) Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\. arXiv preprint arXiv:2501\.12948\.
- H\. Hong, J\. Yin, Y\. Wang, J\. Liu, Z\. Chen, A\. Yu, J\. Li, et al\. \(2025\) Multi\-agent deep research: training multi\-agent systems with M\-GRPO\. arXiv preprint arXiv:2511\.13288\.
- B\. Jin, H\. Zeng, Z\. Yue, J\. Yoon, S\. O\. Arik, D\. Wang, H\. Zamani, and J\. Han \(2025\) Search\-r1: training llms to reason and leverage search engines with reinforcement learning\. arXiv preprint arXiv:2503\.09516\.
- J\. Liao, M\. Wen, J\. Wang, and W\. Zhang \(2025\) MARFT: multi\-agent reinforcement fine\-tuning\. arXiv preprint arXiv:2504\.16129\.
- S\. Liu, T\. Chen, Z\. Liang, X\. Lyu, and C\. Amato \(2025\) LLM collaboration with multi\-agent reinforcement learning\. arXiv preprint arXiv:2508\.04652\.
- L\. Long, Z\. Zhou, J\. Hao, J\. Liu, Y\. Miao, W\. Pang, X\. Tan, W\. Chu, Z\. Wang, S\. Pan, C\. Qu, and Y\. Qi \(2025\) The choice of divergence: a neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward\. arXiv preprint arXiv:2509\.07430\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2509.07430)\.
- M\. Luo, S\. Tan, R\. Huang, A\. Patel, A\. Ariyak, Q\. Wu, X\. Shi, R\. Xin, C\. Cai, M\. Weber, et al\. \(2025\) Deepcoder: a fully open\-source 14b coder at o3\-mini level\. Notion Blog 1\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, et al\. \(2023\) Self\-refine: iterative refinement with self\-feedback\. Advances in Neural Information Processing Systems\.
- J\. Schulman and T\. M\. Lab \(2025\) LoRA without regret\. Thinking Machines Lab: Connectionism\. Note: https://thinkingmachines\.ai/blog/lora/ External Links: [Document](https://dx.doi.org/10.64434/tml.20250929)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu, et al\. \(2024\) Deepseekmath: pushing the limits of mathematical reasoning in open language models\. arXiv preprint arXiv:2402\.03300\.
- F\. Wang, H\. Cui, L\. Yang, C\. S\. Lee, Z\. Li, and C\. Wen \(2026\) Detecting and repairing role drift in multi\-agent collaboration with lightweight protocols\. In 2026 International Conference on Communication Networks and Machine Learning \(CNML\)\. External Links: [Document](https://dx.doi.org/10.1109/CNML68938.2026.11453032)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\) Self\-consistency improves chain of thought reasoning in language models\. Proceedings of the International Conference on Learning Representations\.
- Z\. Wang, K\. Wang, Q\. Wang, P\. Zhang, L\. Li, Z\. Yang, X\. Jin, K\. Yu, M\. N\. Nguyen, L\. Liu, E\. Gottlieb, Y\. Lu, K\. Cho, J\. Wu, L\. Fei\-Fei, L\. Wang, Y\. Choi, and M\. Li \(2025\) RAGEN: understanding self\-evolution in llm agents via multi\-turn reinforcement learning\. arXiv preprint arXiv:2504\.20073\.
- Z\. Wei, W\. Yao, Y\. Liu, W\. Zhang, Q\. Lu, L\. Qiu, C\. Yu, P\. Xu, C\. Zhang, B\. Yin, H\. Yun, and L\. Li \(2025\) WebAgent\-r1: training web agents via end\-to\-end multi\-turn reinforcement learning\. arXiv preprint arXiv:2505\.16421\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu, et al\. \(2023\) AutoGen: enabling next\-gen LLM applications via multi\-agent conversation\. arXiv preprint arXiv:2308\.08155\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Wang, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\.
- W\. Yang and J\. Thomason \(2025\) Learning to deliberate: meta\-policy collaboration for agentic LLMs with multi\-agent reinforcement learning\. arXiv preprint arXiv:2509\.03817\.
- Y\. Yang, C\. Qu, M\. Wen, L\. Shi, Y\. Wen, W\. Zhang, A\. Wierman, and S\. Gu \(2026\) Understanding agent scaling in LLM\-based multi\-agent systems via diversity\. arXiv preprint arXiv:2602\.03794\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, T\. Fan, G\. Liu, L\. Liu, X\. Liu, et al\. \(2025\) DAPO: an open\-source LLM reinforcement learning system at scale\. arXiv preprint arXiv:2503\.14476\.
- Z\. Zhang, X\. Li, Y\. Lin, H\. Liu, R\. Chandradevan, L\. Wu, M\. Lin, F\. Wang, X\. Tang, Q\. He, and S\. Wang \(2025\) Unlocking the power of multi\-agent LLM for reasoning: from lazy agents to deliberation\. arXiv preprint arXiv:2511\.02303\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2511.02303)\.
- A\. Zhao, Y\. Wu, Y\. Yue, T\. Wu, Q\. Xu, M\. Lin, S\. Wang, Q\. Wu, Z\. Zheng, and G\. Huang \(2025a\) Absolute zero: reinforced self\-play reasoning with zero data\. arXiv preprint arXiv:2505\.03335\.
- Y\. Zhao, L\. Hu, Y\. Wang, M\. Hou, H\. Zhang, K\. Ding, and J\. Zhao \(2025b\) Stronger\-mas: multi\-agent reinforcement learning for collaborative llms\. arXiv preprint arXiv:2510\.11062\.

## Appendix A

This appendix collects the design implications, monitoring recommendations, and limitations that follow from the empirical and mechanistic findings in the body of the paper, together with the implementation specifics needed to reproduce the experiments\. §[A\.1](https://arxiv.org/html/2605.24202#A1.SS1) states the limitations of the study; §[A\.2](https://arxiv.org/html/2605.24202#A1.SS2) describes the single\-agent reinforcement\-learning baseline used in Table [1](https://arxiv.org/html/2605.24202#S4.T1); §[A\.3](https://arxiv.org/html/2605.24202#A1.SS3) reports trajectory\-level signatures for the two IP cells named in §[5\.1](https://arxiv.org/html/2605.24202#S5.SS1); §[A\.4](https://arxiv.org/html/2605.24202#A1.SS4) reports trajectory\-level signatures for the SP cells named in §[5\.2](https://arxiv.org/html/2605.24202#S5.SS2); §[A\.5](https://arxiv.org/html/2605.24202#A1.SS5) gives the training hyperparameters; §[A\.6](https://arxiv.org/html/2605.24202#A1.SS6) discusses design implications and monitoring recommendations; §[A\.7](https://arxiv.org/html/2605.24202#A1.SS7) states the reward functions and the per\-role advantage assignment; and §[A\.8](https://arxiv.org/html/2605.24202#A1.SS8) summarizes compute\.

### A\.1Limitations

Several limitations bound the claims in this paper\.

LoRA substrate\. All training runs use LoRA adapters, a parameter\-efficient setup that has been reported to match full fine\-tuning on RL workloads at the scales we use \(Schulman and Lab, [2025](https://arxiv.org/html/2605.24202#bib.bib6)\)\. The cross\-role gradient mechanisms in §[5](https://arxiv.org/html/2605.24202#S5) are therefore not specific to LoRA, and the same gradient pressures would act under full\-parameter training\. One axis our analysis does not characterize is how adapter capacity interacts with the base\-model role prior, which would require LoRA\-rank or full\-parameter ablations that we leave to future work\.

Outcome\-reward setting only\. All experiments use outcome reward, scoring only the final answer\. We do not study process\-reward workflows in which intermediate role outputs receive their own reward signal\. Process rewards change the reward geometry on each role and may suppress, amplify, or recompose the patterns we report\.

Single seed per cell\. Each workflow $\times$ task $\times$ scale $\times$ policy combination is trained once with a fixed seed\. The cross\-cell consistency of Isolated\-Policy\-vs\-Shared\-Policy cliff signatures across the matched cells in Tables [1](https://arxiv.org/html/2605.24202#S4.T1) and [2](https://arxiv.org/html/2605.24202#S5.T2), and in Fig\. [5](https://arxiv.org/html/2605.24202#S4.F5), is the substitute for repeated seeds, not equivalent to them; cell\-level effect sizes should be read with this in mind\.

### A\.2 Single\-Agent RL Baseline Details

The SA\-RL column in Table [1](https://arxiv.org/html/2605.24202#S4.T1) reports a single\-agent reinforcement\-learning baseline at matched task and scale\. The baseline uses the same base model, the same task, the same training hyperparameters listed in Table [7](https://arxiv.org/html/2605.24202#A1.T7), and the same total training step budget as the multi\-agent runs at that scale\. The single\-agent rollout uses one generator role only: a single LoRA adapter is trained on rollouts from a single\-role workflow whose prompt template matches the multi\-agent generator prompt for the same task\. No evaluator, aggregator, orchestrator, worker, or synthesizer role is run, and no inter\-role context is available; the only role\-conditional input is the generator prompt template applied to the problem\.

The baseline is therefore matched to the multi\-agent runs in everything except \(a\) the workflow topology, which is single\-role rather than multi\-role; and \(b\) the policy\-routing strategy, which is a single adapter rather than Isolated\-Policy or Shared\-Policy\. This design lets the residual columns in Table [1](https://arxiv.org/html/2605.24202#S4.T1) \(Isolated\-Policy and Shared\-Policy minus SA\-RL\) isolate the share of the multi\-agent run’s accuracy that is attributable to multi\-agent training, separate from the share already produced by single\-agent reinforcement learning at the same task and scale\.

As a diagnostic on the same SA\-RL adapter, we additionally deploy the 1\.7B adapter only on the generator role inside each multi\-agent workflow; the evaluator, aggregator, orchestrator, worker, and synthesizer roles fall through to the base model\. Table [3](https://arxiv.org/html/2605.24202#A1.T3) reports validation accuracy across the three multi\-agent workflows on both tasks\. The SA\-RL column reproduces the matched single\-agent value from Table [1](https://arxiv.org/html/2605.24202#S4.T1) so the workflow effect on a generator\-only RL adapter can be read off directly\.

Table 3: Single\-agent generator transfer at 1\.7B\. The SA\-RL adapter is applied only on the generator role inside each multi\-agent workflow; supervisor roles use the base model\. Validation accuracy \(%\) on dapo\_math \(1412 problems\) and Code \(1000 problems\), $n_{\text{rollouts}} = 1$, training\-time length caps\. The SA\-RL column reproduces the matched value from Table [1](https://arxiv.org/html/2605.24202#S4.T1)\.

### A\.3 Trajectory\-level signatures for the gradient\_amplification cells in §[5\.1](https://arxiv.org/html/2605.24202#S5.SS1)

This subsection collects trajectory\-level signatures for the two IP cells named in §[5\.1](https://arxiv.org/html/2605.24202#S5.SS1)\. Each cell is sampled at three trajectory checkpoints under dapo\_math, with 100 problems per checkpoint\.

Table [4](https://arxiv.org/html/2605.24202#A1.T4) Panel C1 reports Voting\-IP\-1\.7B\-Math, where the same\-role parallelism falls on the generator role\. Panel C2 reports Orch\-Workers\-IP\-1\.7B\-Math, where the parallelism falls on the worker role and uses the same columns as Table [6](https://arxiv.org/html/2605.24202#A1.T6)\.

Panel C2 and Table [6](https://arxiv.org/html/2605.24202#A1.T6) sit at different scales \(1\.7B vs 4B\); the scale\-matched IP\-vs\-SP validation\-accuracy contrast for Orch\-Workers on Math is in Fig\. [6](https://arxiv.org/html/2605.24202#S4.F6), panel \(c\)\.

Table 4: Per\-role trajectory\-level signatures of gradient amplification on the two IP cells named in §[5\.1](https://arxiv.org/html/2605.24202#S5.SS1)\. Panel C1 \(Voting\-IP\-1\.7B\-Math\) concentrates the same\-role parallelism on the generator role\. The generator block reports the generator’s mean completion length, the rate at which the rollout hits the 5120\-token cap \(finish\_reason==length\), the rate at which the rollout contains a parseable \\boxed\{\}, the hedging\-phrase rate \(defined in §[A\.4](https://arxiv.org/html/2605.24202#A1.SS4)\), and pairwise inter\-generator 3\-gram Jaccard averaged within a problem and then across problems\. The aggregator column reports the aggregator’s median completion length\. Panel C2 \(Orch\-Workers\-IP\-1\.7B\-Math\) concentrates the parallelism on the worker role and uses the same columns as Table [6](https://arxiv.org/html/2605.24202#A1.T6)\.

Panel C1\. Voting\-IP\-1\.7B\-Math \(dapo\_math, 100 problems / step, 3 generators / 1 aggregator per problem\)\.

Panel C2\. Orch\-Workers\-IP\-1\.7B\-Math \(dapo\_math, 100 problems / step, 3 workers / 1 orch / 1 synth per problem\)\.

### A\.4 Trajectory\-level signatures for the sp\_role\_capture cells in §[5\.2](https://arxiv.org/html/2605.24202#S5.SS2)

This subsection collects trajectory\-level signatures for the SP cells named in §[5\.2](https://arxiv.org/html/2605.24202#S5.SS2)\. Each cell is sampled at three trajectory checkpoints under dapo\_math \(Math\) or deepcoder\_primeintellect \(Code\), with 100 problems per checkpoint\. Throughout the signature tables in this appendix, the hedging\-phrase rate is the rate at which a rollout contains a phrase from the fixed list: wait, alternatively, actually, hmm, let me reconsider, on second thought, not correct, or this is wrong\.
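
The hedging-phrase rate can be sketched directly from the fixed list above. The matching rule here is an assumption, since the text does not specify one: this sketch uses case-insensitive whole-word matching, and the function names are hypothetical.

```python
import re
from typing import Sequence

# Fixed phrase list from the definition above.
HEDGING_PHRASES = (
    "wait", "alternatively", "actually", "hmm",
    "let me reconsider", "on second thought", "not correct", "this is wrong",
)

# Assumption: whole-word, case-insensitive matching.
_HEDGE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in HEDGING_PHRASES) + r")\b",
    re.IGNORECASE,
)


def contains_hedge(rollout: str) -> bool:
    """True if the rollout contains at least one listed hedging phrase."""
    return _HEDGE_RE.search(rollout) is not None


def hedging_phrase_rate(rollouts: Sequence[str]) -> float:
    """Fraction of rollouts that contain at least one hedging phrase."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    return sum(contains_hedge(r) for r in rollouts) / len(rollouts)
```

The rate counts rollouts, not phrase occurrences, so a rollout with ten hedges contributes the same as a rollout with one.
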

Table [5](https://arxiv.org/html/2605.24202#A1.T5) collects the three token\-distribution\-asymmetry cells\. Panels A1 and A2 report the two Eval\-Opt SP cells, where the per\-role gradient asymmetry surfaces as the evaluator emitting a Python solution block on Code \(Panel A1\) or growing a long re\-solve derivation on Math \(Panel A2\)\. In both cases the evaluator contributes the dominant share of per\-step gradient mass on the shared policy and pulls the captured slot toward the dominant role’s idiom\. Panel A3 reports the Voting SP cell Voting\-SP\-4B\-Math, which fires the same token\-distribution\-asymmetry surface on the aggregator slot: the role designed to emit a terse \\boxed\{N\} selection stamp migrates over training toward verbose justification text\.

Token\-level training metrics on the shared policy remain near their first logged values on the Voting cell because the cross\-role anchor against the voter slots suppresses the training\-side signature, while the trajectory\-level signature on the aggregator slot accumulates over training\. The slot tasked with scoring the work starts producing the work itself, identical in shape to the long\-form re\-solve manifestation on Eval\-Opt\-SP\-1\.7B\-Math \(Panel A2\)\.

Table [6](https://arxiv.org/html/2605.24202#A1.T6) reports the Orch\-Workers SP cell, where the per\-role gradient asymmetry instead surfaces as a per\-episode frequency asymmetry: the worker role occupies three of the five episode slots and so contributes the dominant share of per\-step gradient mass without any per\-rollout token\-distribution asymmetry\.

Table 5: Per\-role trajectory\-level signatures of token\-distribution asymmetry on the three SP cells named in §[5\.2](https://arxiv.org/html/2605.24202#S5.SS2)\. Panel A1 \(Eval\-Opt\-SP\-0\.6B\-Code\) classifies the evaluator’s iter\-1 output into python\_code\_fence \(a Python solution block of $\geq 3$ body lines inside a fenced block\), bare\_stamp \(a $\leq 200$\-token \\boxed\{Correct/Incorrect\} verdict with no fenced block\), or other, and reports the parsed \\boxed\{Correct\}\-versus\-unparseable rates for the verdict slot alongside the iter\-1 generator truncation rate against the 2048\-token cap\. Panel A2 \(Eval\-Opt\-SP\-1\.7B\-Math\) reports evaluator and generator token\-length percentiles and truncation rates against the 5120\-token cap at three checkpoints; the generator’s \\boxed\{\} retention rate is the rate at which the iter\-1 generator response contains a parseable \\boxed\{\}, and the evaluator’s verdict\-tag retention rate is the rate at which the evaluator output contains a parseable \\boxed\{Correct/Incorrect\}\. Panel A3 \(Voting\-SP\-4B\-Math\) reports aggregator and voter token\-length percentiles; the terse rate is the fraction of aggregator turns at $\leq 30$ tokens containing a parseable \\boxed\{\}, and voter columns are averages across the three voter slots per problem\.

Panel A1\. Eval\-Opt\-SP\-0\.6B\-Code \(deepcoder\_primeintellect, 100 problems / step\)\.

Panel A2\. Eval\-Opt\-SP\-1\.7B\-Math \(dapo\_math, 100 problems / step\)\.

Panel A3\. Voting\-SP\-4B\-Math \(dapo\_math, 100 problems / step; 3 voters and 1 aggregator per problem\)\.
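
The three\-way classification described for Panel A1 can be sketched as below. Several details are assumptions rather than the paper's implementation: tokens are approximated by whitespace\-separated words, blank lines are not counted as body lines, a fence may carry an optional `python` or `py` tag, and the function name is hypothetical.

```python
import re

FENCE = "`" * 3  # a literal triple backtick, built here to keep this example readable

# A fenced block with an optional python/py language tag; group 1 is the body.
_FENCE_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)
_VERDICT_RE = re.compile(r"\\boxed\{(?:Correct|Incorrect)\}")


def classify_evaluator_output(text: str, max_stamp_tokens: int = 200) -> str:
    """Classify an evaluator output as python_code_fence, bare_stamp, or other."""
    for body in _FENCE_RE.findall(text):
        body_lines = [line for line in body.splitlines() if line.strip()]
        if len(body_lines) >= 3:
            return "python_code_fence"
    has_fence = FENCE in text
    n_tokens = len(text.split())  # whitespace tokens stand in for tokenizer tokens
    if not has_fence and n_tokens <= max_stamp_tokens and _VERDICT_RE.search(text):
        return "bare_stamp"
    return "other"
```

Tracking the share of `python_code_fence` outputs across checkpoints is one way to see the evaluator slot move from issuing verdicts to writing solutions.
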

Table 6: Per\-role trajectory\-level signatures of worker\-shape capture on Orch\-Workers\-SP\-4B\-Math, named in §[5\.2](https://arxiv.org/html/2605.24202#S5.SS2)\. Each row is one trajectory checkpoint with 100 dapo\_math problems; each problem yields three worker rollouts, one orchestrator rollout, and one synthesizer rollout\. The worker block reports the worker rollout’s mean completion length, the rate at which the rollout hits the 5120\-token cap \(finish\_reason==length\), the rate at which the rollout contains a parseable \\boxed\{\}, the hedging\-phrase rate as defined in §[A\.4](https://arxiv.org/html/2605.24202#A1.SS4), and pairwise inter\-worker 3\-gram Jaccard averaged across the three worker rollouts within a problem and then across problems\. The orchestrator block reports the count of unique 3\-word openers across the 100 episodes \(a strategy\-diversity proxy\)\. The synthesizer block reports the p50 and p95 completion length of the synthesizer rollout; the gap between the two columns captures the bimodalization of synthesizer length\.

Cell: Orch\-Workers\-SP\-4B\-Math \(dapo\_math, 100 problems / step, 3 workers / 1 orch / 1 synth per problem\)\.
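
The pairwise 3\-gram Jaccard column can be sketched as follows. This is a minimal version under one assumption: 3\-grams are taken over whitespace\-separated words, since the caption does not say whether word or tokenizer n\-grams are used. All function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 3) -> Set[NGram]:
    """Set of word n-grams in a rollout (empty if the rollout has fewer than n words)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[NGram], b: Set[NGram]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(rollouts: Sequence[str], n: int = 3) -> float:
    """Average Jaccard over all unordered pairs of same-role rollouts for one problem."""
    grams = [ngrams(r, n) for r in rollouts]
    pairs = list(combinations(grams, 2))
    if not pairs:
        raise ValueError("need at least two rollouts per problem")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def inter_rollout_jaccard(problems: Sequence[Sequence[str]], n: int = 3) -> float:
    """Average within each problem first, then across problems."""
    per_problem: List[float] = [mean_pairwise_jaccard(r, n) for r in problems]
    return sum(per_problem) / len(per_problem)
```

A value drifting toward 1.0 across checkpoints would indicate that the three same\-role rollouts for a problem are converging on near\-identical text.
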

### A\.5 Hyperparameters

Table [7](https://arxiv.org/html/2605.24202#A1.T7) lists the training hyperparameters used across every cell of the workflow $\times$ scale $\times$ task $\times$ policy matrix\. Values that vary with workflow, task, or policy routing are reported with their per\-cell setting; the remaining values are fixed across the matrix\. The base model is Qwen3 \(Qwen3\-0\.6B, Qwen3\-1\.7B, or Qwen3\-4B\); LoRA adapters are attached to every linear module of the base model\. Isolated\-Policy attaches one adapter per role type \(so the three voting generators share a single generator adapter, and the three Orchestrator\-Workers workers share a single worker adapter\)\. Shared\-Policy attaches one adapter shared across every role in the workflow\.

Table 7: Training hyperparameters used across the workflow $\times$ scale $\times$ task $\times$ policy matrix\. Values listed without per\-cell variation are fixed across the matrix\. Where a value depends on the workflow, task, or policy\-routing strategy, the per\-cell setting is given\.

| Group | Hyperparameter | Value |
| --- | --- | --- |
| Optimization | Optimizer | AdamW |
| Optimization | Learning rate | $2 \times 10^{-5}$ |
| Optimization | Warmup style | cosine |
| Optimization | Warmup steps | 15 |
| Optimization | Gradient clipping | 1\.0 |
| Optimization | PPO clip ratio \(high\) | 0\.28 |
| LoRA adapter | Rank $r$ | 64 |
| LoRA adapter | Alpha $\alpha$ | 32 |
| LoRA adapter | Dropout | 0\.0 |
| LoRA adapter | Target modules | all linear modules |
| GRPO | Algorithm | GRPO \(no explicit KL\) |
| GRPO | Group size $n$ | 8 |
| GRPO | Advantage normalization | group\-relative |
| GRPO | PPO mini\-batch size | 64 |
| GRPO | PPO epochs per update | 1 |
| GRPO | Loss aggregation | sequence\-mean, token\-mean |
| Batching | Train batch size \(problems / step\) | 64 |
| Batching | Validation batch size | 2048 |
| Batching | Rollout temperature | 0\.7 |
| Sequence length | Eval\-Opt, Math \(prompt / response\) | 30720 / 5120 |
| Sequence length | Eval\-Opt, Code \(prompt / response\) | 10240 / 2048 |
| Sequence length | Voting, Math \(prompt / response\) | 20480 / 5120 |
| Sequence length | Voting, Code \(prompt / response\) | 10240 / 2048 |
| Sequence length | Orch\-Workers, Math \(prompt / response\) | 20480 / 5120 |
| Sequence length | Orch\-Workers, Code \(prompt / response\) | 10240 / 2048 |
| Workflow caps | Eval\-Opt revision rounds, Math | 3 |
| Workflow caps | Eval\-Opt revision rounds, Code | 2 |
| Workflow caps | Voting candidate generations per problem | 3 |
| Workflow caps | Orch\-Workers worker proposals per problem | 3 |
| Training schedule | Total training steps | 500 |
| Training schedule | Validation interval \(steps\) | 10 |
| Training schedule | Checkpoint interval \(steps\) | 5 |
| Policy routing | Isolated\-Policy | one LoRA adapter per role type |
| Policy routing | Shared\-Policy | one shared LoRA adapter across all roles |
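
For readers wiring these values into a training script, part of Table 7 can be captured as a small configuration sketch. The class and function names are hypothetical and do not come from the paper's code. The `scaling` property assumes the standard LoRA convention in which the low\-rank update is multiplied by alpha / rank; other variants scale differently.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class LoraSettings:
    """LoRA adapter values from Table 7 (hypothetical container)."""
    rank: int = 64
    alpha: int = 32
    dropout: float = 0.0

    @property
    def scaling(self) -> float:
        # Assumption: standard LoRA scaling factor alpha / rank.
        return self.alpha / self.rank


# (workflow, task) -> (max prompt tokens, max response tokens), copied from Table 7.
SEQUENCE_CAPS: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("Eval-Opt", "Math"): (30720, 5120),
    ("Eval-Opt", "Code"): (10240, 2048),
    ("Voting", "Math"): (20480, 5120),
    ("Voting", "Code"): (10240, 2048),
    ("Orch-Workers", "Math"): (20480, 5120),
    ("Orch-Workers", "Code"): (10240, 2048),
}


def sequence_caps(workflow: str, task: str) -> Tuple[int, int]:
    """Look up the per-cell prompt and response length caps."""
    return SEQUENCE_CAPS[(workflow, task)]
```

Under the stated scaling assumption, rank 64 with alpha 32 gives a factor of 0.5.
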

### A\.6 Discussion

#### A\.6\.1 Design Implications

The choice between Shared\-Policy and Isolated\-Policy training is workflow\- and task\-conditional\. Each routes training pressure through different channels, and the right choice depends on the workflow, the task, and which role within the workflow is most fragile\.

Isolated\-Policy training is the appropriate choice when role specialization is itself valuable and when the role that carries same\-role multiplicity within an episode is not the role most prone to collapse\. In that regime, IP preserves role\-distinguishing parameters and lets each role pursue its own reward geometry\. When the role with same\-role multiplicity is also the one most prone to collapse, however, parallel same\-role rollouts on the same prompt can drive a coherent gradient direction on that role and accelerate its drift\. We therefore recommend auditing which role within a workflow carries multiplicity before defaulting to IP training\.

Shared\-Policy training is the appropriate choice when cross\-role parameter coupling is acceptable, that is, when a single policy answering for multiple roles does not violate task semantics\. Shared parameters can suppress the same\-role amplification that IP training admits, but they introduce their own failure surface\. Under workflows with asymmetric per\-step gradient contribution across roles, the shared policy can be captured on the dominant role’s slot, expressed as the dominant role’s distribution leaking into the captured role’s outputs\. Under Voting workflows, this surfaces as capture of the aggregator’s terse\-selection slot by the voters’ long\-form idiom, with the captured\-mode signature visible in the aggregator’s emitted\-length distribution while voter\-side training metrics remain near their first logged values\.

Across the workflow, scale, task, and policy\-sharing matrix studied here, workflow choice and task fit account for more variance in training stability than the policy\-sharing axis does\. Policy sharing is one auditable design choice among several; the larger structural choices \(which workflow, which task, which roles to compose\) should be made first\.

#### A\.6\.2 Monitoring Recommendations

Aggregate metrics miss role drift\. Final accuracy and global entropy can remain healthy while a single role’s parameters drift, or while the shared policy is captured by a dominant role on the captured role’s slot\. Practitioners training multi\-agent RL workflows should therefore monitor three signals beyond aggregate accuracy, each of which fires on a specific failure route from §[5](https://arxiv.org/html/2605.24202#S5)\.

1. Per\-role training metrics, not their aggregate\. Per\-role perplexity, gradient norm, KL divergence to the reference distribution, and token\-level concentration each fire on a different drift signature, and the role with same\-role multiplicity is the one to watch most closely\. Sharp per\-role perplexity rise on one role while aggregate accuracy is stable is the training\-side fingerprint of gradient\_amplification under IP training\.
2. Per\-role trajectory inspection at training\-relevant checkpoints\. Aggregate scores cannot distinguish a role that has lost its template from one that is solving the task differently\. Reading role outputs at a small number of well\-chosen steps surfaces the qualitative signatures of sp\_role\_capture on its token\-distribution surfaces \(code emission in the evaluator slot under Code, length\-inflated re\-solving in the evaluator slot under Math\) and of cross\-role bleed\-over more generally; no scalar metric reports them\.
3. Per\-role response shape on the aggregator slot of Voting workflows\. The aggregator slot is designed to emit a terse selection stamp over the voter responses; a Voting SP cell whose aggregator slot migrates toward verbose justification text shaped like the voter responses is one in which the shared policy has been captured by the dominant voter idiom\. The diagnostic is the aggregator’s emitted\-length distribution \(terse\-rate, p50 and p95 of emitted characters\) tracked across training, contrasted against the per\-role response length on the voter slots\. Mean accuracy hides this until the validation cliff appears\.

### A\.7 Reward Functions

##### Math\.

A rollout’s outcome reward is computed by parsing the final answer in the trajectory’s terminal \\boxed\{\} expression and comparing it to the ground\-truth final answer recorded in the DAPO\-Math\-17K dataset\. A correctly parsed and matching answer receives a reward of 1; a parsed but incorrect answer receives a reward of 0; a missing or malformed boxed answer receives a small format\-error penalty\.
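
A minimal sketch of this reward is shown below. Two things are assumptions: the size of the format\-error penalty, which the text only calls "small" (the `-0.1` default is a placeholder), and the answer comparison, which here is an exact string match after whitespace removal rather than any symbolic equivalence check. The function names are hypothetical.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                answer = "".join(chars).strip()
                return answer or None  # an empty box counts as malformed
        chars.append(ch)
    return None  # braces never closed: malformed


def math_reward(trajectory: str, ground_truth: str,
                format_penalty: float = -0.1) -> float:
    """1.0 for a matching answer, 0.0 for a wrong one, a penalty if unparseable."""
    answer = extract_last_boxed(trajectory)
    if answer is None:
        return format_penalty  # placeholder value, not the paper's
    normalize = lambda s: "".join(s.split())
    return 1.0 if normalize(answer) == normalize(ground_truth) else 0.0
```

Separating the parsing failure from the wrong\-answer case is what lets a format penalty be distinguished from a zero reward in the logs.
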

##### Code\.

A rollout’s outcome reward is the unit\-test pass rate of the trajectory’s terminal Python code block evaluated against the problem’s hidden test suite\. Code that fails to parse, fails to compile, or hits the per\-test execution timeout receives a reward of 0; otherwise the reward is the fraction of tests that pass\.
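
The scoring rule can be sketched without a sandbox by taking per\-test outcomes as input. This is an illustration under assumptions: the syntax check uses Python's `ast.parse` as a stand\-in for "fails to parse or compile", any single timeout is read as zeroing the whole reward, and an empty test suite scores 0. Running untrusted code safely is out of scope here, and the names are hypothetical.

```python
import ast
from enum import Enum
from typing import Sequence


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    TIMEOUT = "timeout"


def code_reward(source: str, outcomes: Sequence[Outcome]) -> float:
    """Unit-test pass rate, or 0.0 if the code does not parse or any test timed out."""
    try:
        ast.parse(source)  # syntax check only; does not execute the code
    except (SyntaxError, ValueError):
        return 0.0
    # Assumption: one timeout, or an empty suite, yields zero reward.
    if not outcomes or Outcome.TIMEOUT in outcomes:
        return 0.0
    return sum(o is Outcome.PASS for o in outcomes) / len(outcomes)
```

Unlike the Math reward, this one is fractional, so partially correct programs receive partial credit.
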

##### Per\-role advantage assignment\.

A single outcome reward is computed for the full workflow rollout, and that reward is propagated uniformly to every token emitted by every role in the rollout when computing per\-role advantages\. No per\-step or per\-role process reward is used in any of the main experiments\. Within each role’s update, gradients are accumulated only over tokens emitted by that role; tokens emitted by other roles in the same rollout are masked out of the role’s loss\. Under Shared\-Policy, all roles’ tokens contribute to the gradient of a single shared adapter; under Isolated\-Policy, each role’s tokens contribute only to the gradient of its own adapter\.
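
The routing described above can be sketched in two steps: a group\-relative advantage per rollout, then per\-adapter token weights. This is a simplified illustration, not the paper's implementation. It assumes the common GRPO normalization (reward minus group mean, divided by group standard deviation, here the population form with a small epsilon), and it represents masking as a zero weight on tokens outside the adapter's role. All names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's outcome reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_adapter_weights(token_roles: Sequence[str], advantage: float,
                        policy: str) -> Dict[str, List[float]]:
    """Per-token loss weights for each adapter under SP or IP routing.

    token_roles[i] is the role that emitted token i of one workflow rollout.
    Every token carries the same rollout-level advantage; masked tokens get 0.0.
    """
    if policy == "SP":
        # One shared adapter receives gradient from every role's tokens.
        return {"shared": [advantage] * len(token_roles)}
    if policy == "IP":
        roles = dict.fromkeys(token_roles)  # unique roles, first-seen order
        return {
            role: [advantage if r == role else 0.0 for r in token_roles]
            for role in roles
        }
    raise ValueError("policy must be 'SP' or 'IP'")
```

Under IP, several same\-role rollouts in one episode all land on the same adapter, which is the setting in which the same\-role amplification discussed in §5\.1 arises.
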

### A\.8 Compute

All training runs use a single compute node with two GPUs and a frozen base model with LoRA adapters; the base model parameters are not updated, so memory and bandwidth are dominated by adapter states, optimizer state for the adapters, and the vLLM rollout cache\.

Our training and rollout stack holds multiple LoRA adapters resident in GPU memory simultaneously and selects the active adapter per role at the granularity of individual rollout requests and per\-step gradient updates\. Under IP, each role binds to its own adapter; under SP, all roles share a single adapter\. Adapter selection is a low\-overhead pointer swap rather than a re\-load, and the adapter rank is small relative to the frozen base, so the additional adapters used by IP add a negligible increment to GPU memory, optimizer state, and step time\. As a result, IP and SP runs have effectively identical wall\-clock and memory cost at every scale we study, and the IP versus SP comparison is not confounded by parameter count or compute budget\.

Training uses Fully Sharded Data Parallel \(FSDP\) sharding across the two GPUs for the base model and adapter parameters, and a vLLM\-backed rollout engine for asynchronous generation; rollout and training share the GPUs and run alternately\. Most runs use one of two GPU classes: an H100 \(80 GB\) two\-GPU node, used predominantly for the 1\.7B and 4B cells, and an L40s \(48 GB\) two\-GPU node, used predominantly for the 0\.6B cells and as a fallback for 1\.7B cells\. The maximum number of tokens permitted in a single PPO microbatch on each GPU is set per \(scale, GPU class\) pair so that long Math rollouts fit on the smaller\-memory class without out\-of\-memory errors\. Total compute across the matrix is approximately 235 days of aggregate wall\-clock, as reported by the wandb *Total compute* field summed over all runs\.