Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems

arXiv cs.CL 05/18/26, 04:00 AM Papers
Summary
Introduces Nexa, a trainable response-conditioned policy that combines parallel and sequential execution in multi-agent systems, using a lightweight transformer to predict sparse communication graphs, improving accuracy while minimizing latency.
arXiv:2605.15573v1 Announce Type: new Abstract: Multi-agent systems can solve complex tasks through collaboration between multiple Large Language Model agents. Existing collaboration frameworks typically operate in either a parallel or a sequential mode. In the parallel mode, agents respond independently to queries followed by aggregation of responses. In contrast, sequential systems allow agents to communicate via a directed topology and refine one another step by step. However, both modes are inadequate for achieving the desired objectives of minimizing communication and latency while simultaneously maximizing the accuracy of the final response. In this work, we introduce a hybrid paradigm called Nexa, a trainable response-conditioned policy that bridges the gap between the two modes. Nexa begins with a parallel execution stage, embeds the resulting responses into a shared semantic space, and then predicts a sparse directed acyclic communication graph. If the graph is empty, the system remains purely parallel; if it is non-empty, the system performs one sequential message propagation. The policy is a lightweight transformer model, and the method avoids the need for external LLM judges or reward models, as well as hand-crafted test-time topology search. We formalize this hybrid execution problem, show that the resulting graph is acyclic by construction, and that the framework strictly subsumes pure parallel execution, and present a training procedure based on policy-gradient optimization. Results demonstrate that the response-conditioned policy learned by Nexa under one setting can be reused when the number of agents, the task, or the underlying agent changes, thus emphasizing the generalizability of the learned communication policy.
Cached at: 05/18/26, 06:32 AM
# Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems
Source: [https://arxiv.org/html/2605.15573](https://arxiv.org/html/2605.15573)
Nurbek Tastan (1,2), Alex Iacob (2,3), Lorenzo Sani (2,3), Meghdad Kurmanji (2), Nicholas D. Lane (2,3), Samuel Horváth (1), Karthik Nandakumar (1,4)

(1) MBZUAI, UAE; (2) University of Cambridge, UK; (3) Flower Labs, UK; (4) Michigan State University, USA

## 1 Introduction

Large language models (LLMs) have become increasingly capable at reasoning, coding, planning, and dialogue, yet a single model still suffers from stochastic failures, brittle long-horizon reasoning, and occasional hallucinations. Multi-agent systems aim to address these weaknesses by distributing problem solving across multiple agents whose outputs can complement, critique, or refine one another. The central question for such systems is *how* that collaboration should be orchestrated.

Existing LLM-based multi-agent systems largely fall into two categories. In *parallel* systems, agents answer independently, and their outputs are combined by majority voting, self-consistency, or a learned aggregation rule (Wang et al., [2023](https://arxiv.org/html/2605.15573#bib.bib185); Jiang et al., [2023](https://arxiv.org/html/2605.15573#bib.bib159)). In *sequential* systems, agents are arranged in a communication topology, often a chain, tree, or a more general graph, and information is propagated step by step (Zhuge et al., [2024](https://arxiv.org/html/2605.15573#bib.bib2); Qian et al., [2025](https://arxiv.org/html/2605.15573#bib.bib11)). Parallel systems are simple and scalable, but they are computationally expensive, token-intensive, and often redundant: they may require multiple rounds of parallel message propagation while still being unable to exploit targeted communication when one draft could help repair another. Sequential systems can support error correction and information flow, but they require a topology and therefore inherit the burden of deciding who should communicate with whom. Prior work has explored fixed topologies, policy-gradient optimization over edges, graph generators conditioned on tasks or roles, and judge-based routing, each of which either adds substantial token, compute, optimization, or coordination overhead, or reduces transferability across settings (Qian et al., [2025](https://arxiv.org/html/2605.15573#bib.bib11); Zhuge et al., [2024](https://arxiv.org/html/2605.15573#bib.bib2); Zhang et al., [2025b](https://arxiv.org/html/2605.15573#bib.bib4)).

These two paradigms are often treated as separate design choices. A system is built either as a parallel ensemble or as a sequential graph-based collaboration mechanism. Yet this distinction is too rigid. In many realistic settings, the right approach is not to commit in advance to a single paradigm, but to start in parallel and then decide, based on the agents' actual outputs, whether sequential propagation is necessary. If the initial responses already contain strong agreement and sufficient information, additional communication may be unnecessary. If they disagree in informative ways, or if useful signals are scattered across agents, then structured propagation may help. This suggests that the real problem is not "parallel or sequential" in the abstract, but rather:

Given the current pool of agent responses, should the system remain in the parallel regime, or should it instantiate a communication graph and perform sequential refinement?

To answer this question, we introduce Nexa (from "nexus", a connection or link), a trainable policy for *communication graph prediction* in multi-agent LLM systems. Nexa begins with a parallel draft stage in which all agents answer independently. The resulting response pool is embedded into a shared semantic space, producing a compact representation of the current response state of the team. A lightweight transformer-based policy then predicts a sparse directed acyclic graph (DAG). If the graph is empty, the system remains in the parallel regime and returns the parallel aggregate. If the graph is non-empty, the system executes one sequential consolidation pass in which selected agents update their responses using information from upstream nodes.

This formulation deliberately treats parallel and sequential execution not as mutually exclusive system designs, but as two outcomes of the same learned policy. In this sense, the central contribution of Nexa is not merely graph prediction. It is a mechanism for *bridging the gap between parallel execution and sequential execution* by using parallel drafts to decide whether structured propagation is needed and, if so, how it should proceed.

A second principle of the method is simplicity. We do *not* learn the topological order. Instead, we induce the order from agent contributions, retaining the most stable organizing principle of response-conditioned communication. The policy learns only the communication edges. We score the candidate communication edges using the affinity matrix formed from transformer-contextualized response representations. This makes the policy lightweight and keeps the graph decoder tightly coupled to the semantic interactions encoded by the backbone.

Nexa is also designed to be agnostic to superficial configuration details. The policy consumes semantic representations of agent outputs rather than role labels, agent identities, or model-family indicators. As a consequence, the planner is structurally insensitive to which agent is called "Programmer" or "Assistant"; what matters is what the agents actually say. This does *not* by itself guarantee transfer across all tasks or backbones, and we explicitly treat that as an empirical question. But it does mean that the policy class is not intrinsically tied to a fixed role inventory or a single team structure.

The paper makes four contributions. First, it formalizes a hybrid decision problem in which a learned communication graph determines whether a multi-agent system remains parallel or enters a sequential propagation regime. Second, it proposes a contribution-ordered, attention-based graph policy that predicts only the communication edges, keeping the controller simple and acyclic by construction. Third, it integrates the key theoretical properties of the method directly into the formulation: DAG validity, hybrid subsumption, and permutation-based identity agnosticism. Fourth, it empirically evaluates Nexa across reasoning and programming tasks, showing improved accuracy-cost tradeoffs, sparse communication behavior, and transfer across agent counts, tasks, model scales, and generations.

## 2 Problem Formulation and Preliminaries

Let $\mathcal{A} = \{\mathcal{A}_1, \ldots, \mathcal{A}_N\}$ be a set of $N$ agents, and let $\mathcal{Q}$ be a user query. Each agent may differ in prompt, role, or backbone model, but the communication policy introduced in this work does not rely on these identities explicitly. Instead, it operates on the semantic content of the agents' responses.

Given the query, each agent independently produces an initial response

$$\mathcal{R}_n^{(0)} = \mathcal{A}_n(\mathcal{Q}), \qquad n \in \{1, 2, \ldots, N\}. \tag{1}$$

The first phase is fully parallel and produces a draft response set $\mathcal{R}^{(0)} = \{\mathcal{R}_1^{(0)}, \ldots, \mathcal{R}_N^{(0)}\}$.

The purpose of this parallel draft stage is twofold. First, it provides diverse candidate solutions to the query. Second, and more importantly for our setting, it exposes the *current response state* of the multi-agent system. Since LLM outputs are inherently stochastic, this realized state is more informative for downstream coordination than static task labels or role descriptions. This response-conditioned perspective is central to the present work and follows the same foundational motivation that underlies SelfOrg (Tastan et al., [2026](https://arxiv.org/html/2605.15573#bib.bib1)).

To reason about relations among agent outputs, we map each response into a shared semantic embedding space using a fixed lightweight encoder $f$ (all-MiniLM-L6-v2 (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.15573#bib.bib20))): $r_n = f(\mathcal{R}_n^{(0)}) \in \mathbb{R}^d$.

Following SelfOrg (Tastan et al., [2026](https://arxiv.org/html/2605.15573#bib.bib1)), we define the average response embedding $r_{\text{avg}} = \frac{1}{N} \sum_{n=1}^{N} r_n$ and contribution scores $\psi_n = \cos(r_n, r_{\text{avg}})$. SelfOrg motivates $\psi_n$ as a linear-time approximation to a Shapley-style contribution value (Shapley, [1953](https://arxiv.org/html/2605.15573#bib.bib24)) and shows that, under suitable separation conditions, ranking by $\psi_n$ preserves the normalized Shapley ordering. This is precisely why Nexa uses contribution to define the topological ordering of the nodes, which in turn fixes the direction of every feasible edge.
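The contribution score is simple enough to sketch directly. The snippet below is a minimal illustration in plain Python, assuming the embeddings $r_n$ are already available as lists of floats; the toy two-dimensional vectors and the function names are hypothetical and stand in for real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contribution_scores(embeddings):
    """psi_n = cos(r_n, r_avg), where r_avg is the mean response embedding."""
    n = len(embeddings)
    dim = len(embeddings[0])
    r_avg = [sum(r[k] for r in embeddings) / n for k in range(dim)]
    return [cosine(r, r_avg) for r in embeddings]

# Toy embeddings (hypothetical): agents 0 and 1 roughly agree, agent 2 is an outlier.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
psi = contribution_scores(toy)
```

In this toy pool the outlier receives the lowest score, so it would be placed last in a contribution-based ordering.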

The orchestration problem is to predict a directed communication graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \pi)$, where $\mathcal{V} = \{1, \ldots, N\}$, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$, and $\pi$ is an order over the nodes. If the graph is empty ($\mathcal{E} = \varnothing$), the system stays in the parallel regime and outputs an aggregate of the initial drafts. If it is non-empty ($\mathcal{E} \neq \varnothing$), the graph induces a sequential propagation step.

For each node $n$, define its parent set $\operatorname{Pa}(n) = \{m : (m \to n) \in \mathcal{E}\}$. Then the updated response is

$$\mathcal{R}_n^{(1)} = \begin{cases} \mathcal{A}_n\big(\mathcal{Q}, \{\mathcal{R}_m^{(\star)} : m \in \operatorname{Pa}(n)\}\big), & \operatorname{Pa}(n) \neq \varnothing, \\ \mathcal{R}_n^{(0)}, & \operatorname{Pa}(n) = \varnothing, \end{cases} \tag{2}$$

where $\mathcal{R}_m^{(\star)}$ denotes the most recent available parent response under the topological execution order.

The final answer is selected from the resulting response pool using a judge-free aggregation rule. Let $\{z_n\}$ be the final response embeddings and $\{w_n\}$ their contribution weights. We compute

$$z_{\text{centroid}} = \frac{\sum_{n=1}^{N} w_n z_n}{\sum_{n=1}^{N} w_n}, \qquad n^{\star} = \arg\max_{n} \cos(z_n, z_{\text{centroid}}), \tag{3}$$

and return the response of agent $n^{\star}$.

The learning objective is to maximize final task correctness. Given ground truth $y$ and final prediction $\hat{y}_{\mathcal{G}}$ under graph $\mathcal{G}$, the reward is

$$R(\mathcal{G}) = \mathbb{1}\left[\operatorname{Eval}(\hat{y}_{\mathcal{G}}, y) = 1\right]. \tag{4}$$

The policy, therefore, learns to predict a communication graph that determines whether the initial parallel responses should remain as they are or be further refined through structured propagation.

## 3 Methodology

### 3.1 System Overview

Nexa consists of five stages. First, all agents produce draft responses in parallel. Second, those responses are embedded into a shared semantic space. Third, a response-conditioned transformer policy predicts a sparse communication graph. Fourth, if the graph is non-empty, the corresponding destination nodes are updated sequentially. Fifth, the final answer is selected from the resulting response pool by weighted-centroid-based aggregation.

This design has a central conceptual consequence: parallel execution is not discarded when sequential communication is introduced. Instead, the parallel draft stage becomes the *source of evidence* that determines whether the system should remain in the parallel regime or transition into a sequential propagation regime.

### 3.2 Contribution-Defined Order and DAG Validity

We set the topological order to the permutation that sorts agents by decreasing contribution: $\pi = \operatorname{argsort}(\psi_1, \dots, \psi_N)$ such that $\psi_{\pi(k)} \geq \psi_{\pi(k+1)}$ for all $k \in [N-1]$.

In other words, higher-contribution agents are always placed earlier in the communication order. The feasible edge set is therefore restricted to

$$\mathcal{E}_{\pi} = \{(m, n) : \pi^{-1}(m) < \pi^{-1}(n)\}, \tag{5}$$

so that communication is only allowed to move forward under the contribution order.

###### Proposition 1 (Acyclicity by construction).

For any edge set $\mathcal{E} \subseteq \mathcal{E}_{\pi}$, the graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \pi)$ is a directed acyclic graph.

###### Proof.

Assume for contradiction that $\mathcal{G}$ contains a directed cycle

$$v_1 \to v_2 \to \cdots \to v_K \to v_1. \tag{6}$$

Because every edge must go forward under $\pi$, we must simultaneously have

$$\pi^{-1}(v_1) < \pi^{-1}(v_2) < \cdots < \pi^{-1}(v_K) < \pi^{-1}(v_1), \tag{7}$$

which is impossible. Hence, no directed cycle can exist. ∎

This parameterization is simpler than detecting and repairing cycles after graph prediction because DAG validity is built directly into the action space of the policy.
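The construction can be checked on a small example. The following sketch, in plain Python with hypothetical contribution scores, builds the contribution order, enumerates the feasible forward edges, and uses Kahn's algorithm (a standard acyclicity test, not part of the method itself) to confirm that forward edges never form a cycle while a single backward edge does. Tie-breaking by index is an assumption made here for determinism.

```python
def contribution_order(psi):
    """Agent indices sorted by descending contribution (ties broken by index)."""
    return sorted(range(len(psi)), key=lambda n: (-psi[n], n))

def feasible_edges(order):
    """All pairs (m, n) where m precedes n in the order, i.e. the set E_pi."""
    return [(order[i], order[j])
            for i in range(len(order))
            for j in range(i + 1, len(order))]

def is_acyclic(num_nodes, edges):
    """Kahn's algorithm: True if the directed graph contains no cycle."""
    indegree = [0] * num_nodes
    children = [[] for _ in range(num_nodes)]
    for m, n in edges:
        children[m].append(n)
        indegree[n] += 1
    frontier = [v for v in range(num_nodes) if indegree[v] == 0]
    visited = 0
    while frontier:
        v = frontier.pop()
        visited += 1
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:
                frontier.append(c)
    return visited == num_nodes

psi = [0.87, 0.92, 0.50, 0.75]      # hypothetical contribution scores
order = contribution_order(psi)      # [1, 0, 3, 2]
edges = feasible_edges(order)        # N(N-1)/2 = 6 forward edges
```

Any subset of `edges` passes the acyclicity check, whereas adding a backward edge such as `(2, 1)` closes a cycle.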

### 3.3 Response-Conditioned Graph Policy

The graph policy consumes only the current response set, not agent identities, role labels, or model-family indicators. Let

$$\mathcal{X} = [r_1, \dots, r_N]^{\top} \in \mathbb{R}^{N \times d}. \tag{8}$$

A transformer encoder (Vaswani et al., [2017](https://arxiv.org/html/2605.15573#bib.bib188)) $\operatorname{Enc}_{\theta}$ maps the response set to contextualized node states

$$\mathcal{H} = \operatorname{Enc}_{\theta}(\mathcal{X}) = [h_1, \dots, h_N]^{\top}, \qquad h_n \in \mathbb{R}^{d_h}. \tag{9}$$

Because the encoder operates on the response embeddings without identity-specific tokens, the policy is permutation-equivariant over the agent dimension.

###### Proposition 2 (Permutation-based identity agnosticism).

Assume that the response encoder $f$ is applied independently to each response and that $\operatorname{Enc}_{\theta}$ is permutation-equivariant. Then, for any permutation matrix $P$,

$$\operatorname{Enc}_{\theta}(P\mathcal{X}) = P \operatorname{Enc}_{\theta}(\mathcal{X}), \tag{10}$$

and consequently the induced graph distribution is equivariant to any relabeling of agents.

###### Proof.

Since $\operatorname{Enc}_{\theta}$ (a transformer) without positional encodings is permutation-equivariant, permuting agent indices permutes $\mathcal{X}$ and thus $\operatorname{Enc}_{\theta}(\mathcal{X})$; the remaining steps (cosine-to-mean scoring, ordering, and edge construction) are permutation-consistent, so the graph distribution is equivariant. ∎
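The equivariance property in Eq. (10) can be verified numerically on a toy layer. The sketch below assumes a drastically simplified encoder: one self-attention head with identity query, key, and value projections and no positional encoding. It is not the trained policy network, only an illustration of why relabeling agents merely reorders the rows of the output.

```python
import math

def self_attention(X):
    """Single-head self-attention with identity projections, no positional encoding."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        top = max(scores)                       # subtract max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) / total
                    for j in range(d)])
    return out

X = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]        # toy response embeddings
perm = [2, 0, 1]                                 # a relabeling of the agents
H = self_attention(X)
H_perm = self_attention([X[i] for i in perm])

# Row k of the permuted output should equal row perm[k] of the original output.
max_gap = max(abs(a - b)
              for i, row in zip(perm, H_perm)
              for a, b in zip(H[i], row))
```

The gap is zero up to floating-point rounding, because attention weights depend only on the set of rows, not on their positions.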

We then predict communication edges directly from the globally contextualized hidden states. Concretely, we form a response-response score matrix from the contextualized states:

$$\Lambda = \mathcal{H}\mathcal{H}^{\top}. \tag{11}$$

Here, $\Lambda$ is not the final adjacency matrix; it provides edge logits that are passed through a sigmoid and then sampled to obtain the communication graph. This construction is deliberate. The hidden states in $\mathcal{H}$ are already globally contextualized, so the resulting edge logits are informed by the entire response set rather than by isolated response pairs. In this way, the communication graph is read directly from the shared semantic structure induced by the encoder.
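A minimal sketch of this decoding step follows, under two assumptions that the text does not spell out: each feasible edge is sampled as an independent Bernoulli variable, and infeasible (backward) pairs are simply skipped. Note that $\mathcal{H}\mathcal{H}^{\top}$ is symmetric, so in this sketch the direction of an edge comes entirely from the contribution order. The hidden states and the helper name `sample_forward_edges` are hypothetical.

```python
import math
import random

def sample_forward_edges(H, order, rng):
    """Sample edges m -> n only for pairs where m precedes n in the order."""
    position = {agent: rank for rank, agent in enumerate(order)}
    edges, probs = [], {}
    for m in range(len(H)):
        for n in range(len(H)):
            if position[m] < position[n]:                        # feasible forward edge
                logit = sum(a * b for a, b in zip(H[m], H[n]))   # Lambda[m][n]
                p = 1.0 / (1.0 + math.exp(-logit))               # sigmoid
                probs[(m, n)] = p
                if rng.random() < p:                             # Bernoulli sample
                    edges.append((m, n))
    return edges, probs

H = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.3]]   # toy contextualized states
order = [1, 0, 2]                             # hypothetical contribution order
edges, probs = sample_forward_edges(H, order, random.Random(0))
```

The returned `probs` dictionary covers exactly the feasible edge set, which is what a log-probability over the sampled graph needs.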

### 3.4 Response Propagation and Aggregation

Nexa does not require a separate stop network or node-activation network. The graph itself determines both whether sequential communication occurs and which nodes are updated. A node is updated if and only if it has at least one incoming edge:

$$u_n = \mathbb{1}\left[\sum_{m=1}^{N} \mathbb{1}[(m \to n) \in \mathcal{E}] > 0\right]. \tag{12}$$

If $\mathcal{E} = \varnothing$, then $u_n = 0$ for all nodes, no additional calls are made, and the system returns the parallel aggregate. If $\mathcal{E} \neq \varnothing$, the graph induces one sequential consolidation pass.

###### Proposition 3 (Hybrid subsumption).

The policy class of Nexa strictly subsumes the pure parallel regime.

###### Proof.

The empty graph $\mathcal{E} = \varnothing$ is always attainable: it is sampled with positive probability, and with high probability when all edge probabilities are small. In that case no updates occur, and the method reduces to pure parallel execution with aggregation. Any $\mathcal{E} \neq \varnothing$ induces at least one sequential update, so the policy class strictly contains the parallel regime. ∎

When $\mathcal{E} \neq \varnothing$, sequential propagation follows the contribution order $\pi$. For node $n$, define

$$\operatorname{Pa}(n) = \{m : (m \to n) \in \mathcal{E}\}. \tag{13}$$

The updated response is then

$$\mathcal{R}_n^{(1)} = \begin{cases} \mathcal{A}_n\big(\mathcal{Q}, \{\mathcal{R}_m^{(\star)} : m \in \operatorname{Pa}(n)\}\big), & \operatorname{Pa}(n) \neq \varnothing, \\ \mathcal{R}_n^{(0)}, & \operatorname{Pa}(n) = \varnothing, \end{cases} \tag{14}$$

where $\mathcal{R}_m^{(\star)}$ denotes the most recent available parent response under the topological execution order. Because all edges go forward under $\pi$, each parent response is available when a destination node is updated.
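The propagation pass in Eqs. (12) to (14) can be sketched as a single loop over the contribution order. The example below uses hypothetical stand-in agents that only record which parent messages they received; a real system would call an LLM at that point. The function name `propagate` and the string format are illustrative choices.

```python
def propagate(query, drafts, edges, order, agents):
    """One sequential pass: nodes with parents are updated in contribution order."""
    latest = list(drafts)                  # most recent response per agent
    calls = 0                              # extra agent calls, i.e. the sum of u_n
    for n in order:
        parents = [m for (m, dst) in edges if dst == n]
        if parents:                        # u_n = 1: at least one incoming edge
            latest[n] = agents[n](query, [latest[m] for m in parents])
            calls += 1
    return latest, calls

def make_agent(name):
    """Hypothetical stand-in for an LLM agent: echoes the parent messages it saw."""
    return lambda query, parent_responses: f"{name}<-[{' | '.join(parent_responses)}]"

agents = [make_agent(f"agent{n}") for n in range(3)]
drafts = ["draft0", "draft1", "draft2"]
order = [1, 0, 2]                          # hypothetical contribution order

final, extra_calls = propagate("q", drafts, [(1, 0), (0, 2)], order, agents)
parallel_only = propagate("q", drafts, [], order, agents)
```

In this toy run agent 2 receives the already revised response of agent 0, which illustrates the "most recent available parent response" convention, and the empty edge set leaves the drafts untouched with no extra calls.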

After either staying in the parallel regime or completing one propagation pass, Nexa selects the final answer without using an external judge. Let $\tilde{\mathcal{R}}_n$ denote the final candidate response for agent $n$ and let

$$z_n = f(\tilde{\mathcal{R}}_n), \qquad z_{\text{avg}} = \frac{1}{N} \sum_{n=1}^{N} z_n, \qquad w_n = \cos(z_n, z_{\text{avg}}). \tag{15}$$

We then compute the contribution-weighted centroid $z_{\text{centroid}} = \frac{\sum_{n=1}^{N} w_n z_n}{\sum_{n=1}^{N} w_n}$ and select

$$n^{\star} = \arg\max_{n} \cos(z_n, z_{\text{centroid}}), \qquad \hat{y} = \tilde{\mathcal{R}}_{n^{\star}}. \tag{16}$$

This aggregation rule directly inherits the response-conditioned, judge-free philosophy of SelfOrg (Tastan et al., [2026](https://arxiv.org/html/2605.15573#bib.bib1)).
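The selection rule in Eqs. (15) and (16) amounts to two rounds of cosine similarity. A minimal sketch in plain Python follows, assuming the final embeddings are given as lists of floats and that the weights sum to a nonzero value; the toy vectors and the name `select_final` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def select_final(embeddings):
    """Index of the response closest to the contribution-weighted centroid."""
    n, dim = len(embeddings), len(embeddings[0])
    z_avg = [sum(z[k] for z in embeddings) / n for k in range(dim)]
    weights = [cosine(z, z_avg) for z in embeddings]            # w_n
    total = sum(weights)                                        # assumed nonzero
    centroid = [sum(w * z[k] for w, z in zip(weights, embeddings)) / total
                for k in range(dim)]
    sims = [cosine(z, centroid) for z in embeddings]
    return max(range(n), key=lambda i: sims[i])                 # n_star

# Toy final embeddings (hypothetical): agents 0 and 1 agree, agent 2 is an outlier.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
winner = select_final(toy)
```

Here the response that sits between the agreeing majority and the outlier is the closest to the weighted centroid, so agent 1 is selected.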

### 3.5 Training Objective

The deployment objective is final task correctness. For a labeled example $(\mathcal{Q}, y)$, let $\hat{y}_{\mathcal{G}}$ denote the final output under graph $\mathcal{G}$. In the current implementation, correctness is checked with the same verifier used in evaluation, instantiated as an xVerify-based binary reward (Chen et al., [2025](https://arxiv.org/html/2605.15573#bib.bib183)). We therefore define the task reward

$$R_{\mathrm{task}}(\mathcal{G}) = \mathbb{1}\left[\operatorname{Eval}(\hat{y}_{\mathcal{G}}, y) = 1\right]. \tag{17}$$

Because the order $\pi$ is fixed by the contribution scores, the graph log-probability decomposes over feasible forward edges:

$$\log p_{\theta}(\mathcal{E} \mid \mathcal{X}, \pi) = \sum_{(m,n) \in \mathcal{E}_{\pi}} \Big( e_{m \to n} \log p_{m \to n} + (1 - e_{m \to n}) \log(1 - p_{m \to n}) \Big), \tag{18}$$

where $e_{m \to n} \in \{0, 1\}$ indicates whether edge $m \to n$ was sampled and $p_{m \to n}$ is its predicted probability. The algorithm also applies an explicit sparsity penalty to the sampled graph reward in the same spirit as topology-economical methods (Zhang et al., [2025a](https://arxiv.org/html/2605.15573#bib.bib3)). Let $M = |\mathcal{E}_{\pi}| = \frac{N(N-1)}{2}$ be the number of feasible forward edges under the contribution-defined order. For a sampled graph $\mathcal{G}$, we define the sparsity-regularized reward

$$R_{\mathrm{sp}}(\mathcal{G}) = R_{\mathrm{task}}(\mathcal{G}) - \lambda_{\mathrm{sp}} \frac{|\mathcal{E}|}{M}, \tag{19}$$

where $\lambda_{\mathrm{sp}} \geq 0$ controls how strongly dense communication graphs are penalized.
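To make the scale of the penalty concrete, the sketch below evaluates Eq. (19) for a few hypothetical outcomes with $N = 10$ agents and $\lambda_{\mathrm{sp}} = 0.1$. The edge counts are illustrative values, not measurements, and the function assumes $N \geq 2$ so that $M > 0$.

```python
def sparsity_reward(correct, num_edges, num_agents, lam=0.1):
    """R_sp = R_task - lam * |E| / M, with M = N(N-1)/2 feasible forward edges."""
    max_edges = num_agents * (num_agents - 1) // 2    # M; requires num_agents >= 2
    return float(correct) - lam * num_edges / max_edges

r_empty = sparsity_reward(True, 0, 10)         # correct answer, purely parallel
r_sparse = sparsity_reward(True, 9, 10)        # correct answer, 9 of 45 edges used
r_dense_wrong = sparsity_reward(False, 45, 10) # wrong answer, fully dense graph
```

Because the penalty is normalized by $M$, it never exceeds $\lambda_{\mathrm{sp}}$, so with $\lambda_{\mathrm{sp}} < 1$ a correct answer always earns more reward than an incorrect one regardless of graph density.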

We train Nexa with REINFORCE and a batch-mean baseline. For a mini-batch of sampled graphs $\{\mathcal{G}^{(i)}\}_{i=1}^{B}$, we set

$$b = \frac{1}{B} \sum_{i=1}^{B} R_{\mathrm{sp}}(\mathcal{G}^{(i)}), \qquad A^{(i)} = R_{\mathrm{sp}}(\mathcal{G}^{(i)}) - b. \tag{20}$$

The policy-gradient loss is

$$\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} A^{(i)} \log p_{\theta}(\mathcal{E}^{(i)} \mid \mathcal{X}^{(i)}, \pi^{(i)}). \tag{21}$$

This is the complete optimization objective: REINFORCE with a batch-mean advantage, with sparsity enforced through the edge-count penalty in the reward.
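Equations (18), (20), and (21) can be traced by hand on a two-graph batch. The sketch below only evaluates the loss value in plain Python; an actual training loop would compute it in an automatic-differentiation framework and treat the advantages as constants. The edge probabilities, sampled graphs, and rewards are hypothetical.

```python
import math

def graph_log_prob(probs, sampled_edges):
    """Eq. (18): Bernoulli log-likelihood summed over all feasible forward edges."""
    chosen = set(sampled_edges)
    return sum(math.log(p) if edge in chosen else math.log(1.0 - p)
               for edge, p in probs.items())

def reinforce_loss(rewards, log_probs):
    """Eqs. (20)-(21): batch-mean baseline, then advantage-weighted log-probs."""
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    loss = -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(rewards)
    return loss, advantages

# Hypothetical edge probabilities for N = 3 agents (M = 3 feasible edges).
probs = {(1, 0): 0.8, (1, 2): 0.2, (0, 2): 0.5}
lp_a = graph_log_prob(probs, [(1, 0)])   # graph A: one edge sampled
lp_b = graph_log_prob(probs, [])         # graph B: empty graph
rewards = [1.0 - 0.1 * 1 / 3, 0.0]       # A correct with one edge, B incorrect
loss, adv = reinforce_loss(rewards, [lp_a, lp_b])
```

The advantages sum to zero by construction, and minimizing the loss pushes up the log-probability of the above-average graph A relative to graph B.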

The full procedure is summarized in Algorithm [1](https://arxiv.org/html/2605.15573#alg1).

## 4 Experiments

The empirical study is designed to demonstrate the generalizability of the learned communication policy. Rather than only reporting in-domain performance for the training configuration, we evaluate whether a response-conditioned policy learned in one setting can be reused when the number of agents, the task, or the underlying agent changes, which also highlights the training efficiency of the approach.

### 4.1 Experimental Setup

The base training setting uses Qwen2.5-1.5B-Instruct agents (Qwen et al., [2025](https://arxiv.org/html/2605.15573#bib.bib15)) on AQUA-RAT (Ling et al., [2017](https://arxiv.org/html/2605.15573#bib.bib177)) and GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.15573#bib.bib178)) with $N = 10$ agents. Unless otherwise specified, the policy is trained with REINFORCE, a batch-mean baseline, batch size $32$, $50$ policy updates, learning rate $0.1$, dropout $0.3$, and edge-count sparsity coefficient $\lambda_{\mathrm{sp}} = 0.1$. The policy architecture is kept fixed: a one-layer, one-head transformer encoder followed by the $\mathcal{H}\mathcal{H}^{\top}$ edge construction described in Section [3](https://arxiv.org/html/2605.15573#S3).

As baselines, we consider a single-agent system, chain-of-thought (CoT) (Wei et al., [2022](https://arxiv.org/html/2605.15573#bib.bib18)), self-consistency (Wang et al., [2023](https://arxiv.org/html/2605.15573#bib.bib185)), SelfOrg⋆ (Tastan et al., [2026](https://arxiv.org/html/2605.15573#bib.bib1)), where SelfOrg⋆ indicates SelfOrg with a single sequential communication round, and topology-learning or pruning methods, including GPTSwarm (Zhuge et al., [2024](https://arxiv.org/html/2605.15573#bib.bib2)), AgentPrune (Zhang et al., [2025a](https://arxiv.org/html/2605.15573#bib.bib3)), and G-Designer (Zhang et al., [2025b](https://arxiv.org/html/2605.15573#bib.bib4)). While the primary metric is the accuracy of the final response, we also report the mean edge count and token usage as proxies for the communication burden and inference cost, respectively.

### 4.2 Main Results

Table [1](https://arxiv.org/html/2605.15573#S4.T1) reports the main comparison across AQUA-RAT (Ling et al., [2017](https://arxiv.org/html/2605.15573#bib.bib177)), HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.15573#bib.bib184)), and GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.15573#bib.bib178)). Nexa achieves the best average accuracy, $60.90\%$, improving over SelfOrg⋆ while also obtaining the best average rating. Its gains are strongest on AQUA-RAT and GSM8K while remaining competitive on HumanEval.

Table 1: Main comparison across AQUA-RAT, HumanEval, and GSM8K. Accuracy is reported as mean $\pm$ std over runs. Rating uses average rating only. Token usage is total prompt plus completion tokens. Lower average rating and lower token usage are better.

![Refer to caption](https://arxiv.org/html/2605.15573v1/x1.png)
Figure 1: Accuracy-cost tradeoff for multi-agent system baselines across three tasks. Each point corresponds to one method, with the x-axis showing total token usage including prompt and completion tokens and the y-axis showing mean accuracy.

The efficiency results are central to the comparison. Nexa uses $18.36$M total tokens, compared with $28.41$M for SelfOrg⋆, $38.02$M for GPTSwarm, $31.44$M for AgentPrune, and $31.80$M for G-Designer. It therefore reduces token usage by about $35\%$ relative to SelfOrg⋆ and by more than $50\%$ relative to GPTSwarm, while achieving the highest average accuracy. Figure [1](https://arxiv.org/html/2605.15573#S4.F1) makes this tradeoff explicit: in the average panel, Nexa occupies the favorable region of the accuracy-cost plane, indicating that its improvements are not simply the result of spending more tokens but of selectively invoking communication when the response pool warrants it.
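The quoted reductions follow directly from the token totals stated above. The short sketch below recomputes them; the dictionary keys are just labels for the methods named in the text.

```python
# Total token usage in millions, as quoted in the text.
tokens_m = {
    "Nexa": 18.36,
    "SelfOrg*": 28.41,
    "GPTSwarm": 38.02,
    "AgentPrune": 31.44,
    "G-Designer": 31.80,
}

# Relative reduction (in percent) achieved by Nexa against each baseline.
reduction = {
    name: 100.0 * (1.0 - tokens_m["Nexa"] / total)
    for name, total in tokens_m.items()
    if name != "Nexa"
}
```

The same arithmetic gives reductions of roughly 42% against both AgentPrune and G-Designer.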

### 4.3 Generalizability Across Different Axes

#### Number of agents.

![Refer to caption](https://arxiv.org/html/2605.15573v1/x2.png)
Figure 2: Agent-count transfer for Nexa. The policy is trained with $N = 10$ Qwen2.5-1.5B agents and evaluated without retraining at $N \in \{5, \ldots, 20\}$.

We first examine generalizability across the number of agents. Nexa is trained with $N = 10$ agents and evaluated without retraining for $N \in \{5, 10, 15, 20\}$, keeping the task and agent backbone fixed. This setting tests whether the learned graph policy behaves as a reusable response-conditioned rule rather than as a memorized topology for a fixed team size. As shown in Figure [2](https://arxiv.org/html/2605.15573#S4.F2), Nexa remains above the single-call and chain-of-thought baselines (Wei et al., [2022](https://arxiv.org/html/2605.15573#bib.bib18)) for all tested values of $N$ on both AQUA-RAT and GSM8K. Accuracy peaks at $N = 15$ for both tasks, suggesting that the policy can benefit from additional candidate responses beyond the training configuration while still remaining stable when the team size is smaller or larger than $N = 10$.

#### Task transfer.

We next consider generalizability across tasks while keeping the model family, model size, and training team size fixed. Nexa is trained with $N = 10$ Qwen2.5-1.5B agents on either AQUA-RAT or GSM8K, then evaluated without retraining on both tasks. Figure [3](https://arxiv.org/html/2605.15573#S4.F3) compares same-task training against cross-task training at two tested team sizes, $N = 5$ and $N = 20$. Across all four settings, the transfer gap remains small: $0.18$ and $0.14$ points on AQUA-RAT, and $0.08$ and $0.05$ points on GSM8K. This suggests that Nexa may learn a reusable response-conditioned rule rather than merely memorizing a task-specific pattern, although this requires further confirmation under more heterogeneous model families and agent pools.

![Refer to caption](https://arxiv.org/html/2605.15573v1/x3.png)
Figure 3: Task-transfer comparison for Nexa on Qwen2.5-1.5B.

#### Model scale generalizability.

We then evaluate whether the learned communication policy transfers across model scales. Nexa is trained using Qwen2.5-1.5B agents and evaluated without retraining on Qwen2.5-7B agents, then compared against a policy trained directly with Qwen2.5-7B agents. As shown in Figure [4](https://arxiv.org/html/2605.15573#S4.F4), the 1.5B-trained policy closely matches the 7B-trained policy on both tasks: 90.48 versus 90.52 on GSM8K, and 76.98 versus 77.40 on AQUA-RAT. This suggests that the learned graph policy is not tightly coupled to the competence level of the training backbone and can be reused when deployed with a stronger model.

![Refer to caption](https://arxiv.org/html/2605.15573v1/x4.png)
Figure 4: Model-scale transfer for Nexa.

#### Model generation transfer\.

Finally, we evaluate whether the learned communication policy remains usable when the underlying model is updated to a newer generation. Nexa trained on Qwen2.5-1.5B is evaluated without retraining on Qwen3.5-2B (Qwen Team, [2026](https://arxiv.org/html/2605.15573#bib.bib189)) and compared against a policy trained directly on Qwen3.5-2B. At $N=5$, the transferred policy reaches 77.40, compared with 77.73 for the target-generation policy. The resulting 0.17-point gap suggests that an existing policy can remain effective after a model upgrade, reducing the need to retrain the communication controller every time the base model is changed.

![Refer to caption](https://arxiv.org/html/2605.15573v1/x5.png)
Figure 5: Model-generation transfer for Nexa.

### 4.4 How Communication Changes Answers

We further analyze how Nexa changes answers after communication by decomposing each example according to whether the initial draft (parallel execution responses) and final answer (sequential execution responses) are correct. Figure [6](https://arxiv.org/html/2605.15573#S4.F6) reports rescue, harm, and preservation rates for Qwen2.5-7B agents on GSM8K. As the tested team size increases from $N=5$ to $N=20$, the rescue rate rises from 19.2% to 23.8%, showing that additional agents provide useful opportunities for correcting initially wrong answers. At the same time, harm remains low, between 1.6% and 2.5%, while preservation stays above 97.5% across all tested values of $N$. These results suggest that Nexa does not simply perturb answers through extra communication; it mostly preserves correct predictions while selectively improving initially incorrect ones.
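The three rates can be computed from per-example correctness flags. The sketch below is illustrative only and assumes that rescue is measured over initially wrong examples and that harm and preservation are measured over initially correct ones, so that harm and preservation sum to one, which matches the percentages reported here. The function name `transition_rates` and the boolean-list input format are hypothetical.

```python
from typing import Dict, Sequence


def transition_rates(initial_correct: Sequence[bool],
                     final_correct: Sequence[bool]) -> Dict[str, float]:
    """Rescue, harm, and preservation rates from before/after correctness flags."""
    if len(initial_correct) != len(final_correct):
        raise ValueError("both sequences must have one flag per example")
    pairs = list(zip(initial_correct, final_correct))
    # Final correctness of examples that started wrong / started right.
    started_wrong = [after for before, after in pairs if not before]
    started_right = [after for before, after in pairs if before]
    rescue = sum(started_wrong) / len(started_wrong) if started_wrong else 0.0
    preservation = sum(started_right) / len(started_right) if started_right else 0.0
    harm = 1.0 - preservation if started_right else 0.0
    return {"rescue": rescue, "harm": harm, "preservation": preservation}


# Toy data: two initially wrong examples (one fixed), four initially correct (one broken).
rates = transition_rates([False, False, True, True, True, True],
                         [True, False, True, True, True, False])
print(rates)
```

Conditioning each rate on the initial outcome keeps the numbers comparable across team sizes even when the share of initially correct drafts changes.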

Additional sparsity diagnostics in Appendix [D](https://arxiv.org/html/2605.15573#A4) show that Nexa often selects low-edge communication plans, indicating that the learned policy does not rely on dense all-to-all interaction as team size increases.

![Refer to caption](https://arxiv.org/html/2605.15573v1/x6.png)
Figure 6: Policy behavior analysis for Nexa with Qwen2.5-7B agents on GSM8K. Rescue, harm, and preservation rates compare initial draft correctness with final answer correctness after communication.

### 4.5 Ablations

#### Policy backbone\.

Table 2: Backbone ablation on GSM8K with Qwen2.5-1.5B agents. Accuracy is mean $\pm$ std. over three runs.

Nexa is not tied to a single policy-network backbone. Although our main implementation uses a Transformer to predict response-conditioned communication graphs, the same formulation can be instantiated with other graph-prediction architectures. As one example, we adapt the GNN architecture from G-Designer, originally used for agent-role-specific (and fixed-agent-number) design, to Nexa's response-conditioned communication graph prediction setting while keeping the rest of the training and communication procedure unchanged. Table [2](https://arxiv.org/html/2605.15573#S4.T2) shows that this GNN backbone closely matches the Transformer backbone on GSM8K with Qwen2.5-1.5B agents, suggesting that the core benefit comes from the Nexa formulation rather than a specific neural backbone.

#### Policy optimization\.

Table 3: Policy-optimization ablation on AQUA-RAT with $N=5$ Qwen2.5-1.5B agents.

We also compare the policy-gradient (PG) objective used in Nexa with a GRPO-style alternative. The ablation is conducted on AQUA-RAT with Qwen2.5-1.5B agents at $N=5$, considering both same-task training and cross-task transfer from GSM8K. As shown in Table [3](https://arxiv.org/html/2605.15573#S4.T3), PG slightly outperforms GRPO in both settings, with 57.74 versus 57.56 for AQUA$\rightarrow$AQUA and 57.56 versus 57.48 for GSM8K$\rightarrow$AQUA. The gaps are small, indicating that the learned communication policy is not highly sensitive to the choice of policy-optimization objective.

## 5 Related Works

#### LLM\-based multi\-agent collaboration\.

Multi-agent LLM systems have been studied as role-based societies, conversational workflows, and dynamically routed agent networks. CAMEL instantiates role-playing agents for cooperative problem solving (Li et al., [2023](https://arxiv.org/html/2605.15573#bib.bib16)), ChatDev organizes specialized agents into staged communicative workflows (Qian et al., [2024](https://arxiv.org/html/2605.15573#bib.bib17)), AutoGen provides a general framework for multi-agent conversations (Wu et al., [2024](https://arxiv.org/html/2605.15573#bib.bib6)), and AgentVerse studies collaborative behaviors across agent groups (Chen et al., [2024](https://arxiv.org/html/2605.15573#bib.bib29)). DyLAN adapts the active agent set during task solving (Liu et al., [2024](https://arxiv.org/html/2605.15573#bib.bib30)), while multi-agent debate methods use disagreement to improve reasoning or factuality (Du et al., [2023](https://arxiv.org/html/2605.15573#bib.bib26); Liang et al., [2024](https://arxiv.org/html/2605.15573#bib.bib28)). Multiagent finetuning further studies whether diverse reasoning chains can improve a base model through self-improvement (Subramaniam et al., [2025](https://arxiv.org/html/2605.15573#bib.bib27)). These systems show that collaboration can improve reasoning, but they typically require a chosen communication protocol, a task-specific workflow, or an explicit judging mechanism. Nexa instead begins with independent responses and learns whether any sequential communication should occur at all.

#### Communication topology and workflow design\.

Several recent methods treat agent orchestration as a graph or workflow optimization problem. GPTSwarm represents language agents as optimizable computational graphs (Zhuge et al., [2024](https://arxiv.org/html/2605.15573#bib.bib2)); AgentPrune removes unnecessary communication to reduce costs (Zhang et al., [2025a](https://arxiv.org/html/2605.15573#bib.bib3)); G-Designer learns communication topologies with graph neural networks (Zhang et al., [2025b](https://arxiv.org/html/2605.15573#bib.bib4)); and MacNet studies scaling laws for LLM-based multi-agent collaboration (Qian et al., [2025](https://arxiv.org/html/2605.15573#bib.bib11)). Related work also explores training LLMs to construct multi-agent systems (Ye et al., [2025c](https://arxiv.org/html/2605.15573#bib.bib9)), automated agentic workflow generation (Hu et al., [2025](https://arxiv.org/html/2605.15573#bib.bib31); Zhang et al., [2025c](https://arxiv.org/html/2605.15573#bib.bib32)), decentralized evolutionary coordination (Yang et al., [2025](https://arxiv.org/html/2605.15573#bib.bib186)), self-evolving agent profiles (Lu et al., [2024](https://arxiv.org/html/2605.15573#bib.bib187)), heterogeneous multi-agent systems (Ye et al., [2025b](https://arxiv.org/html/2605.15573#bib.bib8)), and unified experimental platforms for multi-agent evaluation (Ye et al., [2025a](https://arxiv.org/html/2605.15573#bib.bib10)). Nexa is closest in spirit to graph-based topology learning but differs in three ways: the graph is conditioned on the realized response pool rather than only on a task or role template; the empty graph is a valid decision corresponding to pure parallel execution; and sparsity is controlled directly through an edge-count penalty in the task reward.

#### Judge\-free aggregation and acyclicity\.

Response selection and ensemble fusion are often performed by majority voting, learned rankers, or generative fusion models such as LLM-Blender (Jiang et al., [2023](https://arxiv.org/html/2605.15573#bib.bib159)); other systems introduce credibility scores or adversary-resistant judges (Ebrahimi et al., [2025](https://arxiv.org/html/2605.15573#bib.bib7)). SelfOrg takes a different route by estimating response contribution from semantic embeddings and using that signal to organize communication without an external judge (Tastan et al., [2026](https://arxiv.org/html/2605.15573#bib.bib1)). Its contribution score is motivated by Shapley-style valuation (Shapley, [1953](https://arxiv.org/html/2605.15573#bib.bib24); Tastan et al., [2025](https://arxiv.org/html/2605.15573#bib.bib23)) and can be computed from sentence embeddings (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.15573#bib.bib20)). Nexa keeps this judge-free contribution ordering but replaces the stochastic self-organization rule with a trainable structure policy. The method therefore preserves the stable ordering principle of SelfOrg while learning the forward edges that determine whether and where refinement should happen.
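The role of forward edges in acyclicity can be spelled out with a short rank argument. The sketch below assumes, as the masked forward logits in Algorithm 1 suggest, that an edge $m \to n$ is admissible only when $m$ precedes $n$ in the contribution order $\pi$; the rank symbol $\rho$ is introduced here for illustration and is not notation from the paper.

```latex
% rho(n) denotes the position of agent n in the ordering pi (illustrative notation).
\begin{align*}
  &(m \to n) \in \mathcal{E} \;\Longrightarrow\; \rho(m) < \rho(n)
     && \text{(forward mask)} \\
  &n_1 \to n_2 \to \cdots \to n_k \to n_1
     \;\Longrightarrow\; \rho(n_1) < \rho(n_2) < \cdots < \rho(n_k) < \rho(n_1)
     && \text{(contradiction)} \\
  &0 \le |\mathcal{E}| \le \tfrac{N(N-1)}{2}, \qquad
     \mathcal{E} = \varnothing \text{ corresponds to pure parallel execution.}
\end{align*}
```

Because every edge strictly increases the rank, no directed cycle can exist regardless of which admissible edges the policy selects.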

## 6 Conclusion

We introduced Nexa, a response-conditioned policy that bridges parallel and sequential multi-agent execution by learning sparse acyclic communication graphs from initial agent drafts. The method remains lightweight and judge-free, can reduce to pure parallel execution, and improves the accuracy-cost tradeoff while transferring across tasks, agent counts, and model settings.

## References

- XVerify: efficient answer verifier for reasoning model evaluations\.External Links:2504\.10481,[Link](https://arxiv.org/abs/2504.10481)Cited by:[§3\.5](https://arxiv.org/html/2605.15573#S3.SS5.p1.3)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. de Oliveira Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, A\. Ray, R\. Puri, G\. Krueger, M\. Petrov, H\. Khlaaf, G\. Sastry, P\. Mishkin, B\. Chan, S\. Gray, N\. Ryder, M\. Pavlov, A\. Power, L\. Kaiser, M\. Bavarian, C\. Winter, P\. Tillet, F\. P\. Such, D\. Cummings, M\. Plappert, F\. Chantzis, E\. Barnes, A\. Herbert\-Voss, W\. H\. Guss, A\. Nichol, A\. Paino, N\. Tezak, J\. Tang, I\. Babuschkin, S\. Balaji, S\. Jain, W\. Saunders, C\. Hesse, A\. N\. Carr, J\. Leike, J\. Achiam, V\. Misra, E\. Morikawa, A\. Radford, M\. Knight, M\. Brundage, M\. Murati, K\. Mayer, P\. Welinder, B\. McGrew, D\. Amodei, S\. McCandlish, I\. Sutskever, and W\. Zaremba \(2021\)Evaluating large language models trained on code\.CoRRabs/2107\.03374\.External Links:[Link](https://arxiv.org/abs/2107.03374),2107\.03374Cited by:[§F\.1](https://arxiv.org/html/2605.15573#A6.SS1.p1.1),[§4\.2](https://arxiv.org/html/2605.15573#S4.SS2.p1.2)\.
- W\. Chen, Y\. Su, J\. Zuo, C\. Yang, C\. Yuan, C\. Chan, H\. Yu, Y\. Lu, Y\. Hung, C\. Qian, Y\. Qin, X\. Cong, R\. Xie, Z\. Liu, M\. Sun, and J\. Zhou \(2024\)AgentVerse: facilitating multi\-agent collaboration and exploring emergent behaviors\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=EHg5GDnyq1)Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px1.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano,et al\.\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§4\.1](https://arxiv.org/html/2605.15573#S4.SS1.p1.7),[§4\.2](https://arxiv.org/html/2605.15573#S4.SS2.p1.2)\.
- Y\. Du, S\. Li, A\. Torralba, J\. B\. Tenenbaum, and I\. Mordatch \(2023\)Improving factuality and reasoning in language models through multiagent debate\.InForty\-first International Conference on Machine Learning,Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px1.p1.1)\.
- S\. Ebrahimi, M\. Dehghankar, and A\. Asudeh \(2025\)An adversary\-resistant multi\-agent llm system via credibility scoring\.arXiv preprint arXiv:2505\.24239\.Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px3.p1.1)\.
- L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig (2023) PAL: program-aided language models. In Proceedings of the 40th International Conference on Machine Learning, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett (Eds.), Proceedings of Machine Learning Research, Vol. 202, pp. 10764–10799. External Links: [Link](https://proceedings.mlr.press/v202/gao23f.html) Cited by: [§F.1](https://arxiv.org/html/2605.15573#A6.SS1.p1.1).
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the MATH dataset\.InThirty\-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track \(Round 2\),External Links:[Link](https://openreview.net/forum?id=7Bywt2mQsCe)Cited by:[§F\.1](https://arxiv.org/html/2605.15573#A6.SS1.p1.1)\.
- S\. Hu, C\. Lu, and J\. Clune \(2025\)Automated design of agentic systems\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=t9U3LW7JVX)Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px2.p1.1)\.
- D\. Jiang, X\. Ren, and B\. Y\. Lin \(2023\)LLM\-blender: ensembling large language models with pairwise ranking and generative fusion\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 14165–14178\.External Links:[Link](https://aclanthology.org/2023.acl-long.792/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.792)Cited by:[§1](https://arxiv.org/html/2605.15573#S1.p2.1),[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px3.p1.1)\.
- G\. Li, H\. A\. A\. K\. Hammoud, H\. Itani, D\. Khizbullin, and B\. Ghanem \(2023\)CAMEL: communicative agents for ”mind” exploration of large language model society\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=3IyL2XWDkG)Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px1.p1.1)\.
- T\. Liang, Z\. He, W\. Jiao, X\. Wang, Y\. Wang, R\. Wang, Y\. Yang, S\. Shi, and Z\. Tu \(2024\)Encouraging divergent thinking in large language models through multi\-agent debate\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 17889–17904\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.992/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.992)Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px1.p1.1)\.
- W. Ling, D. Yogatama, C. Dyer, and P. Blunsom (2017) Program induction by rationale generation: learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), R. Barzilay and M. Kan (Eds.), Vancouver, Canada, pp. 158–167. External Links: [Link](https://aclanthology.org/P17-1015/), [Document](https://dx.doi.org/10.18653/v1/P17-1015) Cited by: [§4.1](https://arxiv.org/html/2605.15573#S4.SS1.p1.7), [§4.2](https://arxiv.org/html/2605.15573#S4.SS2.p1.2).
- Z\. Liu, Y\. Zhang, P\. Li, Y\. Liu, and D\. Yang \(2024\)A dynamic LLM\-powered agent network for task\-oriented agent collaboration\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=XII0Wp1XA9)Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px1.p1.1)\.
- S\. Lu, J\. Shao, B\. Luo, and T\. Lin \(2024\)Morphagent: empowering agents through self\-evolving profiles and decentralized collaboration\.arXiv preprint arXiv:2410\.15048\.Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px2.p1.1)\.
- C\. Qian, W\. Liu, H\. Liu, N\. Chen, Y\. Dang, J\. Li, C\. Yang, W\. Chen, Y\. Su, X\. Cong, J\. Xu, D\. Li, Z\. Liu, and M\. Sun \(2024\)ChatDev: communicative agents for software development\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15174–15186\.External Links:[Link](https://aclanthology.org/2024.acl-long.810/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.810)Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px1.p1.1)\.
- C\. Qian, Z\. Xie, Y\. Wang, W\. Liu, K\. Zhu, H\. Xia, Y\. Dang, Z\. Du, W\. Chen, C\. Yang, Z\. Liu, and M\. Sun \(2025\)Scaling large language model\-based multi\-agent collaboration\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=K3n5jPkrU6)Cited by:[§1](https://arxiv.org/html/2605.15573#S1.p2.1),[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px2.p1.1)\.
- Qwen, :, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§4\.1](https://arxiv.org/html/2605.15573#S4.SS1.p1.7)\.
- Qwen Team \(2026\)Qwen3\.5: towards native multimodal agents\.External Links:[Link](https://qwen.ai/blog?id=qwen3.5)Cited by:[§4\.3](https://arxiv.org/html/2605.15573#S4.SS3.SSS0.Px4.p1.4)\.
- N\. Reimers and I\. Gurevych \(2019\)Sentence\-bert: sentence embeddings using siamese bert\-networks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing,External Links:[Link](https://arxiv.org/abs/1908.10084)Cited by:[§2](https://arxiv.org/html/2605.15573#S2.p4.2),[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px3.p1.1)\.
- L\. S\. Shapley \(1953\)A value for n\-person games\.InContributions to the Theory of Games II,H\. W\. Kuhn and A\. W\. Tucker \(Eds\.\),pp\. 307–317\.Cited by:[§2](https://arxiv.org/html/2605.15573#S2.p5.4),[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px3.p1.1)\.
- V\. Subramaniam, Y\. Du, J\. B\. Tenenbaum, A\. Torralba, S\. Li, and I\. Mordatch \(2025\)Multiagent finetuning: self improvement with diverse reasoning chains\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=JtGPIZpOrz)Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px1.p1.1)\.
- N\. Tastan, S\. Horváth, and K\. Nandakumar \(2025\)Aequa: Fair Model Rewards in Collaborative Learning via Slimmable Networks\.InProceedings of the 42nd International Conference on Machine Learning,A\. Singh, M\. Fazel, D\. Hsu, S\. Lacoste\-Julien, F\. Berkenkamp, T\. Maharaj, K\. Wagstaff, and J\. Zhu \(Eds\.\),Proceedings of Machine Learning Research, Vol\.267,pp\. 59210–59236\.External Links:[Link](https://proceedings.mlr.press/v267/tastan25a.html)Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px3.p1.1)\.
- N\. Tastan, S\. Horváth, and K\. Nandakumar \(2026\)Stochastic self\-organization in multi\-agent systems\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=rS3Jb9AAej)Cited by:[Appendix A](https://arxiv.org/html/2605.15573#A1.p2.1),[§2](https://arxiv.org/html/2605.15573#S2.p3.1),[§2](https://arxiv.org/html/2605.15573#S2.p5.4),[§3\.4](https://arxiv.org/html/2605.15573#S3.SS4.p3.4),[§4\.1](https://arxiv.org/html/2605.15573#S4.SS1.p2.1),[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px3.p1.1)\.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) Cited by: [§3.3](https://arxiv.org/html/2605.15573#S3.SS3.p1.1).
- X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=1PL1NIMMrw) Cited by: [§1](https://arxiv.org/html/2605.15573#S1.p2.1), [§4.1](https://arxiv.org/html/2605.15573#S4.SS1.p2.1).
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§4\.1](https://arxiv.org/html/2605.15573#S4.SS1.p2.1),[§4\.3](https://arxiv.org/html/2605.15573#S4.SS3.SSS0.Px1.p1.5)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu, A\. H\. Awadallah, R\. W\. White, D\. Burger, and C\. Wang \(2024\)AutoGen: enabling next\-gen LLM applications via multi\-agent conversations\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=BAakY1hNKS)Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px1.p1.1)\.
- Y. Yang, H. Chai, S. Shao, Y. Song, S. Qi, R. Rui, and W. Zhang (2025) AgentNet: decentralized evolutionary coordination for LLM-based multi-agent systems. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=tXqLxHlb8Z) Cited by: [§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px2.p1.1).
- R\. Ye, K\. Huang, Q\. Wu, Y\. Cai, T\. Jin, X\. Pang, X\. Liu, J\. Su, C\. Qian, B\. Tang,et al\.\(2025a\)MASLab: a unified and comprehensive codebase for llm\-based multi\-agent systems\.arXiv preprint arXiv:2505\.16988\.Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px2.p1.1)\.
- R\. Ye, X\. Liu, Q\. Wu, X\. Pang, Z\. Yin, L\. Bai, and S\. Chen \(2025b\)X\-mas: towards building multi\-agent systems with heterogeneous llms\.arXiv preprint arXiv:2505\.16997\.Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px2.p1.1)\.
- R\. Ye, S\. Tang, R\. Ge, Y\. Du, Z\. Yin, S\. Chen, and J\. Shao \(2025c\)MAS\-GPT: training LLMs to build LLM\-based multi\-agent systems\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=3CiSpY3QdZ)Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px2.p1.1)\.
- G\. Zhang, Y\. Yue, Z\. Li, S\. Yun, G\. Wan, K\. Wang, D\. Cheng, J\. X\. Yu, and T\. Chen \(2025a\)Cut the crap: an economical communication pipeline for LLM\-based multi\-agent systems\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=LkzuPorQ5L)Cited by:[§3\.5](https://arxiv.org/html/2605.15573#S3.SS5.p1.6),[§4\.1](https://arxiv.org/html/2605.15573#S4.SS1.p2.1),[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px2.p1.1)\.
- G\. Zhang, Y\. Yue, X\. Sun, G\. Wan, M\. Yu, J\. Fang, K\. Wang, T\. Chen, and D\. Cheng \(2025b\)G\-designer: architecting multi\-agent communication topologies via graph neural networks\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=LpE54NUnmO)Cited by:[§1](https://arxiv.org/html/2605.15573#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.15573#S4.SS1.p2.1),[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px2.p1.1)\.
- J\. Zhang, J\. Xiang, Z\. Yu, F\. Teng, X\. Chen, J\. Chen, M\. Zhuge, X\. Cheng, S\. Hong, J\. Wang, B\. Zheng, B\. Liu, Y\. Luo, and C\. Wu \(2025c\)AFlow: automating agentic workflow generation\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=z5uVAKwmjf)Cited by:[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px2.p1.1)\.
- M\. Zhuge, W\. Wang, L\. Kirsch, F\. Faccio, D\. Khizbullin, and J\. Schmidhuber \(2024\)GPTSwarm: language agents as optimizable graphs\.InProceedings of the 41st International Conference on Machine Learning,R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\),Proceedings of Machine Learning Research, Vol\.235,pp\. 62743–62767\.External Links:[Link](https://proceedings.mlr.press/v235/zhuge24a.html)Cited by:[§1](https://arxiv.org/html/2605.15573#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.15573#S4.SS1.p2.1),[§5](https://arxiv.org/html/2605.15573#S5.SS0.SSS0.Px2.p1.1)\.

## Appendix A Limitations

Nexa is evaluated primarily on reasoning and programming benchmarks where answer correctness can be measured reliably. This focus allows controlled comparisons across task, agent count, model scale, and model generation, but leaves broader open-ended settings such as long-form generation, interactive tool use, and multi-turn planning as natural directions for future evaluation.

The method also depends on response embeddings. If the embedding model fails to capture task-relevant differences between candidate answers, the contribution ordering and graph policy may miss useful communication paths. For an assessment of how well the selected embedding model serves this purpose, we refer the reader to \[Tastan et al., [2026](https://arxiv.org/html/2605.15573#bib.bib1)\].

Finally, Nexa deliberately uses one parallel draft round as the evidence-gathering stage for deciding whether communication is needed. This makes the sequential part selective and often sparse, but it also means that the initial agent pool size remains an important efficiency knob. Future extensions could combine Nexa with adaptive agent selection so that both the number of initial drafts and the communication graph are chosen instance by instance.

## Appendix B Algorithm

Algorithm 1: Nexa

1. Input: query $\mathcal{Q}$, agents $\{\mathcal{A}_n\}_{n=1}^{N}$, encoder $f$, policy $p_\theta(\mathcal{E} \mid \mathcal{X}, \pi)$
2. for $n = 1$ to $N$ do
3. $\mathcal{R}_n^{(0)} \leftarrow \mathcal{A}_n(\mathcal{Q})$, $r_n \leftarrow f(\mathcal{R}_n^{(0)})$
4. end for
5. $r_{\rm avg} \leftarrow \frac{1}{N}\sum_{n} r_n$, $\psi_n \leftarrow \cos(r_n, r_{\rm avg})$
6. $\pi \leftarrow \operatorname{argsort}(\{\psi_n\}_{n=1}^{N}; \mathrm{desc})$
7. $\mathcal{X} \leftarrow [r_1, \dots, r_N]^{\top}$, $\mathcal{H} \leftarrow \operatorname{Enc}_\theta(\mathcal{X})$
8. Compute masked forward logits $\ell_{m \to n}$
9. Sample/decode $\mathcal{E} \sim p_\theta(\cdot \mid \mathcal{X}, \pi)$
10. if $\mathcal{E} = \varnothing$ then return centroid response from $\{\mathcal{R}_n^{(0)}\}_{n=1}^{N}$
11. for each node $n$ in order $\pi$ do
12. $\operatorname{Pa}(n) \leftarrow \{m : (m \to n) \in \mathcal{E}\}$
13. $\mathcal{R}_n^{(1)} \leftarrow \begin{cases} \mathcal{A}_n\!\left(\mathcal{Q}, \{\mathcal{R}_m^{(\star)} : m \in \operatorname{Pa}(n)\}\right), & \operatorname{Pa}(n) \neq \varnothing, \\ \mathcal{R}_n^{(0)}, & \text{otherwise.} \end{cases}$
14. end for
15. return centroid response from $\{\mathcal{R}_n^{(1)}\}_{n=1}^{N}$
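The control flow of Algorithm 1 can be illustrated with a small NumPy sketch. It is a toy under stated assumptions rather than the authors' implementation: the learned policy is replaced by fixed logits with a hypothetical threshold decode, agents are stand-in callables, all function names are invented for this example, and the final centroid-response selection is omitted. Reading $\mathcal{R}_m^{(\star)}$ as the parent's latest available response is likewise an interpretation.

```python
import numpy as np


def contribution_order(R):
    """Steps 5-6: rank agents by cosine similarity to the mean embedding, descending."""
    r_avg = R.mean(axis=0)
    psi = (R @ r_avg) / (np.linalg.norm(R, axis=1) * np.linalg.norm(r_avg) + 1e-12)
    return [int(i) for i in np.argsort(-psi, kind="stable")]


def forward_mask(order):
    """mask[m, n] is True only when agent m precedes agent n in `order`."""
    rank = np.empty(len(order), dtype=int)
    rank[order] = np.arange(len(order))
    return rank[:, None] < rank[None, :]


def decode_edges(logits, mask, threshold=0.0):
    """Hypothetical greedy decode: keep admissible edges whose logit exceeds the threshold."""
    return mask & (logits > threshold)


def propagate(query, drafts, edges, order, agents):
    """Steps 11-14: one sequential pass; nodes without parents keep their draft."""
    latest = list(drafts)
    for n in order:
        parents = [m for m in order if edges[m, n]]
        if parents:
            # Parents come earlier in `order`, so latest[m] is already final.
            latest[n] = agents[n](query, [latest[m] for m in parents])
    return latest


# Toy run: 3 agents with 2-d embeddings; every admissible edge is switched on.
R = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
order = contribution_order(R)
mask = forward_mask(order)
edges = decode_edges(np.ones((3, 3)), mask)
agents = [lambda q, ctx, n=n: f"a{n}({'+'.join(ctx)})" for n in range(3)]
final = propagate("q", ["d0", "d1", "d2"], edges, order, agents)
print(order, final)
```

With an all-False edge matrix, `propagate` returns the drafts unchanged, which corresponds to the purely parallel branch in step 10.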

## Appendix C Policy Behavior with Smaller Agents

We observe the same qualitative behavior with Qwen2.5-1.5B agents. As shown in Figure [7](https://arxiv.org/html/2605.15573#A3.F7), the rescue rate increases from 15.6% at $N=5$ to 20.6% at $N=20$, while the preservation rate remains above 92.8% for all tested team sizes. Harm is higher than in the 7B setting (refer to Figure [6](https://arxiv.org/html/2605.15573#S4.F6)), ranging from 5.6% to 7.2%, which is expected given the weaker base agents. Nevertheless, the policy still improves a meaningful fraction of initially wrong answers while preserving the vast majority of initially correct ones, indicating that the same communication behavior appears even with smaller models.

![Refer to caption](https://arxiv.org/html/2605.15573v1/x7.png)
Figure 7: Policy behavior analysis for Nexa with Qwen2.5-1.5B agents on GSM8K. Rescue, harm, and preservation rates are computed by comparing each initial draft answer with the final answer after communication. Nexa rescues 15.6%–20.6% of initially wrong answers while preserving 92.8%–94.4% of initially correct answers across tested team sizes.

## Appendix D Communication Sparsity

Figure [8](https://arxiv.org/html/2605.15573#A4.F8) reports the fraction of low-edge communication plans produced by Nexa on GSM8K. For Qwen2.5-1.5B, low-edge plans occur in 35.0% of examples at $N=5$ and remain near 47–50% for larger tested team sizes. For Qwen2.5-7B, the fraction increases from 43.4% at $N=5$ to more than 70% at $N=15$ and $N=20$. These results suggest that increasing the number of agents does not force dense communication; the learned policy often selects sparse interaction patterns.

We also observe that the more capable backbone produces sparse communication plans more often than the weaker one, indicating that as individual agents become more capable, the policy can rely on fewer communication edges.
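The low-edge fraction can be computed from per-example edge counts. The sketch below follows the Figure 8 criterion of using at most half of the possible edges, and assumes that the number of possible edges is the $N(N-1)/2$ forward pairs allowed by the contribution ordering; that denominator and the function name `low_edge_fraction` are assumptions made for illustration.

```python
from typing import Sequence


def low_edge_fraction(edge_counts: Sequence[int], n_agents: int) -> float:
    """Fraction of examples whose graph uses at most half of the possible edges."""
    max_edges = n_agents * (n_agents - 1) // 2  # assumed: forward pairs only
    if max_edges == 0 or not edge_counts:
        raise ValueError("need at least two agents and one example")
    low = sum(1 for e in edge_counts if 2 * e <= max_edges)
    return low / len(edge_counts)


# Toy data for N=5 (10 possible forward edges): counts 0, 3, 5 qualify as low-edge.
fraction = low_edge_fraction([0, 3, 5, 6, 10], n_agents=5)
print(fraction)
```

Empty graphs count as low-edge under this definition, so the statistic also reflects how often the policy stays purely parallel.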

![Refer to caption](https://arxiv.org/html/2605.15573v1/x8.png)
Figure 8: Communication sparsity for Nexa on GSM8K with Qwen2.5-1.5B and Qwen2.5-7B agents. We report the fraction of examples whose predicted communication graph uses at most half of the possible edges. Across both model sizes, Nexa frequently selects low-edge plans, indicating that the learned policy does not rely on dense all-to-all communication as team size increases.

## Appendix E Experimental Settings

#### Compute resources\.

All experiments were run on NVIDIA A100 40GB GPUs. The same GPU class was used both to serve the LLM agents during multi-agent inference and to train the Nexa policy.

#### Backbone ablation setting\.

For the policy-backbone ablation, we restrict training to 5 epochs for both the Transformer and GNN variants to keep the comparison controlled and computationally lightweight. Both variants use the same Nexa training objective, the same GSM8K training setting, Qwen2.5-1.5B agents, sampled graph plans at evaluation, temperature 0.5, and three random repeats. The only changed component is the policy-network backbone used to score response-conditioned communication graphs. The GNN variant adapts the graph neural architecture from G-Designer from agent-role-specific graph design to Nexa's response-conditioned graph prediction setting.

#### Policy\-optimization ablation setting\.

For the policy\-optimization ablation, we compare the policy\-gradient objective used inNexaagainst a GRPO\-style update\. Both use Qwen2\.5\-1\.5B agents, response\-only inputs, contribution\-based ordering, weighted aggregation, and XVerify\-based evaluation at temperature0\.00\.0\. Evaluation is conducted on AQUA\-RAT atN=5N=5for both same\-task training, AQUA\-RAT→\\rightarrowAQUA\-RAT, and cross\-task transfer, GSM8K→\\rightarrowAQUA\-RAT\.

Both variants use the Transformer backbone, 50 policy\-training iterations, batch size 32, learning rate 0\.1, hidden dimension 128, 1 transformer layer, 1 attention head, dropout 0\.3, the Adam optimizer, gradient clipping at 1\.0, and a batch\-mean baseline\. For GRPO, we use 4 rollouts\.
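The difference between the two update rules can be illustrated by how each turns rewards into advantages. The sketch below is an assumption\-laden illustration rather than the paper's implementation: the function names are hypothetical, the batch\-mean baseline is taken to be the plain mean reward over the batch, and the GRPO\-style variant is assumed to follow the common recipe of normalizing each reward by the mean and standard deviation of its rollout group.

```python
from statistics import mean, pstdev
from typing import List


def batch_mean_advantages(rewards: List[float]) -> List[float]:
    # Policy-gradient variant: subtract the mean reward of the batch as a baseline.
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def group_relative_advantages(
    groups: List[List[float]], eps: float = 1e-8
) -> List[List[float]]:
    # GRPO-style variant (assumed form): each inner list holds the rewards of the
    # rollouts sampled for one example, normalized by that group's mean and std.
    advantages = []
    for rewards in groups:
        mu = mean(rewards)
        sigma = pstdev(rewards)
        # eps keeps the division finite when all rollouts get the same reward.
        advantages.append([(r - mu) / (sigma + eps) for r in rewards])
    return advantages


# Toy 0/1 correctness rewards (illustrative, not results from the paper).
batch_adv = batch_mean_advantages([1.0, 0.0, 1.0, 0.0])
group_adv = group_relative_advantages([[1.0, 0.0, 0.0, 0.0]])  # 4 rollouts, one example
```

With 4 rollouts per example, a group in which every rollout receives the same reward yields zero advantages under the group\-relative form, whereas the batch\-mean baseline still compares that example against the rest of the batch.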

## Appendix F Additional Experiments

### F\.1 Extended Task Transfer

We additionally study whether a policy trained on one source task can transfer to target tasks beyond those in the main transfer experiments\. In this setting, Nexa is trained on GSM8K and evaluated without retraining on GSM\-Hard \[Gao et al\., [2023](https://arxiv.org/html/2605.15573#bib.bib176)\], HumanEval \[Chen et al\., [2021](https://arxiv.org/html/2605.15573#bib.bib184)\], and MMLU \[Hendrycks et al\., [2021](https://arxiv.org/html/2605.15573#bib.bib175)\]\. We compare against single\-call and chain\-of\-thought baselines using the same target\-task evaluation protocol\. As shown in Table [4](https://arxiv.org/html/2605.15573#A6.T4), the GSM8K\-trained Nexa policy improves over both baselines on GSM\-Hard and HumanEval and remains comparable to the baselines on MMLU\. These results suggest that the learned communication policy can transfer beyond the training task, especially when the target task benefits from structured multi\-agent reasoning\.

Table 4: Extended task\-transfer results\. Nexa is trained on GSM8K and evaluated without retraining on GSM\-Hard, HumanEval, and MMLU\. Accuracy is reported as mean ± standard deviation\.