Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains

arXiv cs.CL Papers

Summary

This paper proposes TopoPrior, a framework that learns transferable topology priors from offline reference collaboration graphs to generate initial topologies for multi-agent LLM collaboration across domains, significantly reducing online search overhead and token consumption.

arXiv:2605.17359v1 Announce Type: new

Cached at: 05/19/26, 06:39 AM

# Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains
Source: [https://arxiv.org/html/2605.17359](https://arxiv.org/html/2605.17359)
Taolin Zhang¹, Zijie Zhou², Jiuheng Wan¹, Tingyuan Hu³, Chengyu Wang⁴, Xiaofeng He³, and Richang Hong¹

¹ Taolin Zhang, Jiuheng Wan, and Richang Hong are with the Hefei University of Technology, Hefei 230002, China (e-mail: tlzhang@hfut.edu.cn; wan_jiuheng@163.com; hongrc@hfut.edu.cn).

² Zijie Zhou is with the China University of Petroleum (Beijing), Beijing 102249, China (e-mail: zjzhouzh@gmail.com).

³ Tingyuan Hu and Xiaofeng He are with the East China Normal University, Shanghai 200062, China (e-mail: 10245102409@stu.ecnu.edu.cn; hexf@cs.ecnu.edu.cn).

⁴ Chengyu Wang is with the Alibaba Group, Hangzhou 310052, China (e-mail: chengyu.wcy@alibaba-inc.com).

Corresponding authors: Chengyu Wang and Richang Hong.

###### Abstract

Large language model (LLM)-based multi-agent systems have shown strong potential for complex reasoning by coordinating specialized agents through structured communication. However, existing topology-evolution methods typically construct or optimize a collaboration topology for each query from scratch, leading to substantial online search overhead, high inference-time token consumption, and limited scalability in multi-domain settings. We propose TopoPrior, a framework for learning transferable topology priors for multi-agent LLM collaboration across domains. Rather than repeatedly searching for effective collaboration structures online, TopoPrior learns reusable topology priors from reference collaboration graphs collected offline from multiple domains and uses them to generate query-conditioned initial collaboration graphs for downstream refinement. By shifting part of topology search from per-query online optimization to offline prior learning, TopoPrior amortizes search cost while remaining compatible with existing topology-evolution backbones. Technically, TopoPrior contains two key components. First, a *transferable topology prior learning* module employs a conditional variational graph framework to capture reusable structural regularities across domains in a latent space. Second, a *query-conditioned latent adaptation* module introduces adversarial alignment to reduce unnecessary domain discrepancy while preserving query-relevant structural variation. Experiments on multi-domain reasoning benchmarks show that TopoPrior consistently improves several heterogeneous topology-evolution backbones while reducing online inference-time token usage, with only modest additional trainable parameters. These results suggest that transferable topology initialization is an effective and lightweight mechanism for improving the efficiency of multi-agent LLM collaboration across domains.

## I Introduction

Large language model (LLM)-based multi-agent systems have emerged as a promising paradigm for complex reasoning by coordinating multiple specialized agents through structured communication. This paradigm is particularly appealing in multi-domain settings, such as healthcare \[[39](https://arxiv.org/html/2605.17359#bib.bib45)\], science \[[15](https://arxiv.org/html/2605.17359#bib.bib78), [21](https://arxiv.org/html/2605.17359#bib.bib26)\], and law \[[41](https://arxiv.org/html/2605.17359#bib.bib77), [27](https://arxiv.org/html/2605.17359#bib.bib75)\], where different queries may require different combinations of expertise and collaboration patterns. A central challenge in this setting is how to *reuse* effective collaboration structures across domains, so that multi-agent systems can adapt to new tasks without repeatedly searching for communication topologies from scratch.

Existing approaches to multi-domain LLM reasoning can be broadly grouped into three categories, each with different trade-offs in flexibility, effectiveness, and computational cost. *Training-free prompting* methods adapt frozen LLMs through prompt engineering, few-shot demonstrations, and chain-of-thought reasoning \[[4](https://arxiv.org/html/2605.17359#bib.bib9), [42](https://arxiv.org/html/2605.17359#bib.bib24), [20](https://arxiv.org/html/2605.17359#bib.bib23)\]. Although easy to deploy, their performance is ultimately limited by the capability of the underlying model \[[57](https://arxiv.org/html/2605.17359#bib.bib73)\]. *Fine-tuning-based* methods specialize models to target domains through parameter updates \[[46](https://arxiv.org/html/2605.17359#bib.bib71), [35](https://arxiv.org/html/2605.17359#bib.bib72)\], but they may suffer from catastrophic forgetting and introduce substantial training and storage overhead when many domains are involved \[[31](https://arxiv.org/html/2605.17359#bib.bib11), [34](https://arxiv.org/html/2605.17359#bib.bib21)\]. By contrast, *multi-agent collaboration* improves reasoning by decomposing tasks across specialized agents and organizing their interactions through communication topologies. Recent work has progressed from static topologies to dynamic graph construction, including reinforcement-learned, pruning-based, and autoregressive approaches \[[18](https://arxiv.org/html/2605.17359#bib.bib94), [36](https://arxiv.org/html/2605.17359#bib.bib105), [54](https://arxiv.org/html/2605.17359#bib.bib8), [55](https://arxiv.org/html/2605.17359#bib.bib100)\]. However, most of these methods treat topology construction as a query-level optimization problem and often perform search from scratch, which can incur substantial online overhead, accumulated communication noise, and high inference-time token consumption in multi-domain settings \[[29](https://arxiv.org/html/2605.17359#bib.bib15)\]; see Fig. [1](https://arxiv.org/html/2605.17359#S1.F1).

![Refer to caption](https://arxiv.org/html/2605.17359v1/x1.png)

Figure 1: Comparison of reasoning paradigms across domains. (1) Training-free methods rely on frozen LLMs and are ultimately limited by the intrinsic capability of the underlying model. (2) Training-intensive methods update model parameters, but may incur catastrophic forgetting and substantial computational overhead across domains. (3) Multi-agent methods dynamically construct collaboration topologies, yet repeatedly optimizing graphs from scratch for each query can incur high online token cost. Our method learns transferable topology priors that provide query-aware initialization for downstream topology evolution.

This limitation motivates a different perspective: *instead of treating every query as a new topology-search problem, can we learn reusable collaboration patterns offline and use them to initialize downstream topology evolution?* To this end, we propose TopoPrior, a framework for learning transferable topology priors for multi-agent LLM collaboration across domains. The key idea is to shift part of the topology-search burden from per-query online graph construction to offline cross-domain prior learning. Rather than evolving a collaboration graph from scratch for every incoming query, TopoPrior learns reusable topology priors from reference collaboration graphs collected across multiple domains, and then uses these priors to generate query-conditioned initial collaboration graphs for downstream refinement. In this way, TopoPrior amortizes part of the topology-search cost across queries and domains while remaining compatible with existing topology-evolution backbones.

TopoPrior contains two key components. *(1) Transferable Topology Prior Learning.* We develop a conditional variational graph framework to capture reusable structural regularities in collaboration graphs across domains. Specifically, a variational encoder maps collaboration graphs and their associated queries into a latent space, while a conditional generator reconstructs query-conditioned initial topologies from the learned prior and the input query. This design enables TopoPrior to model collaboration structures that can be reused across related domains, rather than relearned independently for each query. *(2) Query-Conditioned Latent Adaptation.* While transferable priors can improve initialization efficiency, effective collaboration still requires sensitivity to query-relevant structural variation. We therefore introduce an adversarial latent adaptation module to reduce unnecessary domain discrepancy while preserving query-dependent structural characteristics. As a result, the generated topologies reflect both reusable collaboration regularities and task-specific specialization.

We evaluate TopoPrior on MMLU \[[15](https://arxiv.org/html/2605.17359#bib.bib78)\] and C-Eval \[[21](https://arxiv.org/html/2605.17359#bib.bib26)\] under multi-domain settings. Experimental results show that integrating TopoPrior with existing topology-evolution backbones improves downstream performance in most evaluated settings, reduces online inference-time token usage by up to 40.22%, and introduces only 3.3M additional trainable parameters for the topology-prior generator. These results suggest that transferable topology initialization can serve as an effective and lightweight mechanism for improving the efficiency of multi-agent LLM collaboration across domains.

The main contributions of this work are threefold:

- We formulate topology initialization for multi-agent LLM systems in multi-domain settings as a transferable topology prior learning problem, in which reusable collaboration structures are learned offline and reused to improve downstream topology evolution.
- We propose TopoPrior, a lightweight framework that combines conditional variational topology prior learning with query-conditioned latent adaptation to generate informative initial collaboration graphs.
- We conduct experiments on MMLU and C-Eval with multiple topology-evolution backbones, showing that TopoPrior improves downstream reasoning performance and online inference token efficiency across the evaluated settings with modest parameter overhead.

## II Related Work

### II-A Multi-Domain Reasoning with LLMs

Reasoning across multiple domains has become an increasingly important topic with the rise of LLMs \[[26](https://arxiv.org/html/2605.17359#bib.bib31), [5](https://arxiv.org/html/2605.17359#bib.bib32), [43](https://arxiv.org/html/2605.17359#bib.bib33)\]. Existing approaches mainly follow two paradigms. *Training-free methods* adapt frozen LLMs through in-context learning, prompt engineering, and chain-of-thought reasoning \[[4](https://arxiv.org/html/2605.17359#bib.bib9), [37](https://arxiv.org/html/2605.17359#bib.bib28), [42](https://arxiv.org/html/2605.17359#bib.bib24), [20](https://arxiv.org/html/2605.17359#bib.bib23)\]. They are easy to deploy and avoid parameter updates, but their performance is ultimately limited by the capability of the underlying model. *Training-intensive methods* adapt LLMs through supervised fine-tuning or parameter-efficient tuning such as LoRA \[[19](https://arxiv.org/html/2605.17359#bib.bib22)\]. Although effective for domain specialization, they may suffer from catastrophic forgetting and require additional storage and maintenance when many domains are involved \[[45](https://arxiv.org/html/2605.17359#bib.bib103), [32](https://arxiv.org/html/2605.17359#bib.bib30), [34](https://arxiv.org/html/2605.17359#bib.bib21), [31](https://arxiv.org/html/2605.17359#bib.bib11)\]. In contrast to prompt-level or parameter-level adaptation, our work studies a different axis of adaptation, namely structural adaptation at the level of inter-agent communication. Specifically, rather than modifying prompts or updating LLM parameters, we improve multi-domain reasoning efficiency by learning reusable topology priors for multi-agent collaboration.

### II-B Topology Design in Multi-Agent LLM Systems

Multi-agent LLM systems improve reasoning by organizing agent interactions through communication topologies. Early methods mainly adopt predefined interaction patterns, including independent aggregation \[[23](https://arxiv.org/html/2605.17359#bib.bib88), [56](https://arxiv.org/html/2605.17359#bib.bib87), [10](https://arxiv.org/html/2605.17359#bib.bib89)\], chain-based communication \[[17](https://arxiv.org/html/2605.17359#bib.bib95), [44](https://arxiv.org/html/2605.17359#bib.bib93), [18](https://arxiv.org/html/2605.17359#bib.bib94)\], star-style coordination \[[52](https://arxiv.org/html/2605.17359#bib.bib82), [59](https://arxiv.org/html/2605.17359#bib.bib97)\], and tree-structured hierarchies \[[22](https://arxiv.org/html/2605.17359#bib.bib98)\]. While effective, these fixed designs are often inflexible when task requirements vary across inputs or domains.

More recent studies dynamically construct or optimize collaboration graphs. GPTSwarm \[[61](https://arxiv.org/html/2605.17359#bib.bib44)\] and DyLAN \[[36](https://arxiv.org/html/2605.17359#bib.bib105)\] adapt agent interactions through reinforcement learning or dynamic agent selection. Pruning-based methods remove redundant edges or agents to obtain task-adaptive sparse graphs \[[54](https://arxiv.org/html/2605.17359#bib.bib8), [50](https://arxiv.org/html/2605.17359#bib.bib39)\], while autoregressive approaches generate collaboration structures conditioned on the input query \[[55](https://arxiv.org/html/2605.17359#bib.bib100), [33](https://arxiv.org/html/2605.17359#bib.bib47)\]. Although these methods have shown strong reasoning performance, most of them construct or optimize topologies at query time and often search from scratch, incurring substantial online search cost and inference-time token overhead. Our work is complementary to this line of research: rather than replacing downstream topology evolution, we learn transferable topology priors that provide stronger initialization for it.

### II-C Transferable Structure Learning and Graph Initialization

Reusing structural knowledge across tasks or domains is important for efficient adaptation. Prior work in domain generalization and representation transfer suggests that shared latent structure can support transfer across heterogeneous settings \[[11](https://arxiv.org/html/2605.17359#bib.bib69), [31](https://arxiv.org/html/2605.17359#bib.bib11), [9](https://arxiv.org/html/2605.17359#bib.bib41)\]. In graph representation learning and graph generation, variational and conditional generative frameworks have been used to model reusable structural patterns from observed graphs \[[24](https://arxiv.org/html/2605.17359#bib.bib91), [3](https://arxiv.org/html/2605.17359#bib.bib40)\]. More broadly, warm-start initialization and structure reuse have long been recognized as practical strategies for reducing optimization cost in complex search problems \[[14](https://arxiv.org/html/2605.17359#bib.bib68), [2](https://arxiv.org/html/2605.17359#bib.bib80)\]. However, existing multi-agent LLM methods treat topology construction as a query-level search problem without learning transferable priors across domains. As a result, they do not amortize reusable collaboration patterns, and graph construction remains query-specific.

In contrast, our work differs in two key ways: (1) it learns a *transferable topology prior* for graph initialization instead of searching from scratch per query; (2) it combines prior learning with query-conditioned latent adaptation to preserve both cross-domain regularities and query-specific specialization. This design amortizes topology-search cost across domains and improves downstream efficiency.

## III TopoPrior: Framework

In this section, we formalize the problem of transferable topology prior learning for multi-agent LLM collaboration in multi-domain settings and present the proposed TopoPrior framework. An overview of TopoPrior is shown in Fig. [2](https://arxiv.org/html/2605.17359#S3.F2).

### III-A Problem Definition

Assume that we are given training data from $K$ domains, denoted by $\mathcal{D}=\bigcup_{k=1}^{K}\mathcal{D}^{k}$, where $\mathcal{D}^{k}=\{(q_{i}^{k},\mathbb{G}_{i}^{k},y_{i}^{k})\}_{i=1}^{M_{k}}$. Here, $q_{i}^{k}$ is a query from the $k$-th domain, $y_{i}^{k}$ is the corresponding ground-truth answer, and $M_{k}$ is the number of samples in that domain. $\mathbb{G}_{i}^{k}=(\mathcal{V}_{i}^{k},\mathcal{E}_{i}^{k})$ denotes a reference collaboration graph produced by a strong topology-evolution method. (Footnote 1: In our implementation, we use AgentDropout \[[50](https://arxiv.org/html/2605.17359#bib.bib39)\] to construct reference collaboration graphs.) The reference graph is not assumed to be globally optimal; rather, it serves as an effective but imperfect source of structural supervision for learning transferable topology priors across domains.

Let $\mathcal{A}_{i}^{k}\in\mathbb{R}^{N\times N}$ denote the adjacency matrix derived from $\mathbb{G}_{i}^{k}$, where $N$ is the size of the candidate role pool and serves as an upper bound on the number of generated agent nodes. The goal of TopoPrior is to learn a query-conditioned topology initializer that maps a new query $q$ to an initial collaboration graph $\hat{\mathbb{G}}$, such that $\hat{\mathbb{G}}$ provides an informative starting point for downstream topology evolution. At inference time, the learned initializer is integrated with an existing topology-evolution backbone, which further refines $\hat{\mathbb{G}}$ according to its original task-specific optimization process. In our framework, topology-prior learning primarily uses query–graph pairs $(q_{i}^{k},\mathbb{G}_{i}^{k})$, while the task labels $y_{i}^{k}$ are retained for reference-graph construction and downstream task evaluation.
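As a minimal sketch of the data layout described above, the following Python snippet builds an $N \times N$ adjacency matrix over a fixed role pool from a reference graph given as a directed edge list. The role names, the edge list, and the choice of binary entries are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Tuple


def adjacency_from_edges(
    role_pool: List[str], edges: List[Tuple[str, str]]
) -> List[List[float]]:
    """Build an N x N adjacency matrix indexed by the full role pool.

    Roles that do not appear in the reference graph keep all-zero
    rows and columns, so N acts as an upper bound on the node count.
    """
    index = {role: i for i, role in enumerate(role_pool)}
    n = len(role_pool)
    adj = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        adj[index[src]][index[dst]] = 1.0  # directed link src -> dst (binary: assumption)
    return adj


# Hypothetical role pool and reference graph, for illustration only.
role_pool = ["planner", "solver", "critic", "summarizer"]
reference_edges = [("planner", "solver"), ("solver", "critic")]
A = adjacency_from_edges(role_pool, reference_edges)
```

In this toy example the `summarizer` role is in the pool but unused, so its row and column stay zero.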

![Refer to caption](https://arxiv.org/html/2605.17359v1/x2.png)

Figure 2: Overview of TopoPrior. (1) *Transferable Topology Prior Learning* captures reusable collaboration patterns from multiple domains through a conditional variational graph framework. (2) *Query-Conditioned Latent Adaptation* improves cross-domain robustness by adversarially regularizing the latent space while retaining query-relevant structural information.

### III-B Transferable Topology Prior Learning

The first component of TopoPrior aims to learn reusable topology priors that can initialize collaboration graphs across domains. Since queries from different domains may induce distinct yet structurally related collaboration patterns, we adopt a conditional variational graph framework \[[24](https://arxiv.org/html/2605.17359#bib.bib91)\] to encode reference graphs and queries into a shared latent space and reconstruct query-conditioned initial topologies from the resulting latent prior.

Formally, given a query–graph pair $(q,\mathbb{G})$, we model the conditional likelihood of the collaboration graph as

$$\log p_{\theta}(\mathbb{G}\mid q)\geq\mathbb{E}_{z\sim q_{\phi}(z\mid\mathbb{G},q)}\big[\log p_{\theta}(\mathbb{G}\mid z,q)\big]-\mathrm{KL}\!\left(q_{\phi}(z\mid\mathbb{G},q)\,\|\,p^{\prime}_{\theta}(z\mid q)\right), \tag{1}$$

where $q_{\phi}(z\mid\mathbb{G},q)$ is the variational encoder, $p_{\theta}(\mathbb{G}\mid z,q)$ is the conditional graph generator, and $p^{\prime}_{\theta}(z\mid q)$ is a query-conditioned prior. This formulation allows TopoPrior to capture reusable structural regularities from reference collaboration graphs while preserving query-specific adaptation through the conditional prior.

*Variational Encoder.* Given a query and its reference graph $(q,\mathbb{G})$, the encoder $q_{\phi}(z\mid\mathbb{G},q)$ uses a Graph Convolutional Network (GCN) \[[25](https://arxiv.org/html/2605.17359#bib.bib38)\] to encode the collaboration graph (Footnote 2: other graph modeling methods can serve as replacements for this choice):

$$\mathbf{h}_{v}^{(l+1)}=f_{\sigma}\!\left(\tilde{\mathcal{D}}^{-\frac{1}{2}}\tilde{\mathcal{A}}_{i}^{k}\tilde{\mathcal{D}}^{-\frac{1}{2}}\mathbf{h}_{v}^{(l)}\mathbf{W}_{\mathrm{ve}}^{(l)}\right), \tag{2}$$

where $\tilde{\mathcal{A}}_{i}^{k}=\mathcal{A}_{i}^{k}+I$ is the adjacency matrix with self-loops, $\tilde{\mathcal{D}}$ is the corresponding degree matrix, and $\mathbf{W}_{\mathrm{ve}}^{(l)}$ are trainable weights. The query representation $\mathbf{h}_{q}$ and the initial agent-role representation $\mathbf{h}_{v}^{(0)}$ are obtained from a frozen text encoder such as BERT \[[8](https://arxiv.org/html/2605.17359#bib.bib92)\]. We then compute a graph-level representation by sum pooling over node embeddings:

$$\mathbf{h}_{\mathbb{G}}=\sum_{n=1}^{N}\mathbf{h}_{v_{n}}. \tag{3}$$

The query and graph representations are fused as

$$\mathbf{h}_{\mathrm{task}}=\mathrm{MLP}_{\mathrm{task}}(\mathbf{h}_{q}\parallel\mathbf{h}_{\mathbb{G}}). \tag{4}$$

The latent variable $z$ is sampled from the Gaussian posterior:

$$\boldsymbol{\mu}=\mathbf{W}_{\mu}\mathbf{h}_{\mathrm{task}}+\mathbf{b}_{\mu}, \tag{5}$$

$$\log\boldsymbol{\sigma}^{2}=\mathbf{W}_{\sigma}\mathbf{h}_{\mathrm{task}}+\mathbf{b}_{\sigma}, \tag{6}$$

$$q_{\phi}(z\mid\mathbb{G},q)=\mathcal{N}\!\left(z;\boldsymbol{\mu},\mathrm{diag}(\boldsymbol{\sigma}^{2})\right), \tag{7}$$

where $z=\boldsymbol{\mu}+\boldsymbol{\sigma}\odot\varepsilon$, $\varepsilon\sim\mathcal{N}(0,I)$, and $\boldsymbol{\sigma}=\exp\!\left(\frac{1}{2}\log\boldsymbol{\sigma}^{2}\right)$.
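The encoder steps in Eqs. (2), (3), and (7) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: ReLU stands in for the unspecified activation $f_{\sigma}$, the degree matrix is taken as the row sums of the self-looped adjacency, the toy inputs are made up, and the MLP fusion of Eq. (4) and the linear heads of Eqs. (5) and (6) are omitted for brevity.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2}, the propagation matrix of Eq. (2)."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # row sums; at least 1 because of self-loops
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, h, w):
    """One GCN propagation step; ReLU is an assumed choice for f_sigma."""
    pre_activation = matmul(matmul(normalized_adjacency(adj), h), w)
    return [[max(0.0, x) for x in row] for row in pre_activation]


def sum_pool(h):
    """Eq. (3): sum node embeddings into a single graph-level vector."""
    return [sum(col) for col in zip(*h)]


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with sigma = exp(0.5 * log_var), eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


# Toy inputs (hypothetical): a 3-node chain with 2-dimensional role features.
adj = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]  # identity weights keep the arithmetic easy to follow

h1 = gcn_layer(adj, h0, w)
h_graph = sum_pool(h1)
z = reparameterize(mu=[0.0, 0.0], log_var=[0.0, 0.0], rng=random.Random(0))
```

Writing the sample as a deterministic function of $\boldsymbol{\mu}$, $\log\boldsymbol{\sigma}^{2}$, and external noise is what lets gradients reach the encoder parameters during training.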

*Conditional Generator.* The conditional generator reconstructs an initial collaboration graph autoregressively:

$$p_{\theta}(\mathbb{G}\mid z,q)=\prod_{t=1}^{T}p_{\theta}(v_{t}\mid U_{<t},z,q)\times\prod_{t=2}^{T}\prod_{s=1}^{t-1}p_{\theta}(e_{s\rightarrow t}\mid v_{s},v_{t},U_{<t},z,q), \tag{8}$$

where $U_{<t}$ denotes the set of previously generated nodes, and $T$ is the total number of generation steps. $p_{\theta}(e_{s\rightarrow t})$ is the probability of generating a directed edge from node $v_{s}$ to node $v_{t}$; during inference, the edge is added to $\mathcal{E}$ if this probability exceeds the threshold $\delta_{e}$. In our implementation, node generation selects agent roles from a predefined role pool, and edge generation predicts directed communication links between generated nodes conditioned on the latent prior and the query representation. In practice, $T$ is bounded by the size of the candidate role pool, and unselected roles are omitted from the generated graph. This autoregressive design enables TopoPrior to produce sparse, query-adaptive initial collaboration graphs for downstream refinement.
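The inference-time edge rule described above can be illustrated with a short Python sketch. It assumes the edge probabilities have already been produced by some generator and are stored in a dictionary keyed by generation-step indices; the function name, the dictionary layout, and the example values are hypothetical.

```python
from typing import Dict, List, Tuple


def select_edges(
    nodes: List[str],
    edge_prob: Dict[Tuple[int, int], float],
    delta_e: float,
) -> List[Tuple[str, str]]:
    """Keep the directed edge s -> t when its probability exceeds delta_e.

    Only pairs with s generated before t are considered, mirroring the
    double product over t and s < t in Eq. (8).
    """
    edges = []
    for t in range(1, len(nodes)):
        for s in range(t):
            if edge_prob.get((s, t), 0.0) > delta_e:
                edges.append((nodes[s], nodes[t]))
    return edges


# Hypothetical generated roles and edge probabilities.
nodes = ["planner", "solver", "critic"]
edge_prob = {(0, 1): 0.9, (0, 2): 0.3, (1, 2): 0.7}
edges = select_edges(nodes, edge_prob, delta_e=0.5)
```

Because edges only point from earlier to later generation steps, a graph selected this way is acyclic by construction, and a larger $\delta_{e}$ yields a sparser initial topology.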

*Conditional Prior.* The conditional prior $p^{\prime}_{\theta}(z\mid q)$ provides a query-specific latent prior for topology generation:

$$p^{\prime}_{\theta}(z\mid q)=\mathcal{N}\!\left(z;f_{\mathrm{prior}}(\mathbf{h}_{q}),I\right), \tag{9}$$

where $f_{\mathrm{prior}}$ is a trainable MLP. Since reference graphs are unavailable at inference time, this conditional prior plays an important role in mapping unseen queries to useful regions of the topology latent space. The resulting topology-prior learning objective is

$$\mathcal{L}_{\mathrm{prior}}=-\mathbb{E}_{z\sim q_{\phi}(z\mid\mathbb{G},q)}\big[\log p_{\theta}(\mathbb{G}\mid z,q)\big]+\mathrm{KL}\!\left(q_{\phi}(z\mid\mathbb{G},q)\,\|\,p^{\prime}_{\theta}(z\mid q)\right). \tag{10}$$
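Because the posterior in Eq. (7) is a diagonal Gaussian and the prior in Eq. (9) is a Gaussian with identity covariance, the KL term in Eq. (10) has the standard closed form $\frac{1}{2}\sum_{j}\big(\sigma_{j}^{2}+(\mu_{j}-m_{j})^{2}-1-\log\sigma_{j}^{2}\big)$, where $m=f_{\mathrm{prior}}(\mathbf{h}_{q})$. The paper does not spell this form out; the sketch below simply evaluates it, and the function and argument names are illustrative.

```python
import math
from typing import Sequence


def kl_to_identity_covariance_prior(
    mu: Sequence[float], log_var: Sequence[float], prior_mean: Sequence[float]
) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(prior_mean, I) ) in closed form."""
    return 0.5 * sum(
        math.exp(lv) + (m - p) ** 2 - 1.0 - lv
        for m, lv, p in zip(mu, log_var, prior_mean)
    )


# The KL term vanishes when the posterior coincides with the prior.
kl_same = kl_to_identity_covariance_prior([0.3, -1.2], [0.0, 0.0], [0.3, -1.2])
# Shifting one mean coordinate by 1 with unit variance costs exactly 0.5.
kl_shifted = kl_to_identity_covariance_prior([1.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

The mean-mismatch term is what pulls the encoder's posterior toward the region that the query-only prior can reach at inference time, when no reference graph is available.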

### III-C Query-Conditioned Latent Adaptation

While transferable topology priors can improve initialization efficiency, effective collaboration still requires sensitivity to query-relevant structural cues that may vary across domains. To this end, we introduce a domain-adversarial discriminator on the latent variable $z$ to reduce unnecessary domain discrepancy during training. The discriminator and its training loss are defined as

$$f_{\Psi}(z)=f_{d}(\mathbf{W}_{d}z+\mathbf{b}_{d}), \tag{11}$$

$$\mathcal{L}_{\mathrm{adapt}}=-\mathbb{E}_{(\mathbb{G},q,y^{k})\sim\mathcal{D}}\left[y^{k\top}\log f_{\Psi}(z)\right], \tag{12}$$

where $f_{d}$ is the softmax function, $y^{k}$ is the one-hot domain label of the sample (not the task answer $y_{i}^{k}$ of Section III-A), and a gradient reversal layer (GRL) \[[12](https://arxiv.org/html/2605.17359#bib.bib46)\] is applied during training. Combined with query-conditioned prior learning and graph reconstruction, this adversarial objective reduces cross-domain discrepancy in the latent space, while the conditional prior and reconstruction objective help preserve query-dependent and task-relevant structural information for topology generation.
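To make the role of the gradient reversal layer concrete, the following sketch computes the discriminator output of Eq. (11), the per-sample cross-entropy of Eq. (12), and the gradient that would flow back to the encoder after reversal. It is a hand-written illustration rather than the authors' code: the reversal scale `lam` is an assumed knob in the spirit of common GRL implementations, and the toy weights are made up.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator(z, w, b) -> List[float]:
    """f_Psi(z) = softmax(W_d z + b_d); w holds one row per domain."""
    logits = [sum(wi * zi for wi, zi in zip(row, z)) + bk for row, bk in zip(w, b)]
    return softmax(logits)


def domain_loss(probs: Sequence[float], domain: int) -> float:
    """Cross-entropy against a one-hot domain label."""
    return -math.log(probs[domain])


def grad_wrt_z(probs, domain, w) -> List[float]:
    """d loss / d z = W_d^T (probs - one_hot(domain))."""
    err = [p - (1.0 if k == domain else 0.0) for k, p in enumerate(probs)]
    return [sum(err[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]


def reverse_gradient(grad: Sequence[float], lam: float = 1.0) -> List[float]:
    """GRL backward pass: the forward pass is the identity, the gradient is negated."""
    return [-lam * g for g in grad]


# Toy two-domain setup (hypothetical values).
w_d = [[1.0, 0.0], [0.0, 1.0]]
b_d = [0.0, 0.0]
z = [0.0, 0.0]
probs = discriminator(z, w_d, b_d)
loss = domain_loss(probs, domain=0)
encoder_grad = reverse_gradient(grad_wrt_z(probs, 0, w_d))
```

The discriminator itself is updated with the ordinary gradient and learns to identify the domain, while the encoder receives the negated gradient and is therefore pushed toward latents from which the domain is harder to predict.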

Algorithm 1: TopoPrior Training Process

Input: Multi-domain training set $\mathcal{D}=\bigcup_{k=1}^{K}\mathcal{D}^{k}$, hyperparameters $\alpha$ and $\beta$, learning rate $\eta$, batch size $B$, number of epochs $E$

Output: Trained TopoPrior parameters $\theta$, $\phi$, $\Psi$

1. Initialize variational encoder $q_{\phi}(z\mid\mathbb{G},q)$ with GCN layers
2. Initialize conditional generator $p_{\theta}(\mathbb{G}\mid z,q)$, prior network $p^{\prime}_{\theta}(z\mid q)$, and domain discriminator $f_{\Psi}(z)$
3. for epoch $=1$ to $E$ do
4. for batch $(q_{i},\mathbb{G}_{i},d_{i})\sim\mathcal{D}$ do
5. Step 1: Encode graph and query
6. Compute node representations $\mathbf{h}_{v}$ via GCN layers
7. Aggregate graph representation $\mathbf{h}_{\mathbb{G}}=\sum_{v}\mathbf{h}_{v}$
8. Encode query $q_{i}$ into $\mathbf{h}_{q}$
9. Fuse query and graph features: $\mathbf{h}_{\text{task}}=\mathrm{MLP}_{\text{task}}([\mathbf{h}_{q}\parallel\mathbf{h}_{\mathbb{G}}])$
10. Sample latent variable $z\sim q_{\phi}(z\mid\mathbb{G},q)$ via reparameterization
11. Step 2: Reconstruct graph
12. Generate nodes and edges autoregressively via $p_{\theta}(\mathbb{G}\mid z,q)$
13. Step 3: Compute losses
14. $\mathcal{L}_{\text{recon}}=-\mathbb{E}_{z\sim q_{\phi}}[\log p_{\theta}(\mathbb{G}\mid z,q)]$
15. $\mathcal{L}_{\text{KL}}=\mathrm{KL}(q_{\phi}(z\mid\mathbb{G},q)\,\|\,p^{\prime}_{\theta}(z\mid q))$
16. $\mathcal{L}_{\text{adapt}}=-\mathbb{E}[y\cdot\log f_{\Psi}(z)]$
17. Step 4: Update parameters
18. $\mathcal{L}_{\text{prior}}=\mathcal{L}_{\text{recon}}+\mathcal{L}_{\text{KL}}$
19. $\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{prior}}+\alpha\mathcal{L}_{\mathrm{task}}+\beta\mathcal{L}_{\text{adapt}}$
20. Update $\theta,\phi,\Psi$ via $\nabla_{\theta,\phi,\Psi}\mathcal{L}_{\text{total}}$
21. end for
22. end for

### III-D Training and Inference

The training objective of TopoPrior combines topology-prior learning and latent-space alignment:

$$\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{prior}}+\alpha\,\mathcal{L}_{\mathrm{task}}+\beta\,\mathcal{L}_{\mathrm{adapt}}, \tag{13}$$

where $\alpha,\beta\geq 0$ are hyperparameters, and $\mathcal{L}_{\mathrm{task}}$ denotes the loss of the underlying task.

During training, TopoPrior learns transferable topology priors from reference graphs constructed on the training domains. During inference, given a new query $q$, we first sample $z\sim p^{\prime}_{\theta}(z\mid q)$ and then generate an initial collaboration graph $\hat{\mathbb{G}}\sim p_{\theta}(\mathbb{G}\mid z,q)$. The generated graph is subsequently passed to a downstream topology-evolution backbone, which refines it according to its original task-specific optimization process. In this way, TopoPrior complements existing topology-evolution methods by providing lightweight and transferable graph initialization rather than replacing their task-specific search mechanisms. The training procedure for TopoPrior is summarized in Algorithm [1](https://arxiv.org/html/2605.17359#alg1).
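The inference flow described above (sample from the query-conditioned prior, generate an initial graph, hand it to a backbone) can be sketched as a small Python function. All callables here are hypothetical stand-ins: the paper does not define these interfaces, and the toy stubs exist only to make the snippet runnable.

```python
import random
from typing import Callable, List, Tuple

# A graph is represented here as (selected roles, directed edges); this is an assumption.
Graph = Tuple[List[str], List[Tuple[str, str]]]


def run_with_prior(
    query: str,
    prior_mean: Callable[[str], List[float]],
    generate: Callable[[List[float], str], Graph],
    refine: Callable[[Graph, str], Graph],
    rng: random.Random,
) -> Graph:
    """Sample z ~ N(f_prior(h_q), I), generate an initial graph, then refine it."""
    mean = prior_mean(query)
    z = [m + rng.gauss(0.0, 1.0) for m in mean]  # identity covariance, as in Eq. (9)
    initial_graph = generate(z, query)
    return refine(initial_graph, query)  # the backbone keeps its own search procedure


# Toy stand-ins for the prior network, the generator, and the backbone.
def toy_prior_mean(query: str) -> List[float]:
    return [0.0, 0.0]


def toy_generate(z: List[float], query: str) -> Graph:
    return (["planner", "solver"], [("planner", "solver")])


def toy_refine(graph: Graph, query: str) -> Graph:
    return graph  # a real backbone would prune or rewire the graph here


final_graph = run_with_prior(
    "What is 2 + 2?", toy_prior_mean, toy_generate, toy_refine, random.Random(0)
)
```

Passing the backbone in as a callable mirrors the stated design goal that the prior supplies only the starting graph and leaves the backbone's refinement untouched.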

### III-E Analytical Perspective

We next provide an analytical perspective on the two design principles underlying TopoPrior. The goal of this subsection is not to establish end-to-end guarantees for LLM-based multi-agent topology evolution, but rather to formalize the intuition behind latent alignment and topology initialization using standard analytical tools. Since both arguments are adapted from classical generalization and convergence analyses, we present concise proof sketches specialized to our setting. Detailed proofs are deferred to Appendix [A](https://arxiv.org/html/2605.17359#A1).

#### III-E1 Cross-Domain Transfer via Latent Alignment

Our framework uses adversarial latent alignment to reduce avoidable domain discrepancy across source domains, which may improve transfer to unseen domains. Let $P_{1},\dots,P_{K}$ denote the marginal latent distributions of the $K$ source domains induced by the encoder, and let $P_{t}$ denote the marginal latent distribution of a target domain. For any hypothesis $h$ in the class $\mathcal{H}$, define the expected error on domain $d$ as

$$\epsilon_{d}(h)=\mathbb{E}_{z\sim P_{d}}[\ell_{h}(z)], \tag{14}$$

where $\ell_{h}$ is a bounded loss function.

We measure domain discrepancy in the latent space by the $\mathcal{H}\Delta\mathcal{H}$-divergence:

$$d_{\mathcal{H}\Delta\mathcal{H}}(P,Q)=2\sup_{h,h^{\prime}\in\mathcal{H}}\Big|\Pr_{z\sim P}[h(z)\neq h^{\prime}(z)]-\Pr_{z\sim Q}[h(z)\neq h^{\prime}(z)]\Big|. \tag{15}$$

By adapting standard multi-source domain adaptation analysis \[[58](https://arxiv.org/html/2605.17359#bib.bib79), [47](https://arxiv.org/html/2605.17359#bib.bib81)\] to the encoder-induced latent space, we obtain the following bound.

###### Theorem 1 (Multi-Source Domain Adaptation Bound in Latent Space).

Let $\alpha\in\Delta^{K-1}$ be convex weights over source domains (distinct from the loss weight $\alpha$ in Eq. (13)), and let $P_{\alpha}=\sum_{k=1}^{K}\alpha_{k}P_{k}$. Then, for any $h\in\mathcal{H}$,

$$\epsilon_{t}(h)\leq\sum_{k=1}^{K}\alpha_{k}\epsilon_{k}(h)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{\alpha},P_{t})+\lambda^{*}_{\alpha,t}, \tag{16}$$

where

$$\lambda^{*}_{\alpha,t}=\min_{h\in\mathcal{H}}\left(\sum_{k=1}^{K}\alpha_{k}\epsilon_{k}(h)+\epsilon_{t}(h)\right). \tag{17}$$

The result follows from applying standard multi-source domain adaptation arguments to the encoder-induced latent distributions and replacing input-space discrepancy with the corresponding $\mathcal{H}\Delta\mathcal{H}$-divergence in latent space \[[58](https://arxiv.org/html/2605.17359#bib.bib79), [47](https://arxiv.org/html/2605.17359#bib.bib81)\]. Theorem [1](https://arxiv.org/html/2605.17359#Thmtheorem1) therefore provides a principled perspective on why reducing latent-space domain discrepancy may be beneficial for cross-domain transfer in TopoPrior. We emphasize that this result should be interpreted as motivation for latent-space regularization rather than as a task-specific guarantee for the full downstream system.
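To make the role of each term in Eq. (16) concrete, the following sketch evaluates the right-hand side of the bound for given inputs. The function name `target_error_bound` and all numeric values are hypothetical; in practice neither the divergence nor $\lambda^{*}_{\alpha,t}$ is directly observable, so this is an illustration of the formula rather than an estimator.

```python
from typing import Sequence


def target_error_bound(weights: Sequence[float],
                       source_errors: Sequence[float],
                       divergence: float,
                       lambda_star: float) -> float:
    """Right-hand side of Eq. (16) for fixed, user-supplied quantities."""
    if len(weights) != len(source_errors):
        raise ValueError("one weight per source domain is required")
    # The weights must lie on the probability simplex.
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    weighted_source_error = sum(w * e for w, e in zip(weights, source_errors))
    return weighted_source_error + 0.5 * divergence + lambda_star


# Hypothetical values: two source domains, before and after alignment.
before = target_error_bound([0.5, 0.5], [0.2, 0.3], divergence=0.4, lambda_star=0.05)
after = target_error_bound([0.5, 0.5], [0.2, 0.3], divergence=0.1, lambda_star=0.05)
```

With the source errors and $\lambda^{*}_{\alpha,t}$ held fixed, lowering the divergence term is the only change between `before` and `after`, which mirrors the motivation given for latent-space alignment.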

#### III-E2 Topology Initialization as Search Acceleration

Our topology initializer is designed to provide stronger starting collaboration graphs, thereby reducing the number of refinement rounds required by downstream topology evolution. Let

$$J_{\lambda}(\mathbb{G};q,y)=\mathrm{Perf}(\mathbb{G};q,y)-\lambda C(\mathbb{G};q) \tag{18}$$

denote a utility function that combines task performance and communication cost, where $\lambda\geq 0$ controls the trade-off between the two terms. Define

$$U_{t}=\mathbb{E}[J_{\lambda}(\mathbb{G}_{t};q,y)] \tag{19}$$

as the expected utility at evolution step $t$, starting from an initial graph $\mathbb{G}_{0}$.

To analyze the role of initialization, we adopt an abstract contraction-style assumption commonly used in convergence analysis of iterative optimization procedures \[[40](https://arxiv.org/html/2605.17359#bib.bib109)\]. Although downstream topology evolution in our setting is discrete and LLM-mediated, this assumption is used only as a simplified analytical model for studying the effect of initialization quality.

###### Assumption 1 (Linear Convergence).

There exists $\eta\in(0,1]$ such that, for all $t$,

$$U^{*}-U_{t+1}\leq(1-\eta)(U^{*}-U_{t}), \tag{20}$$

where $U^{*}$ is the optimal achievable utility.

Under Assumption [1](https://arxiv.org/html/2605.17359#Thmassumption1), we obtain the following round-complexity bound.

###### Theorem 2 (Rounds to $\epsilon$-Suboptimality).

Under Assumption [1](https://arxiv.org/html/2605.17359#Thmassumption1), to ensure $U^{*}-U_{T}\leq\epsilon$, it suffices that

$$T\geq\frac{\log((U^{*}-U_{0})/\epsilon)}{\log(1/(1-\eta))}. \tag{21}$$

The bound is obtained by recursively unrolling Assumption [1](https://arxiv.org/html/2605.17359#Thmassumption1), which gives $U^{*}-U_{T}\leq(1-\eta)^{T}(U^{*}-U_{0})$, and then solving for $T$.
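For completeness, the algebra behind the final step can be written out as follows. This sketch assumes $0<\eta<1$ and $U^{*}-U_{0}>\epsilon>0$, so that both logarithms are well defined and positive; dividing by $\log(1-\eta)<0$ is what flips the inequality. In the boundary case $\eta=1$ permitted by Assumption 1, the recursion gives $U^{*}-U_{1}\leq 0$ directly, so a single round suffices.

$$
\begin{aligned}
(1-\eta)^{T}\,(U^{*}-U_{0}) &\leq \epsilon \\
\iff\quad (1-\eta)^{T} &\leq \frac{\epsilon}{U^{*}-U_{0}} \\
\iff\quad T\,\log(1-\eta) &\leq \log\frac{\epsilon}{U^{*}-U_{0}} \\
\iff\quad T &\geq \frac{\log\big((U^{*}-U_{0})/\epsilon\big)}{\log\big(1/(1-\eta)\big)}.
\end{aligned}
$$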

###### Corollary 1 (Better Initialization Reduces Rounds).

If the prior-based initialization satisfies $U_{0}^{\mathrm{prior}}>U_{0}^{\mathrm{scratch}}$, then for any $\epsilon$,

$$T_{\mathrm{prior}}(\epsilon)-T_{\mathrm{scratch}}(\epsilon)\leq\frac{\log\!\left(\frac{U^{*}-U_{0}^{\mathrm{prior}}}{U^{*}-U_{0}^{\mathrm{scratch}}}\right)}{\log(1/(1-\eta))}<0. \tag{22}$$

The corollary follows directly by comparing the bounds induced by $U_{0}^{\mathrm{prior}}$ and $U_{0}^{\mathrm{scratch}}$ in Theorem [2](https://arxiv.org/html/2605.17359#Thmtheorem2). It formalizes the intuition that better initialization can reduce the search cost of downstream topology evolution under the stated assumption. This interpretation is consistent with the lower token usage and fewer communication rounds observed in our experiments, although the result should be viewed as an analytical perspective rather than a literal guarantee for multi-agent LLM systems.
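The effect described by the corollary can be illustrated numerically. The sketch below computes the smallest integer number of rounds satisfying Eq. (21); the function name `rounds_to_eps` and the utility values are hypothetical and are not taken from the experiments.

```python
import math


def rounds_to_eps(u_star: float, u0: float, eta: float, eps: float) -> int:
    """Smallest integer T satisfying the sufficient condition in Eq. (21)."""
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must lie in (0, 1]")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    gap = u_star - u0
    if gap <= eps:
        return 0  # the initial graph is already eps-suboptimal
    if eta == 1.0:
        return 1  # Assumption 1 closes the whole gap in one step
    return math.ceil(math.log(gap / eps) / math.log(1.0 / (1.0 - eta)))


# Hypothetical setting: same contraction rate, different starting utilities.
t_scratch = rounds_to_eps(u_star=1.0, u0=0.2, eta=0.5, eps=0.01)
t_prior = rounds_to_eps(u_star=1.0, u0=0.6, eta=0.5, eps=0.01)
```

Because the initial gap enters only through a logarithm, halving the gap saves about one round when $\eta=0.5$; under this abstract model, larger savings require a substantially better starting graph or a slower contraction rate.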

TABLE I: Test-set statistics under the domain partitions used in our experiments. For training, we follow the official train/validation splits provided by MMLU and C-Eval.

| Benchmark | Domain | Test Size |
| --- | --- | --- |
| MMLU | Natural Sciences | 6696 |
| MMLU | Engineering and Technology | 1116 |
| MMLU | Social Sciences | 2072 |
| MMLU | Humanities and History | 1674 |
| MMLU | Law, Government, and Public Affairs | 1825 |
| MMLU | Ethics and Morality | 558 |
| MMLU | Business and Management | 1453 |
| C-Eval | Natural Sciences | 2312 |
| C-Eval | Engineering and Technology | 2424 |
| C-Eval | Social Sciences and Humanities | 4618 |
| C-Eval | Vocational Qualifications and Professional Examinations | 1581 |
| C-Eval | Medicine and Life Sciences | 1227 |
| C-Eval Hard | Mathematics | 835 |
| C-Eval Hard | Chemistry | 581 |
| C-Eval Hard | Physics | 529 |

## IV Experiments

In this section, we evaluate TopoPrior on multi-domain reasoning benchmarks and compare it with representative baselines, with a particular focus on topology-evolution methods. We report both downstream task performance and efficiency-related metrics, including online inference-time token usage and communication rounds. We further conduct ablation, sensitivity, and generalization analyses to examine the effectiveness of transferable topology prior learning under different settings.

### IV-A Datasets

We evaluate TopoPrior on two widely used LLM benchmarks, MMLU \[[15](https://arxiv.org/html/2605.17359#bib.bib78)\] and C-Eval \[[21](https://arxiv.org/html/2605.17359#bib.bib26)\], under multi-domain settings. To construct domains for topology-prior learning, we group fine-grained benchmark subcategories into a smaller number of major disciplinary categories according to semantic relatedness and the original benchmark taxonomy. The resulting test-set statistics are summarized in Table [I](https://arxiv.org/html/2605.17359#S3.T1).

MMLU. MMLU contains 57 subject-level tasks. We organize them into seven major domains according to their disciplinary themes: (1) Natural Sciences, (2) Engineering and Technology, (3) Social Sciences, (4) Humanities and History, (5) Law, Government, and Public Affairs, (6) Ethics and Morality, and (7) Business and Management. Together, these domains cover quantitative reasoning, technical problem solving, legal and policy analysis, ethical judgment, and knowledge-intensive social and historical understanding.

C-Eval. C-Eval contains 52 subject-level tasks designed to assess the general understanding capability of Chinese LLMs. We group these tasks into five major domains: (1) Natural Sciences, (2) Engineering and Technology, (3) Social Sciences and Humanities, (4) Vocational Qualifications and Professional Examinations, and (5) Medicine and Life Sciences. These categories cover both academic disciplines and professional qualification scenarios, providing a diverse testbed for multi-domain collaboration.

C-Eval Hard. To further evaluate robustness on challenging scientific reasoning tasks, we construct *C-Eval Hard* by extracting three natural-science subdomains from C-Eval: (1) Mathematics, (2) Chemistry, and (3) Physics. This benchmark focuses on difficult quantitative reasoning and serves as an additional evaluation setting for cross-domain transfer under domain variation.

### IV-B Baselines

Training-free methods. We consider vanilla prompt engineering (PE) with Llama3-8B-Instruct \[[48](https://arxiv.org/html/2605.17359#bib.bib62)\], Qwen2.5-72B-Instruct \[[53](https://arxiv.org/html/2605.17359#bib.bib66)\], and DeepSeek-V3-671B-Instruct \[[7](https://arxiv.org/html/2605.17359#bib.bib65)\] as representative single-agent backbones. We further include Chain-of-Thought (CoT) \[[51](https://arxiv.org/html/2605.17359#bib.bib54)\] and Retrieval-Augmented Generation (RAG) \[[28](https://arxiv.org/html/2605.17359#bib.bib99)\] as stronger prompting-based baselines.

Training-intensive methods. We compare with representative multi-domain adaptation methods, including MoDULA \[[38](https://arxiv.org/html/2605.17359#bib.bib20)\], MoDE \[[45](https://arxiv.org/html/2605.17359#bib.bib103)\], and DES-MoE \[[32](https://arxiv.org/html/2605.17359#bib.bib30)\]. These methods improve domain specialization through parameter-efficient fine-tuning or mixture-of-experts architectures.

Training-light multi-agent methods. We evaluate TopoPrior on top of several topology-evolution backbones, including G-Designer \[[55](https://arxiv.org/html/2605.17359#bib.bib100)\], AgentPrune \[[54](https://arxiv.org/html/2605.17359#bib.bib8)\], ARG-Designer \[[33](https://arxiv.org/html/2605.17359#bib.bib47)\], and AgentDropout \[[50](https://arxiv.org/html/2605.17359#bib.bib39)\]. These methods dynamically optimize collaboration topologies through graph learning, pruning, autoregressive generation, or agent/edge dropout, respectively. Our method is complementary to these approaches, as it provides transferable graph initialization while leaving their downstream topology-refinement procedures unchanged.

TABLE II: Agent roles used in TopoPrior, aligned with the domain partitions of MMLU and C-Eval. Roles are selected dynamically according to the query content and task context.

| Role Name | Domain | Role Description | Sub-domain |
| --- | --- | --- | --- |
| Natural Science Expert | Natural Sciences | Provides knowledge in physics, chemistry, biology, and medicine, and handles formal reasoning with domain-specific terminology. | Math, Physics, Chemistry, Biology |
| Engineering Specialist | Engineering & Technology | Solves queries in computer science, algorithms, security, and systems design using structured technical reasoning. | Computer Science, ML, Security |
| Social Scientist | Social Sciences | Analyzes economic, psychological, sociological, and political concepts, and performs causal and qualitative reasoning. | Economics, Psychology, Political Science |
| Humanities Scholar | Humanities & History | Interprets historical events, philosophical arguments, cultural contexts, and ethical frameworks. | History, Philosophy, World Religions |
| Legal Analyst | Law, Government, and Public Affairs | Interprets legal texts, statutes, treaties, and policy documents, and performs normative and juridical reasoning. | Jurisprudence, International Law |
| Ethics Consultant | Ethics & Morality | Evaluates moral dilemmas, ethical scenarios, and value-based judgments, especially in subjective contexts. | Moral Controversies, Ethical Scenarios |
| Business Strategist | Business & Management | Analyzes business ethics, accounting, marketing, and management strategies by combining normative and financial reasoning. | Business Ethics, Accounting, Marketing |
| Mathematical Expert | Mathematics | Solves mathematical queries, including discrete mathematics, probability, statistics, and algebra. | Discrete Math, Probability, Statistics |
| Chemistry Specialist | Chemistry | Answers chemistry questions at the high-school and college levels, and explains chemical reactions, properties, and theories. | General Chemistry, Organic Chemistry |
| Physics Specialist | Physics | Handles physics questions at the high-school and college levels, and applies principles from mechanics, electromagnetism, and related areas. | Mechanics, Electromagnetism |
| Medical Life Scientist | Medical | Provides knowledge in clinical medicine, veterinary science, agronomy, and plant sciences. | Clinical Medicine, Veterinary Science |
| Vocational Examiner | Vocational Qualifications | Answers certification-oriented questions in finance, taxation, civil service, and tourism. | CPA, Tax Agent, Civil Service |
| General Coordinator | Multi-domain | Orchestrates collaboration, aggregates outputs, and manages communication flow among specialized agents. | All domains (coordination role) |

TABLE III: Results on MMLU across domains with different LLM backbones. “GD”, “AP”, “ARG”, and “AD” denote G-Designer, AgentPrune, ARG-Designer, and AgentDropout, respectively. Best results are in bold and second-best results are underlined.

Base model: Llama3-8B-Instruct

| Model | Natural Sciences | Engineering and Technology | Social Sciences | Humanities and History | Law, Government, and Public Affairs | Ethics and Morality | Business and Management | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PE | 57.84 | 58.89 | 56.61 | 60.05 | 49.84 | 45.72 | 59.40 | 55.05 |
| CoT | 58.97 (↑1.13) | 61.51 (↑2.62) | 59.02 (↑2.41) | 62.48 (↑2.43) | 50.65 (↑0.81) | 46.76 (↑1.04) | 60.68 (↑1.28) | 57.15 |
| RAG | 61.65 (↑3.81) | 62.39 (↑3.50) | 62.95 (↑6.34) | 65.37 (↑5.32) | 58.03 (↑8.19) | 54.50 (↑8.78) | 62.83 (↑3.43) | 61.10 |
| MoDULA | 56.89 (↓0.95) | 58.72 (↓0.17) | 57.95 (↑1.34) | 61.22 (↑1.17) | 51.60 (↑1.76) | 48.42 (↑2.70) | 60.53 (↑1.13) | 56.47 |
| MoDE | 59.72 (↑1.88) | 63.05 (↑4.16) | 62.45 (↑5.84) | 64.69 (↑4.64) | 56.79 (↑6.95) | 54.11 (↑8.39) | 61.06 (↑1.66) | 60.12 |
| DES-MoE | 58.88 (↑1.04) | 62.34 (↑3.45) | 60.70 (↑4.09) | 62.63 (↑2.58) | 54.84 (↑5.00) | 52.98 (↑7.26) | 61.33 (↑1.93) | 59.10 |
| G-Designer | 64.82 (↑6.98) | 66.57 (↑7.68) | 63.91 (↑7.30) | 67.70 (↑7.65) | 61.58 (↑11.74) | 59.13 (↑13.41) | 65.44 (↑6.04) | 64.16 |
| AgentPrune | 63.99 (↑6.15) | 66.85 (↑7.96) | 64.73 (↑8.12) | 66.98 (↑6.93) | 63.07 (↑13.23) | 58.51 (↑12.79) | 64.12 (↑4.72) | 64.04 |
| ARG-Designer | 65.48 (↑7.64) | 68.01 (↑9.12) | 64.95 (↑8.34) | 69.54 (↑9.49) | 63.86 (↑14.02) | 61.32 (↑15.60) | 65.87 (↑6.47) | 65.58 |
| AgentDropout | 64.30 (↑6.46) | 69.72 (↑10.83) | 63.14 (↑6.53) | 68.86 (↑8.81) | 65.47 (↑15.63) | 62.05 (↑16.33) | 66.53 (↑7.13) | 65.72 |
| TopoPrior+GD | 66.32 (↑8.48) | 67.80 (↑8.91) | 65.42 (↑8.81) | 70.41 (↑10.36) | 63.96 (↑14.12) | 60.85 (↑15.13) | 67.18 (↑7.78) | 65.99 |
| TopoPrior+AP | 65.34 (↑7.50) | 68.75 (↑9.86) | 65.19 (↑8.58) | 69.16 (↑9.11) | 63.93 (↑14.09) | 61.80 (↑16.08) | 67.62 (↑8.22) | 65.97 |
| TopoPrior+ARG | 68.91 (↑11.07) | 71.57 (↑12.68) | 68.13 (↑11.52) | 72.74 (↑12.69) | 66.15 (↑16.31) | 64.08 (↑18.36) | 68.14 (↑8.74) | 68.53 |
| TopoPrior+AD | 65.70 (↑7.86) | 69.95 (↑11.06) | 65.47 (↑8.86) | 70.36 (↑10.31) | 66.88 (↑17.04) | 63.75 (↑18.03) | 69.05 (↑9.65) | 67.31 |

Base model: DeepSeek-V3-671B-Instruct

| Model | Natural Sciences | Engineering and Technology | Social Sciences | Humanities and History | Law, Government, and Public Affairs | Ethics and Morality | Business and Management | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PE | 84.96 | 85.23 | 84.78 | 86.41 | 79.63 | 78.92 | 85.12 | 84.47 |
| CoT | 84.60 (↓0.36) | 84.85 (↓0.38) | 85.32 (↑0.54) | 86.87 (↑0.46) | 80.19 (↑0.56) | 79.52 (↑0.60) | 85.57 (↑0.45) | 84.90 |
| RAG | 85.92 (↑0.96) | 86.17 (↑0.94) | 85.84 (↑1.06) | 87.63 (↑1.22) | 82.45 (↑2.82) | 81.78 (↑2.86) | 86.05 (↑0.93) | 85.38 |
| G-Designer | 89.23 (↑4.27) | 89.67 (↑4.44) | 88.91 (↑4.13) | 91.08 (↑4.67) | 84.12 (↑4.49) | 83.46 (↑4.54) | 89.35 (↑4.23) | 88.52 |
| AgentPrune | 89.84 (↑4.88) | 90.03 (↑4.80) | 89.52 (↑4.74) | 91.47 (↑5.06) | 84.73 (↑5.10) | 83.98 (↑5.06) | 89.76 (↑4.64) | 89.19 |
| ARG-Designer | 91.15 (↑6.19) | 91.80 (↑6.57) | 91.07 (↑6.29) | 93.69 (↑7.28) | 86.28 (↑6.65) | 85.52 (↑6.60) | 91.48 (↑6.36) | 90.14 |
| AgentDropout | 89.97 (↑5.01) | 90.50 (↑5.27) | 90.11 (↑5.33) | 91.93 (↑5.52) | 85.29 (↑5.66) | 84.89 (↑5.97) | 90.25 (↑5.13) | 88.99 |
| TopoPrior+GD | 90.12 (↑5.16) | 90.46 (↑5.23) | 89.82 (↑5.04) | 92.37 (↑5.96) | 85.43 (↑5.80) | 84.71 (↑5.79) | 90.28 (↑5.16) | 89.89 |
| TopoPrior+AP | 90.75 (↑5.79) | 90.90 (↑5.67) | 90.41 (↑5.63) | 92.89 (↑6.48) | 86.01 (↑6.38) | 85.31 (↑6.39) | 90.65 (↑5.53) | 90.41 |
| TopoPrior+ARG | 92.87 (↑7.91) | 93.14 (↑7.91) | 92.55 (↑7.77) | 95.12 (↑8.71) | 87.83 (↑8.20) | 87.08 (↑8.16) | 92.79 (↑7.67) | 92.03 |
| TopoPrior+AD | 91.78 (↑6.82) | 92.09 (↑6.86) | 91.33 (↑6.55) | 93.95 (↑7.54) | 86.95 (↑7.32) | 86.51 (↑7.59) | 91.74 (↑6.62) | 90.62 |

Base model: Qwen2.5-72B-Instruct

| Model | Natural Sciences | Engineering and Technology | Social Sciences | Humanities and History | Law, Government, and Public Affairs | Ethics and Morality | Business and Management | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PE | 84.17 | 84.44 | 83.86 | 85.53 | 79.84 | 79.19 | 83.92 | 83.26 |
| CoT | 85.02 (↑0.85) | 84.40 (↓0.04) | 84.79 (↑0.93) | 86.71 (↑1.18) | 81.17 (↑1.33) | 80.42 (↑1.23) | 85.07 (↑1.15) | 84.41 |
| RAG | 85.71 (↑1.54) | 86.02 (↑1.58) | 85.48 (↑1.62) | 87.43 (↑1.90) | 82.33 (↑2.49) | 81.58 (↑2.39) | 85.73 (↑1.81) | 85.56 |
| G-Designer | 87.42 (↑3.25) | 87.55 (↑3.11) | 87.20 (↑3.34) | 88.97 (↑3.44) | 83.19 (↑3.35) | 82.58 (↑3.39) | 87.43 (↑3.51) | 86.33 |
| AgentPrune | 86.67 (↑2.50) | 86.98 (↑2.54) | 86.44 (↑2.58) | 88.42 (↑2.89) | 82.56 (↑2.72) | 81.81 (↑2.62) | 86.69 (↑2.77) | 85.71 |
| ARG-Designer | 88.50 (↑4.33) | 88.87 (↑4.43) | 88.41 (↑4.55) | 90.40 (↑4.87) | 84.49 (↑4.65) | 83.55 (↑4.36) | 88.69 (↑4.77) | 87.56 |
| AgentDropout | 87.63 (↑3.46) | 87.79 (↑3.35) | 87.40 (↑3.54) | 89.32 (↑3.79) | 83.59 (↑3.75) | 82.70 (↑3.51) | 87.74 (↑3.82) | 86.60 |
| TopoPrior+GD | 88.31 (↑4.14) | 88.44 (↑4.00) | 88.13 (↑4.27) | 90.01 (↑4.48) | 84.30 (↑4.46) | 83.29 (↑4.10) | 88.32 (↑4.40) | 87.26 |
| TopoPrior+AP | 87.74 (↑3.57) | 87.61 (↑3.27) | 87.49 (↑3.63) | 89.33 (↑3.80) | 83.58 (↑3.74) | 82.73 (↑3.54) | 87.64 (↑3.72) | 86.59 |
| TopoPrior+ARG | 90.87 (↑6.70) | 91.18 (↑6.74) | 90.64 (↑6.78) | 92.62 (↑7.09) | 86.76 (↑6.92) | 86.01 (↑6.82) | 90.89 (↑6.97) | 89.56 |
| TopoPrior+AD | 89.48 (↑5.31) | 89.62 (↑5.18) | 89.35 (↑5.49) | 91.26 (↑5.73) | 85.45 (↑5.61) | 84.75 (↑5.56) | 89.89 (↑5.97) | 88.54 |

### IV-C Implementation Details

We implement TopoPrior in PyTorch and conduct experiments on two NVIDIA A800 GPUs. Unless otherwise specified, TopoPrior is trained separately on each benchmark under its corresponding multi-domain partition.

Backbones and encoders. Agent roles are instantiated using LLM backbones, including the locally deployed Llama3-8B-Instruct \[[48](https://arxiv.org/html/2605.17359#bib.bib62)\] and the online models Qwen2.5-72B-Instruct \[[53](https://arxiv.org/html/2605.17359#bib.bib66)\] and DeepSeek-V3-671B-Instruct \[[7](https://arxiv.org/html/2605.17359#bib.bib65)\]. The query representation $\mathbf{h}_{q}$ and the initial agent-role representation $\mathbf{h}_{v}^{(0)}$ are encoded using a frozen BERT encoder \[[8](https://arxiv.org/html/2605.17359#bib.bib92)\]. Unless otherwise stated, gold domain labels are used only for benchmark partitioning and analysis and are not provided to TopoPrior at inference time.

TopoPrior architecture. The variational encoder is implemented as a two-layer GCN \[[25](https://arxiv.org/html/2605.17359#bib.bib38)\] with hidden size $d_{\mathbf{h}_{v}}=256$, latent dimension $d_{z}=128$, and ReLU activation \[[1](https://arxiv.org/html/2605.17359#bib.bib37)\]. Linear projections are implemented with two-layer MLPs. For node and edge history encoding, we use a two-layer GRU \[[6](https://arxiv.org/html/2605.17359#bib.bib19)\]. The latent-space discriminator is implemented as a two-layer MLP with a softmax output layer. The edge-generation threshold is set to $\delta_{e}=0.5$, and the gradient reversal layer uses a coefficient of $-0.1$.
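One plausible reading of the edge-generation threshold is that a candidate edge is kept when its predicted probability exceeds $\delta_{e}$. The sketch below illustrates that reading only; the helper name `select_edges`, the strict comparison, the directed-edge representation, and the probabilities are all assumptions, with role names borrowed from Table II for readability.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (sender role, receiver role); an assumed representation


def select_edges(edge_probs: Dict[Edge, float], delta_e: float = 0.5) -> List[Edge]:
    """Keep edges whose predicted probability is strictly above delta_e."""
    return sorted(edge for edge, prob in edge_probs.items() if prob > delta_e)


# Hypothetical generator outputs for a query touching law and ethics.
probs = {
    ("General Coordinator", "Legal Analyst"): 0.82,
    ("Legal Analyst", "Ethics Consultant"): 0.41,
    ("Ethics Consultant", "General Coordinator"): 0.67,
}
kept = select_edges(probs)
```

Under these assumed probabilities, only the 0.41 edge falls below the threshold and is dropped; raising `delta_e` would yield sparser initial graphs and lowering it denser ones.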

Training protocol. We optimize the model with Adam using a learning rate of $2\times 10^{-4}$ and a batch size of 32. The loss weights in Eq. (13) are set to $\alpha=0.5$ and $\beta=0.5$, and the topology-prior generator is trained for five epochs. Reference collaboration graphs are constructed offline on the training split using AgentDropout \[[50](https://arxiv.org/html/2605.17359#bib.bib39)\]. During evaluation, the generated initial graph is passed to the corresponding downstream topology-evolution backbone for task-specific refinement. For fair comparison, integrating TopoPrior changes only the initialization stage of each backbone while keeping its downstream refinement procedure and search budget unchanged, unless otherwise specified.
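For readers reproducing the setup, the hyperparameters reported in this subsection can be collected in one place. The dataclass below is only an organizational sketch: the class name `TopoPriorConfig` and its field names are hypothetical, while the default values are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopoPriorConfig:
    """Hypothetical container for the reported hyperparameters."""
    hidden_dim: int = 256            # GCN hidden size
    latent_dim: int = 128            # latent dimension d_z
    edge_threshold: float = 0.5      # edge-generation threshold delta_e
    grl_coefficient: float = -0.1    # gradient reversal coefficient
    learning_rate: float = 2e-4      # Adam learning rate
    batch_size: int = 32
    alpha: float = 0.5               # weight of the task loss in Eq. (13)
    beta: float = 0.5                # weight of the adaptation loss in Eq. (13)
    epochs: int = 5                  # training epochs for the prior generator
    reference_graph_source: str = "AgentDropout"


config = TopoPriorConfig()
```

Freezing the dataclass keeps a run's settings immutable, so a variant (for example a different `beta` in a sensitivity study) has to be created explicitly with `dataclasses.replace`.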

Role pool and graph construction. The conditional generator $p_{\theta}(\mathbb{G}\mid z,q)$ autoregressively selects agent nodes from an extensible role pool. Table [II](https://arxiv.org/html/2605.17359#S4.T2) summarizes the agent roles associated with the major domains in MMLU and C-Eval. For natural-science tasks, we include both a generic Natural Science Expert and specialized roles (e.g., Mathematical Expert, Chemistry Specialist, and Physics Specialist) to better support challenging subdomains. For efficient offline supervision construction, we use AgentDropout \[[50](https://arxiv.org/html/2605.17359#bib.bib39)\] to obtain reference communication topologies, as it is substantially faster than more iterative topology-construction methods on large training sets. Each agent node $v_{i,j}^{k}=\{\texttt{LLM},\texttt{Role},\texttt{State},\texttt{Plugins}\}$ encodes the underlying language model, assigned role, current state, and any attached tools or plugins \[[54](https://arxiv.org/html/2605.17359#bib.bib8)\].

Evaluation protocol. Unless otherwise specified, we report classification accuracy (Acc.) on domain-specific test splits. For efficiency analysis, token consumption is measured as the total number of LLM input and output tokens incurred during online inference, including inter-agent communication and final answer generation, but excluding the offline cost of reference-graph construction and topology-prior training. We report these metrics as online inference efficiency and discuss offline supervision and training cost separately when relevant. For ARG-Designer \[[33](https://arxiv.org/html/2605.17359#bib.bib47)\], which generates graph topologies autoregressively from scratch, we incorporate the latent representation from TopoPrior into its initial query encoding (i.e., $\mathbf{f}_{\mathcal{Q}}$ in its Eq. (6)) as the initialization signal.
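The token-accounting rule above can be sketched as a simple filter-and-sum over logged LLM calls. The `LLMCall` record, its `online` flag, and the example counts are hypothetical; the only part taken from the text is that input and output tokens are both counted and that offline calls are excluded.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LLMCall:
    """One logged LLM invocation (hypothetical record layout)."""
    input_tokens: int
    output_tokens: int
    online: bool  # False for offline reference-graph construction or prior training


def online_token_usage(calls: Iterable[LLMCall]) -> int:
    """Total input plus output tokens over online-inference calls only."""
    return sum(c.input_tokens + c.output_tokens for c in calls if c.online)


# Hypothetical log: two online agent messages and one offline call.
log = [
    LLMCall(input_tokens=120, output_tokens=40, online=True),
    LLMCall(input_tokens=300, output_tokens=80, online=True),
    LLMCall(input_tokens=1000, output_tokens=500, online=False),
]
usage = online_token_usage(log)
```

Keeping the offline calls in the same log with a flag, rather than discarding them, makes it straightforward to report the offline cost separately, as the protocol suggests.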

TABLE IV: Results on C-Eval and C-Eval Hard across domains with different LLM backbones. “GD”, “AP”, “ARG”, and “AD” denote G-Designer, AgentPrune, ARG-Designer, and AgentDropout. Best results are in bold and second-best results are underlined.

Base model: Llama3-8B-Instruct

| Model | Natural Sciences | Engineering and Technology | Social Sciences and Humanities | Vocational Qualif. and Prof. Exam. | Medicine and Life Sci. | C-Eval Hard: Math | C-Eval Hard: Chemistry | C-Eval Hard: Physics | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PE | 54.73 | 53.64 | 55.02 | 48.69 | 46.25 | 41.83 | 44.06 | 40.12 | 48.04 |
| CoT | 55.85 (↑1.12) | 54.47 (↑0.83) | 55.48 (↑0.46) | 50.36 (↑1.67) | 48.07 (↑1.82) | 42.99 (↑1.16) | 45.80 (↑1.74) | 41.84 (↑1.72) | 49.36 |
| RAG | 57.45 (↑2.72) | 56.59 (↑2.95) | 56.98 (↑1.96) | 51.75 (↑3.06) | 50.58 (↑4.33) | 44.83 (↑3.00) | 47.06 (↑3.00) | 44.12 (↑4.00) | 51.17 |
| MoDULA | 54.37 (↓0.36) | 52.95 (↓0.69) | 55.84 (↑0.82) | 49.72 (↑1.03) | 47.55 (↑1.30) | 41.80 (↓0.03) | 45.62 (↑1.56) | 41.23 (↑1.11) | 48.64 |
| MoDE | 54.61 (↑0.12) | 53.97 (↑0.33) | 56.14 (↑1.12) | 50.80 (↑2.11) | 49.25 (↑3.00) | 42.77 (↑0.94) | 44.02 (↓0.04) | 42.52 (↑2.40) | 49.26 |
| DES-MoE | 55.15 (↑0.42) | 53.42 (↓0.22) | 56.50 (↑1.48) | 51.31 (↑2.62) | 48.84 (↑2.59) | 43.96 (↑2.13) | 43.58 (↓0.48) | 41.96 (↑1.84) | 49.34 |
| G-Designer | 59.60 (↑4.87) | 57.86 (↑4.22) | 57.15 (↑2.13) | 54.08 (↑5.39) | 52.63 (↑6.38) | 45.91 (↑4.08) | 50.11 (↑6.05) | 46.32 (↑6.20) | 52.96 |
| AgentPrune | 59.48 (↑4.75) | 58.16 (↑4.52) | 56.93 (↑1.91) | 54.50 (↑5.81) | 52.09 (↑5.84) | 45.24 (↑3.41) | 50.85 (↑6.79) | 45.88 (↑5.76) | 52.89 |
| ARG-Designer | 60.82 (↑6.09) | 59.52 (↑5.88) | 57.11 (↑2.09) | 56.27 (↑7.58) | 53.85 (↑7.60) | 46.73 (↑4.90) | 52.59 (↑8.53) | 47.14 (↑7.02) | 54.25 |
| AgentDropout | 60.40 (↑5.67) | 58.86 (↑5.22) | 57.10 (↑2.08) | 55.79 (↑7.10) | 53.54 (↑7.29) | 45.98 (↑4.15) | 52.10 (↑8.04) | 46.96 (↑6.84) | 53.84 |
| TopoPrior+GD | 61.83 (↑7.10) | 58.72 (↑5.08) | 59.05 (↑4.03) | 56.51 (↑7.82) | 54.30 (↑8.05) | 46.95 (↑5.12) | 52.63 (↑8.57) | 48.41 (↑8.29) | 54.80 |
| TopoPrior+AP | 61.16 (↑6.43) | 60.35 (↑6.71) | 58.84 (↑3.82) | 56.07 (↑7.38) | 53.98 (↑7.73) | 47.72 (↑5.89) | 52.93 (↑8.87) | 48.22 (↑8.10) | 54.91 |
| TopoPrior+ARG | 63.54 (↑8.81) | 62.60 (↑8.96) | 60.21 (↑5.19) | 58.15 (↑9.46) | 55.82 (↑9.57) | 49.33 (↑7.50) | 55.76 (↑11.70) | 50.89 (↑10.77) | 57.04 |
| TopoPrior+AD | 62.80 (↑8.07) | 61.42 (↑7.78) | 58.96 (↑3.94) | 57.40 (↑8.71) | 55.36 (↑9.11) | 48.44 (↑6.61) | 53.86 (↑9.80) | 49.87 (↑9.75) | 56.01 |

Base model: DeepSeek-V3-671B-Instruct

| Model | Natural Sciences | Engineering and Technology | Social Sciences and Humanities | Vocational Qualif. and Prof. Exam. | Medicine and Life Sci. | C-Eval Hard: Math | C-Eval Hard: Chemistry | C-Eval Hard: Physics | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PE | 86.88 | 84.75 | 87.94 | 79.76 | 77.39 | 71.21 | 74.48 | 69.64 | 79.01 |
| CoT | 86.63 (↓0.25) | 85.62 (↑0.87) | 88.64 (↑0.70) | 81.75 (↑1.99) | 78.35 (↑0.96) | 72.78 (↑1.57) | 75.62 (↑1.14) | 70.81 (↑1.17) | 80.03 |
| RAG | 88.82 (↑1.94) | 86.88 (↑2.13) | 89.70 (↑1.76) | 81.92 (↑2.16) | 79.54 (↑2.15) | 73.39 (↑2.18) | 76.85 (↑2.37) | 72.03 (↑2.39) | 81.14 |
| G-Designer | 90.35 (↑3.47) | 88.52 (↑3.77) | 91.40 (↑3.46) | 83.43 (↑3.67) | 80.82 (↑3.43) | 75.28 (↑4.07) | 78.93 (↑4.45) | 74.14 (↑4.50) | 82.86 |
| AgentPrune | 90.62 (↑3.74) | 88.84 (↑4.09) | 90.98 (↑3.04) | 84.11 (↑4.35) | 81.08 (↑3.69) | 74.45 (↑3.24) | 81.15 (↑6.67) | 75.32 (↑5.68) | 83.32 |
| ARG-Designer | 91.49 (↑4.61) | 89.87 (↑5.12) | 91.90 (↑3.96) | 85.83 (↑6.07) | 82.76 (↑5.37) | 76.50 (↑5.29) | 81.97 (↑7.49) | 76.06 (↑6.42) | 84.42 |
| AgentDropout | 90.83 (↑3.95) | 88.15 (↑3.40) | 91.22 (↑3.28) | 83.64 (↑3.88) | 81.66 (↑4.27) | 76.17 (↑4.96) | 80.94 (↑6.46) | 75.80 (↑6.16) | 83.55 |
| TopoPrior+GD | 91.80 (↑4.92) | 89.67 (↑4.92) | 92.52 (↑4.58) | 85.33 (↑5.57) | 82.95 (↑5.56) | 76.84 (↑5.63) | 80.50 (↑6.02) | 76.69 (↑7.05) | 84.54 |
| TopoPrior+AP | 91.19 (↑4.31) | 90.51 (↑5.76) | 92.27 (↑4.33) | 85.30 (↑5.54) | 82.78 (↑5.39) | 77.03 (↑5.82) | 82.41 (↑7.93) | 76.98 (↑7.34) | 84.81 |
| TopoPrior+ARG | 93.02 (↑6.14) | 92.46 (↑7.71) | 93.08 (↑5.14) | 86.68 (↑6.92) | 84.83 (↑7.44) | 78.31 (↑7.10) | 83.94 (↑9.46) | 78.06 (↑8.42) | 86.30 |
| TopoPrior+AD | 91.75 (↑4.87) | 90.32 (↑5.57) | 91.79 (↑3.85) | 85.96 (↑5.00) | 83.03 (↑5.64) | 77.14 (↑5.93) | 81.95 (↑7.47) | 76.40 (↑6.76) | 84.79 |

Base model: Qwen2.5-72B-Instruct

| Model | Natural Sciences | Engineering and Technology | Social Sciences and Humanities | Vocational Qualif. and Prof. Exam. | Medicine and Life Sci. | C-Eval Hard: Math | C-Eval Hard: Chemistry | C-Eval Hard: Physics | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PE | 84.42 | 82.55 | 85.29 | 78.30 | 76.57 | 70.55 | 74.02 | 68.18 | 77.49 |
| CoT | 84.53 (↑0.11) | 82.10 (↓0.45) | 85.27 (↓0.02) | 80.19 (↑1.89) | 78.24 (↑1.67) | 71.79 (↑1.24) | 75.33 (↑1.31) | 69.85 (↑1.67) | 78.41 |
| RAG | 85.40 (↑0.98) | 84.06 (↑1.51) | 86.55 (↑1.26) | 82.12 (↑3.82) | 79.69 (↑3.12) | 73.40 (↑2.85) | 76.73 (↑2.71) | 71.31 (↑3.13) | 79.91 |
| G-Designer | 86.58 (↑2.16) | 85.21 (↑2.66) | 88.07 (↑2.78) | 84.29 (↑5.99) | 80.28 (↑3.71) | 74.99 (↑4.44) | 78.87 (↑4.85) | 73.96 (↑5.78) | 81.53 |
| AgentPrune | 86.03 (↑1.61) | 86.15 (↑3.60) | 88.22 (↑2.93) | 85.93 (↑7.63) | 80.79 (↑4.22) | 75.47 (↑4.92) | 79.34 (↑5.32) | 74.84 (↑6.66) | 82.10 |
| ARG-Designer | 87.63 (↑3.21) | 87.72 (↑5.17) | 89.16 (↑3.87) | 86.57 (↑8.27) | 81.85 (↑5.28) | 76.74 (↑6.19) | 81.41 (↑7.39) | 76.45 (↑8.27) | 82.19 |
| AgentDropout | 86.57 (↑2.15) | 85.70 (↑3.15) | 87.96 (↑2.67) | 85.52 (↑7.22) | 80.44 (↑3.87) | 75.30 (↑4.75) | 79.39 (↑5.37) | 74.68 (↑6.50) | 81.95 |
| TopoPrior+GD | 87.65 (↑3.23) | 87.10 (↑4.55) | 88.74 (↑3.45) | 86.65 (↑8.35) | 82.11 (↑5.54) | 77.16 (↑6.61) | 81.52 (↑7.50) | 76.51 (↑8.33) | 82.18 |
| TopoPrior+AP | 88.67 (↑4.25) | 86.84 (↑4.29) | 89.73 (↑4.44) | 86.92 (↑8.62) | 81.90 (↑5.33) | 77.65 (↑7.10) | 80.44 (↑6.42) | 76.69 (↑8.51) | 82.36 |
| TopoPrior+ARG | 90.16 (↑5.74) | 89.25 (↑6.70) | 91.21 (↑5.92) | 88.07 (↑9.77) | 83.32 (↑6.75) | 79.09 (↑8.54) | 83.50 (↑9.48) | 79.91 (↑11.73) | 85.56 |
| TopoPrior+AD | 88.59 (↑4.17) | 87.68 (↑5.13) | 88.64 (↑3.35) | 86.53 (↑8.23) | 82.81 (↑6.24) | 77.55 (↑7.00) | 81.36 (↑7.34) | 77.14 (↑8.96) | 83.79 |

### IV-D Main Results

Table [III](https://arxiv.org/html/2605.17359#S4.T3) and Table [IV](https://arxiv.org/html/2605.17359#S4.T4) summarize the main results on MMLU and C-Eval under three LLM backbones. Overall, dynamic topology-evolution methods outperform single-agent baselines in most evaluated settings, and equipping them with TopoPrior further improves performance in most cases. These results suggest that transferable topology priors can provide effective initialization for downstream multi-agent collaboration in multi-domain reasoning.

Among training-free methods, RAG is the strongest single-agent baseline in most settings, especially on knowledge-intensive domains such as *Law, Government, and Public Affairs*, *Ethics and Morality*, and *Social Sciences*. Training-intensive methods exhibit more mixed behavior across heterogeneous domains and are generally weaker than dynamic multi-agent approaches in our evaluation. By contrast, topology-evolution methods provide the strongest baseline results overall, which supports the value of role specialization and structured communication for multi-domain reasoning.

Built on top of these topology-evolution backbones, TopoPrior yields gains in average accuracy across benchmarks and model scales. For example, on MMLU with Llama3-8B-Instruct, TopoPrior+ARG improves the average accuracy of ARG-Designer from 65.58 to 68.53 (+2.95 points), and TopoPrior+AD improves AgentDropout from 65.72 to 67.31 (+1.59 points). On C-Eval with the same backbone, TopoPrior+ARG improves ARG-Designer from 54.25 to 57.04 (+2.79 points), while TopoPrior+AD improves AgentDropout from 53.84 to 56.01 (+2.17 points). Similar trends are observed on larger backbones, including DeepSeek-V3-671B-Instruct and Qwen2.5-72B-Instruct.

The improvements are particularly visible in reasoning-intensive and knowledge-intensive categories. On MMLU with Llama3-8B-Instruct, TopoPrior+ARG achieves the best performance in *Ethics and Morality* (64.08) and ranks among the strongest methods in several other domains, while TopoPrior+AD performs best on *Law, Government, and Public Affairs* (66.88) and *Business and Management* (69.05). On C-Eval and C-Eval Hard, the gains are also evident on challenging scientific reasoning tasks. For instance, under Llama3-8B-Instruct, TopoPrior+ARG reaches 49.33, 55.76, and 50.89 on Mathematics, Chemistry, and Physics, respectively, outperforming the corresponding backbone without topology-prior initialization.

Among the evaluated backbones, ARG-Designer appears to benefit the most from TopoPrior. This trend is observed across both benchmarks and all three LLM backbones, suggesting that autoregressive graph construction may be particularly sensitive to initialization quality. More broadly, the fact that TopoPrior improves several heterogeneous backbones, rather than only the method used to construct the reference graphs, suggests that it captures reusable collaboration regularities beyond a teacher-specific search pattern.

TABLE V: Ablation results of TopoPrior on three representative MMLU domains using ARG-Designer.

| Base model | Model | Natural Sciences | Law, Gov., and Public Affairs | Ethics and Morality |
| --- | --- | --- | --- | --- |
| Llama3-8B-Instruct | TopoPrior+ARG | 68.91 | 66.15 | 64.08 |
| Llama3-8B-Instruct | w/o $\mathcal{L}_{\mathrm{prior}}$ | 64.71 | 62.92 | 59.81 |
| Llama3-8B-Instruct | w/o $\mathcal{L}_{\mathrm{adapt}}$ | 66.14 | 64.68 | 62.50 |
| Llama3-8B-Instruct | w/o $f_{\mathrm{prior}}$ | 65.31 | 63.70 | 60.25 |
| Qwen2.5-72B-Instruct | TopoPrior+ARG | 90.87 | 86.76 | 86.01 |
| Qwen2.5-72B-Instruct | w/o $\mathcal{L}_{\mathrm{prior}}$ | 85.15 | 83.92 | 83.26 |
| Qwen2.5-72B-Instruct | w/o $\mathcal{L}_{\mathrm{adapt}}$ | 87.83 | 85.57 | 85.13 |
| Qwen2.5-72B-Instruct | w/o $f_{\mathrm{prior}}$ | 86.42 | 85.06 | 84.55 |

### IV\-EAblation Study

Table [V](https://arxiv.org/html/2605.17359#S4.T5) reports ablation results on three representative MMLU domains under two LLM backbones. Using TopoPrior+ARG as the full model, we examine the contributions of transferable topology-prior learning, latent adaptation, and the query-conditioned prior. Removing $\mathcal{L}_{\mathrm{prior}}$ causes the largest performance drop (up to 4.27 points on Llama3-8B and 5.72 points on Qwen2.5-72B), indicating that topology-prior learning is the central component of TopoPrior. Without this objective, the model is less able to capture reusable structural regularities, leading to weaker graph initialization. Removing $\mathcal{L}_{\mathrm{adapt}}$ also degrades performance on both backbones, although less severely than removing $\mathcal{L}_{\mathrm{prior}}$. This suggests that latent-space alignment improves the robustness of the learned topology prior across domains by reducing domain discrepancy while still preserving useful query-dependent structural cues. Removing the query-conditioned prior $f_{\mathrm{prior}}$ further reduces performance on both backbones. This result suggests that the learned prior remains important at inference time: without it, the generator loses an informative query-specific initialization signal and becomes less effective at constructing initial collaboration graphs for downstream refinement. Overall, the three components are complementary. The topology-prior objective contributes the largest gain, while latent adaptation and the query-conditioned prior provide additional improvements for more stable topology initialization. Their relative ordering is similar across the two backbones, suggesting that these effects are reasonably stable across model scales.
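The per-component drops quoted in this paragraph can be recomputed from the entries of Table V. The Python sketch below is only a convenience check: the dictionary layout and the helper names `ablation_drops` and `max_drop` are illustrative assumptions, not part of TopoPrior.

```python
# Accuracy (%) copied from Table V; column order is
# (Natural Sciences, Law/Gov./Public Affairs, Ethics & Morality).
TABLE_V = {
    "Llama3-8B-Instruct": {
        "full": (68.91, 66.15, 64.08),
        "w/o L_prior": (64.71, 62.92, 59.81),
        "w/o L_adapt": (66.14, 64.68, 62.50),
        "w/o f_prior": (65.31, 63.70, 60.25),
    },
    "Qwen2.5-72B-Instruct": {
        "full": (90.87, 86.76, 86.01),
        "w/o L_prior": (85.15, 83.92, 83.26),
        "w/o L_adapt": (87.83, 85.57, 85.13),
        "w/o f_prior": (86.42, 85.06, 84.55),
    },
}


def ablation_drops(rows):
    """Map each ablated variant to its per-domain drop relative to the full model."""
    full = rows["full"]
    return {
        name: tuple(round(f - v, 2) for f, v in zip(full, values))
        for name, values in rows.items()
        if name != "full"
    }


def max_drop(rows, variant):
    """Largest per-domain drop caused by removing one component."""
    return max(ablation_drops(rows)[variant])


if __name__ == "__main__":
    for model, rows in TABLE_V.items():
        for variant, drops in ablation_drops(rows).items():
            print(f"{model:22s} {variant:12s} drops={drops}")
```

Averaging the three per-domain drops for each variant gives the same ordering on both backbones in this table: removing the prior objective costs the most, followed by the query-conditioned prior, then latent adaptation.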

![Refer to caption](https://arxiv.org/html/2605.17359v1/x3.png)

Figure 3: Accuracy gain, inference-time token reduction, and communication-round reduction of TopoPrior when combined with different topology-evolution backbones on MMLU using Llama3-8B-Instruct.

TABLE VI: Comparison of average accuracy and trainable parameters on MMLU with Llama3-8B-Instruct. Relative parameter increase ($\Delta$) is computed w.r.t. the 8B backbone.

| Method | Acc. (%) | Trainable Parameters (M) | $\Delta$ (%) |
|---|---|---|---|
| MoDULA | 56.47 | 1,775 | 22.19 |
| MoDE | 60.12 | 2,376 | 29.70 |
| DES-MoE | 59.10 | 1,585 | 19.81 |
| TopoPrior+ARG | 68.53 | 3.3 + 3.8 = 7.1 | 0.09 |

### IV-F Efficiency Analysis

##### Efficiency–Performance Trade\-off\.

We assess the trade-off among performance gain, online token efficiency, and communication-round reduction by combining TopoPrior with different topology-evolution backbones on MMLU using Llama3-8B-Instruct. Following the protocol described in Section [IV-C](https://arxiv.org/html/2605.17359#S4.SS3), token consumption is measured over online inference only. We also compare against training-intensive baselines to assess adaptation quality relative to trainable parameter cost.

Figure [3](https://arxiv.org/html/2605.17359#S4.F3) shows that TopoPrior improves all evaluated topology-evolution methods while reducing online communication cost. The outer box labeled “Acc.” and the inner box labeled “Rounds” indicate changes in accuracy and communication rounds, respectively. Among the evaluated methods, ARG-Designer appears to benefit the most, achieving the largest average accuracy gain (+2.95 points) together with the largest reduction in communication rounds (6 rounds on average). This observation is consistent with the main results and suggests that autoregressive graph construction may be particularly sensitive to initialization quality. By contrast, pruning-based methods such as AgentPrune benefit less from improved initialization, possibly because their downstream refinement process retains more redundant agents and edges.

Table [VI](https://arxiv.org/html/2605.17359#S4.T6) further summarizes adaptation performance and parameter cost. TopoPrior+ARG achieves the best average accuracy (68.53%) while introducing only 3.3M additional parameters in TopoPrior itself; the total trainable parameter count of TopoPrior+ARG is 7.1M, including 3.8M parameters from the original ARG-Designer. This increase is small relative to the 8B backbone. By contrast, training-intensive baselines require substantially more trainable parameters while achieving lower average accuracy on this benchmark. These results suggest that TopoPrior provides a favorable trade-off between adaptation quality and parameter efficiency under the evaluated setting. We note, however, that these efficiency results characterize online inference savings and parameter overhead, rather than the full amortized cost including offline reference-graph construction and prior training.
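The relative parameter increase in Table VI is the ratio of trainable parameters to backbone size. The short Python sketch below reproduces the $\Delta$ column under the assumption that the 8B backbone is counted as exactly 8,000M parameters, which matches the reported values; the function name `relative_increase` is illustrative only.

```python
# Assumption: "8B" is treated as exactly 8,000 million parameters.
BACKBONE_PARAMS_M = 8_000

# Trainable parameters in millions, copied from Table VI.
TRAINABLE_PARAMS_M = {
    "MoDULA": 1_775,
    "MoDE": 2_376,
    "DES-MoE": 1_585,
    "TopoPrior+ARG": 3.3 + 3.8,  # TopoPrior itself plus the original ARG-Designer
}


def relative_increase(trainable_m, backbone_m=BACKBONE_PARAMS_M):
    """Trainable parameters as a percentage of the backbone size."""
    return round(100.0 * trainable_m / backbone_m, 2)


if __name__ == "__main__":
    for method, params in TRAINABLE_PARAMS_M.items():
        print(f"{method:14s} {params:8.1f}M  delta={relative_increase(params):.2f}%")
```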

TABLE VII: Performance of TopoPrior under weak and strong topology supervision.

| Setting | Accuracy (%) |
|---|---|
| Weak Teacher (50% convergence) | 61.24 |
| TopoPrior + Weak Teacher | 64.33 |
| Strong Teacher (full convergence) | 65.58 |
| TopoPrior + Strong Teacher | 68.53 |

TABLE VIII: Effect of different supervision sources on model performance and inference-time token efficiency. Token reduction is measured against the ARG-Designer baseline.

| Supervision Source | Accuracy (%) | Token Reduction (%) |
|---|---|---|
| ARG-Designer | 65.58 | – |
| +Full | 68.53 | 40.2 |
| +Cheap-early | 64.33 | 32.7 |
| +Static-template | 62.14 | 18.3 |
| +Random | 60.25 | 20.5 |
##### Weak Topology Supervision\.

We further examine whether TopoPrior remains effective under degraded and simplified supervision. Table [VII](https://arxiv.org/html/2605.17359#S4.T7) compares strong teacher graphs obtained from fully converged AgentDropout with weak teacher graphs obtained by stopping AgentDropout after 50% of its convergence rounds. Even under weak teacher supervision, TopoPrior reaches 64.33% accuracy, improving over the weak teacher itself by 3.09 points and remaining only 1.25 points below the strong teacher (65.58%). This result suggests that TopoPrior can still extract useful collaboration patterns from imperfect topology supervision.

We further compare alternative supervision sources in Table [VIII](https://arxiv.org/html/2605.17359#S4.T8), including fully converged graphs (Full), partially converged graphs (Cheap-early), domain-heuristic static templates (Static-template), and random graphs. All variants are used to train TopoPrior, and the resulting generator is evaluated by initializing ARG-Designer. The results show that TopoPrior remains effective even when supervision is simplified: partially converged graphs already provide clear improvements, while static templates still yield moderate gains. By contrast, random graphs do not produce competitive results, suggesting that TopoPrior learns meaningful collaboration patterns rather than merely fitting superficial graph statistics.

![Refer to caption](https://arxiv.org/html/2605.17359v1/x4.png)

Figure 4: t-SNE visualization of the encoder-induced latent space across the seven MMLU domains.

![Refer to caption](https://arxiv.org/html/2605.17359v1/x5.png)

Figure 5: Out-of-domain generalization on unseen domains. “GD”, “AP”, “ARG”, and “AD” denote G-Designer, AgentPrune, ARG-Designer, and AgentDropout, respectively.

### IV-G Representation and Generalization Analysis

##### Multi\-Domain Latent Structure\.

To assess whether TopoPrior learns latent representations that are both reusable and domain-sensitive, we visualize the latent variable $z$ for all seven major MMLU domains. For each domain, we sample 400 examples, encode them into the latent space using the learned encoder, and project the resulting 128-dimensional representations to two dimensions using PCA followed by t-SNE \[[49](https://arxiv.org/html/2605.17359#bib.bib90)\]. Figure [4](https://arxiv.org/html/2605.17359#S4.F4) shows that the seven domains form several coherent clusters in the latent space. In particular, *Natural Sciences* and *Engineering & Technology* exhibit partial overlap, likely due to shared quantitative and formal reasoning patterns, while *Social Sciences* and *Business & Management* also overlap to some extent because of related economic and decision-oriented concepts. Other domains remain relatively distinct. Although this visualization is qualitative, it is consistent with the intended behavior of TopoPrior: the latent space captures reusable structural regularities while retaining domain-related variation that may be useful for topology initialization.

##### Out\-of\-Domain Generalization\.

To evaluate generalization beyond the training domains, we conduct experiments on two unseen domains that are not included in MMLU, namely Art \[[13](https://arxiv.org/html/2605.17359#bib.bib16)\] and Military \[[60](https://arxiv.org/html/2605.17359#bib.bib17)\], using Llama3-8B-Instruct. For a new query $q_{\mathrm{new}}$, we obtain a latent representation $z_{\mathrm{new}} \sim p'_{\theta}(z \mid q_{\mathrm{new}})$ from the learned prior network and compute its cosine similarity to the centroids of the seven in-domain latent clusters identified in Fig. [4](https://arxiv.org/html/2605.17359#S4.F4). Figure [5](https://arxiv.org/html/2605.17359#S4.F5) shows that the learned prior maps unseen queries to semantically plausible regions of the latent space by associating them with related in-domain clusters, such as placing Art closer to *Humanities & History*. This behavior suggests that the topology initializer can produce plausible collaboration graphs even for unseen domains, which is consistent with the observed gains in out-of-domain performance.
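The centroid comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper names are hypothetical, and the 3-dimensional toy vectors stand in for the 128-dimensional latents that the learned encoder and prior network would produce.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_domains(z_new, domain_latents):
    """Rank in-domain clusters by cosine similarity between z_new and each centroid."""
    scores = {
        name: cosine(z_new, centroid(latents))
        for name, latents in domain_latents.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy latents (hypothetical values) for two in-domain clusters.
DOMAIN_LATENTS = {
    "Humanities & History": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "Natural Sciences": [[0.0, 1.0, 0.2], [0.1, 0.9, 0.1]],
}

if __name__ == "__main__":
    z_art = [0.8, 0.3, 0.0]  # stand-in for a latent sampled for an unseen Art query
    for name, score in rank_domains(z_art, DOMAIN_LATENTS):
        print(f"{name:22s} {score:.3f}")
```

In this toy setup the Art-like latent ranks *Humanities & History* first, mirroring the qualitative behavior reported for Figure 5.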

TABLE IX: Low-resource training results on MMLU for TopoPrior + ARG-Designer in terms of accuracy (%). Column groups give the fraction of the original training data; “No” and “Yes” indicate whether TopoPrior is used, and $\Delta$ is the resulting gain.

| Domain | 5%: No | 5%: Yes | 5%: $\Delta$ | 10%: No | 10%: Yes | 10%: $\Delta$ | 20%: No | 20%: Yes | 20%: $\Delta$ |
|---|---|---|---|---|---|---|---|---|---|
| Natural Science | 60.34 | 61.95 | +1.61 | 61.89 | 62.75 | +0.86 | 62.64 | 64.96 | +2.32 |
| Engineering & Technology | 63.46 | 64.80 | +1.34 | 64.72 | 66.05 | +1.33 | 65.94 | 67.58 | +1.64 |
| Social Sciences | 58.25 | 60.86 | +2.61 | 60.13 | 63.31 | +3.18 | 61.08 | 65.42 | +4.34 |
| Humanities & History | 62.88 | 63.92 | +1.04 | 64.71 | 65.83 | +1.12 | 66.15 | 67.89 | +1.74 |
| Law, Government, Public Affairs | 55.13 | 57.64 | +2.51 | 58.55 | 60.07 | +1.52 | 60.41 | 63.32 | +2.91 |
| Ethics & Morality | 53.66 | 56.24 | +2.58 | 56.85 | 58.70 | +1.85 | 59.19 | 60.61 | +1.42 |
| Business & Management | 59.40 | 61.88 | +2.48 | 61.23 | 63.65 | +2.42 | 62.90 | 65.77 | +2.87 |
| Average | 59.02 | 61.04 | +2.02 | 61.15 | 62.91 | +1.76 | 62.62 | 65.08 | +2.46 |

### IV-H Low-Resource Learning Analysis

To evaluate the robustness of TopoPrior in data-scarce domains, we conduct a low-resource analysis to assess whether TopoPrior can effectively leverage limited domain data. We compare the results of ARG-Designer with and without TopoPrior under several low-resource regimes on MMLU. Specifically, we subsample the original training data to 5%, 10%, and 20% of the full size in each domain. The TopoPrior generator is trained using the same hyperparameters as in the full-data setting and subsequently evaluated on the corresponding test splits.
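One way to build such per-domain subsets is sketched below in Python. The text does not specify the sampling procedure, so uniform sampling without replacement, the fixed seed, and the function name `subsample_per_domain` are all assumptions made for illustration.

```python
import random


def subsample_per_domain(examples_by_domain, fraction, seed=0):
    """Keep a fixed fraction of the training examples in every domain.

    Assumption: uniform sampling without replacement with a fixed seed,
    and at least one example kept per non-empty domain.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    subset = {}
    for domain in sorted(examples_by_domain):  # sorted for reproducible order
        examples = list(examples_by_domain[domain])
        keep = max(1, round(len(examples) * fraction)) if examples else 0
        subset[domain] = rng.sample(examples, keep)
    return subset


if __name__ == "__main__":
    # Toy data: 200 placeholder example ids per domain.
    data = {
        "Natural Science": [f"ns-{i}" for i in range(200)],
        "Ethics & Morality": [f"em-{i}" for i in range(200)],
    }
    for fraction in (0.05, 0.10, 0.20):
        sizes = {d: len(v) for d, v in subsample_per_domain(data, fraction).items()}
        print(fraction, sizes)
```

Sampling within each domain separately keeps the domain proportions of the full training set, which matches the "in each domain" wording of the protocol.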

Table [IX](https://arxiv.org/html/2605.17359#S4.T9) shows that TopoPrior improves the results of ARG-Designer across all data fractions, suggesting that it can transfer useful topological priors even under limited supervision. At the lowest data setting (5%), incorporating TopoPrior yields particularly clear gains (e.g., +1.04 points in “Humanities & History” and +2.58 points in “Ethics & Morality”), indicating that the learned prior can provide structural guidance when labeled examples are scarce. Despite fluctuations in absolute performance, TopoPrior maintains a positive improvement margin across all training-data sizes. These results suggest that the learned latent space remains useful in low-resource settings.

### IV-I Fixed Agent Pool Domain Generalization

We next discuss why a fixed role pool may still generalize to niche and out\-of\-distribution \(OOD\) domains\. The key intuition is that semantic similarity between queries and roles, together with the compositional flexibility of collaboration graphs, can enable role reuse beyond the domains explicitly represented in the training partition\. We support this perspective with both conceptual discussion and empirical evidence\.

As shown in Table [II](https://arxiv.org/html/2605.17359#S4.T2), the TopoPrior role pool spans broad domain categories (e.g., Natural Sciences, Engineering, and Social Sciences), each comprising multiple subdomains. Niche domains can often be associated with these broader categories. For example, a genetics query falls under Natural Sciences/Biology and can be addressed by both the “Natural Science Expert” and “Medical Life Scientist” roles. Crucially, TopoPrior does not assign roles statically; rather, it composes multiple roles and their interactions according to the content and context of each query. For a genetics question, the generated graph may include (1) a Natural Science Expert for core biological knowledge, (2) a Medical Life Scientist when the question involves genetic disorders, (3) a Chemistry Specialist for molecular genetics topics such as DNA structure, and (4) a General Coordinator to orchestrate collaboration. Hence, even without an explicit “Genetics Expert” role, a combination of existing roles may still address the query effectively. The learned initializer adapts role composition to the query context, enabling a degree of generalization through semantic coverage.

TABLE X: Performance on niche domains.

| Model | Astro-QA Acc. (%) | Astro-QA Oracle Gap | TREC Genomics Acc. (%) | TREC Genomics Oracle Gap |
|---|---|---|---|---|
| ARG-Designer | 71.4 | -3.7 | 61.8 | -3.3 |
| AutoGen | 68.2 | -5.9 | 60.6 | -4.5 |
| Oracle | 74.1 | – | 65.1 | – |
| TopoPrior+ARG | 73.5 | -0.6 | 64.3 | -0.8 |

#### IV-I1 Domain Validation of Generalization

To further examine whether a fixed role pool can remain useful in niche domains, we construct two niche-domain test sets: Astro-QA \[[30](https://arxiv.org/html/2605.17359#bib.bib7)\] for astrophysics and TREC Genomics \[[16](https://arxiv.org/html/2605.17359#bib.bib6)\] for genetics. These domains are not explicitly represented in the original role pool. Specifically, astrophysics can be viewed as a subdomain of Natural Sciences/Physics, while genetics falls under Natural Sciences/Biology/Medicine. We compare three configurations: (1) ARG-Designer and AutoGen \[[52](https://arxiv.org/html/2605.17359#bib.bib82)\], which dynamically generate roles from scratch; (2) TopoPrior+ARG, which uses the fixed role pool; and (3) Oracle, in which a dedicated domain-specific role is added to the pool as a reference upper bound. Table [X](https://arxiv.org/html/2605.17359#S4.T10) shows that TopoPrior+ARG performs within 1 percentage point of the Oracle (gaps of 0.6 and 0.8 points), suggesting that semantic role composition can provide useful coverage for niche domains. By contrast, the dynamic role-generation baselines (ARG-Designer and AutoGen) perform worse in this setting, suggesting that reusing semantically related roles from a fixed pool may be more effective than generating roles from scratch.

TABLE XI: Token cost analysis for training and inference.

| Metric | Value |
|---|---|
| Avg. tokens per training graph (ARG) | $\sim 1{,}200$ |
| Total training tokens ($\sim 100$k samples) | $\sim 120$M |
| Avg. inference tokens per test query (ARG alone) | $\sim 800$ |
| Avg. inference tokens per test query (TopoPrior+ARG) | $\sim 478$ |
| Token savings per test query | 322 |
| Total inference savings (MMLU test set) | 4.83M |
| Break-even test queries needed | 120M / 322 $\approx 372{,}671$ |

![Refer to caption](https://arxiv.org/html/2605.17359v1/x6.png)

Figure 6: Hyperparameter analysis of the loss coefficients $\alpha$, $\beta$, and edge generation threshold $\delta_{e}$. “NS”, “ET”, “SS”, “HH”, “LGP”, “EM” and “BM” are abbreviations for the domains in MMLU and C-Eval.

### IV-J Analysis of End-to-End Token Cost

The one-time cost of generating reference collaboration graphs for training TopoPrior may be amortized by token savings during inference, which can make the overall pipeline cost-effective at sufficiently large deployment scales. Reference graphs are generated only once during offline training. After training, the TopoPrior generator can produce initial collaboration topologies for any number of test queries without re-running the expensive topology-evolution algorithm. For any domain, the supervision cost scales with the training set size, whereas inference benefits from a 40.2% token reduction (see Section [IV-F](https://arxiv.org/html/2605.17359#S4.SS6)) per test query. Token savings therefore accumulate as more test queries are processed and may eventually offset the initial offline cost.

We measure the total training token cost for graph generation on the MMLU training set (seven domains, $\sim$100k samples) using ARG-Designer, and compute total inference savings using TopoPrior+ARG on the MMLU test set (15k samples), extrapolating to larger test sets. As shown in Table [XI](https://arxiv.org/html/2605.17359#S4.T11), the MMLU test set alone does not recoup the training cost (4.83M saved tokens versus 120M tokens invested). However, at larger deployment scales, the break-even point is reached after approximately $3.7 \times 10^{5}$ queries (120M / 322 $\approx$ 372,671). Once TopoPrior is trained, it can be reused across domains and datasets without additional graph generation, which further improves amortization in repeated-use scenarios.
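The amortization argument reduces to a single division, sketched below in Python with the rounded figures from Table XI. Because those entries are approximate, the result should be read as an order-of-magnitude estimate (roughly $3.7 \times 10^{5}$ queries); the function and constant names are illustrative.

```python
import math


def break_even_queries(training_tokens, tokens_baseline, tokens_with_prior):
    """Smallest number of test queries whose savings cover the offline cost."""
    saving_per_query = tokens_baseline - tokens_with_prior
    if saving_per_query <= 0:
        raise ValueError("the prior must reduce per-query token usage")
    return math.ceil(training_tokens / saving_per_query)


# Rounded values from Table XI.
TRAINING_TOKENS = 100_000 * 1_200  # ~100k reference graphs at ~1,200 tokens each
TOKENS_ARG_ALONE = 800             # avg. inference tokens per query, ARG alone
TOKENS_WITH_TOPOPRIOR = 478        # avg. inference tokens per query, TopoPrior+ARG
MMLU_TEST_QUERIES = 15_000

if __name__ == "__main__":
    saving = TOKENS_ARG_ALONE - TOKENS_WITH_TOPOPRIOR
    print("saving per query:", saving)
    print("relative reduction:", saving / TOKENS_ARG_ALONE)
    print("MMLU test-set savings:", saving * MMLU_TEST_QUERIES)
    print("break-even queries:",
          break_even_queries(TRAINING_TOKENS, TOKENS_ARG_ALONE, TOKENS_WITH_TOPOPRIOR))
```

The same function can be reused with a cheaper supervision source, for example a smaller `training_tokens` budget, to see how quickly the break-even point moves.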

### IV-K Hyperparameter Analysis

We analyze key hyperparameters, including the loss coefficients $\alpha$ and $\beta$, as well as the edge-generation threshold $\delta_{e}$, across multiple dataset subdomains using Llama3-8B-Instruct. Figure [6](https://arxiv.org/html/2605.17359#S4.F6) reports the average performance over these settings. The best hyperparameter combination is used for the main experimental results.

## V Conclusion

In this paper, we presented TopoPrior, a framework for transferable topology prior learning for multi-agent LLM collaboration in multi-domain settings. Rather than constructing collaboration graphs from scratch for each task, TopoPrior learns reusable, query-conditioned topology priors from reference collaboration graphs and uses them to initialize downstream topology-evolution backbones. Experiments on MMLU, C-Eval, and additional out-of-domain settings show that TopoPrior improves downstream reasoning performance in the evaluated settings while reducing online communication cost, token usage, and refinement rounds. These results suggest that transferable topology-prior learning is a practical direction for improving the efficiency and adaptability of multi-agent LLM systems.

Several directions remain for future work. One is to reduce reliance on expensive offline supervision by using cheaper or more weakly supervised reference graphs. Another is to extend the current predefined role pool toward more open-ended role discovery and role composition. It is also promising to investigate tighter integration between topology-prior learning and downstream topology refinement in interactive reasoning settings. From a broader-impact perspective, TopoPrior may improve the efficiency and accessibility of multi-agent LLM systems by reducing communication overhead and improving collaboration quality. At the same time, such systems should be deployed with appropriate evaluation, transparency, and human oversight, especially in high-stakes applications.

## References

- [1] A. F. Agarap (2018) Deep learning using rectified linear units (relu). CoRR abs/1803.08375.
- [2] R. Angell and A. McCallum (2024) Fast, scalable, warm-start semidefinite programming with spectral bundling and sketching. In Proc. Int. Conf. Mach. Learn., pp. 1579–1615.
- [3] Y. Cao, S. Han, Z. Gao, Z. Ding, X. Xie, and S. K. Zhou (2025) GraphInsight: unlocking insights in large language models for graph structure understanding. In Proc. Annu. Meeting Assoc. Comput. Linguistics, pp. 12096–12134.
- [4] H. Chen, X. Han, Z. Wu, and Y. Jiang (2023) Multi-prompt alignment for multi-source unsupervised domain adaptation. In Proc. Adv. Neural Inf. Process. Syst.
- [5] Q. Chen, L. Qin, J. Zhang, Z. Chen, X. Xu, and W. Che (2024) M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. In Proc. Annu. Meeting Assoc. Comput. Linguistics, pp. 8199–8221.
- [6] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555.
- [7] DeepSeek-AI (2025) DeepSeek-v3 technical report. External Links: 2412.19437.
- [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. Annu. Conf. North Am. Chapter Assoc. Comput. Linguist., pp. 4171–4186.
- [9] G. Dey and Y. K. Lal (2025) On the transferability of causal knowledge for language models. In Proc. Annu. Meeting Assoc. Comput. Linguistics, pp. 8–14.
- [10] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch (2024) Improving factuality and reasoning in language models through multiagent debate. In Proc. Int. Conf. Mach. Learn.
- [11] D. Dua, E. Strubell, S. Singh, and P. Verga (2023) To adapt or to annotate: challenges and interventions for domain adaptation in open-domain question answering. In Proc. Annu. Meeting Assoc. Comput. Linguistics, pp. 14429–14446.
- [12] Y. Ganin and V. S. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In Proc. Int. Conf. Mach. Learn., pp. 1180–1189.
- [13] N. Garcia and G. Vogiatzis (2018) How to read paintings: semantic art understanding with multi-modal retrieval. In Proc. Eur. Conf. Comput. Vis., pp. 676–691.
- [14] D. Goswami, R. Schuster, J. van de Weijer, and D. Stricker (2023) Attribution-aware weight transfer: A warm-start initialization for class-incremental semantic segmentation. In Proc. Winter Conf. Appl. Comput. Vis., pp. 3194–3203.
- [15] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021) Measuring massive multitask language understanding. In Proc. Int. Conf. Learn. Represent.
- [16] W. R. Hersh, A. M. Cohen, J. Yang, R. T. Bhupatiraju, P. M. Roberts, and M. A. Hearst (2007) TREC 2007 genomics track overview. In Text Retrieval Conference.
- [17] S. Holt, M. R. Luyten, and M. van der Schaar (2023) L2MAC: large language model automatic computer for unbounded code generation. CoRR abs/2310.02003.
- [18] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber (2024) MetaGPT: meta programming for A multi-agent collaborative framework. In Proc. Int. Conf. Learn. Represent.
- [19] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In Proc. Int. Conf. Learn. Represent.
- [20] T. Hu, P. Zhang, B. Yang, J. Xie, D. F. Wong, and R. Wang (2024) Large language model for multi-domain translation: benchmarking and domain cot fine-tuning. In Proc. Conf. Empir. Methods Nat. Lang. Process., pp. 5726–5746.
- [21] Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, Y. Fu, M. Sun, and J. He (2023) C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. In Proc. Adv. Neural Inf. Process. Syst.
- [22] Y. Ishibashi and Y. Nishimura (2024) Self-organized agents: A LLM multi-agent framework toward ultra large-scale code generation and optimization. CoRR abs/2404.02183.
- [23] D. Jiang, X. Ren, and B. Y. Lin (2023) LLM-blender: ensembling large language models with pairwise ranking and generative fusion. In Proc. Annu. Meeting Assoc. Comput. Linguistics, pp. 14165–14178.
- [24] T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. CoRR abs/1611.07308.
- [25] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In Proc. Int. Conf. Learn. Represent.
- [26] W. Kwan, X. Zeng, Y. Wang, Y. Sun, L. Li, Y. Jiang, L. Shang, Q. Liu, and K. Wong (2024) M4LE: A multi-ability multi-range multi-task multi-domain long-context evaluation benchmark for large language models. In Proc. Annu. Meeting Assoc. Comput. Linguistics, pp. 15568–15592.
- [27] H. Lee, K. C. Li, M. Grabmair, and S. Xu (2025) Efficient prompt optimisation for legal text classification with proxy prompt evaluator. CoRR abs/2510.08524.
- [28] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proc. Adv. Neural Inf. Process. Syst.
- [29] H. Li, A. Wang, K. Li, Z. Wang, L. Zhang, D. Qiu, Q. Liu, and J. Su (2025-11) A multi-agent framework with automated decision rule optimization for cross-domain misinformation detection. In Proc. Conf. Empir. Methods Nat. Lang. Process.
- [30] J. Li, F. Zhao, P. Chen, J. Xie, X. Zhang, H. Li, M. Chen, Y. Wang, and M. Zhu (2025) An astronomical question answering dataset for evaluating large language models. Nature Scientific Data.
- [31] J. Li, Z. Yu, Z. Du, L. Zhu, and H. T. Shen (2024) A comprehensive survey on source-free domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 46(8), pp. 5743–5762.
- [32] J. Li, B. Wang, X. Zhou, and X. Hu (2025) Dynamic expert specialization: towards catastrophic forgetting-free multi-domain moe adaptation. CoRR abs/2509.16882. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2509.16882), 2509.16882.
- [33] S. Li, Y. Liu, Q. Wen, C. Zhang, and S. Pan (2026) Assemble your crew: automatic multi-agent communication topology design via autoregressive graph generation. In Proc. Assoc. for the Advan. of Arti. Intell., pp. 23142–23150.
- [34] X. Liang, L. Yang, J. Wang, Y. Lu, R. Wu, H. Chen, and J. Hao (2025) Boosting multi-domain fine-tuning of large language models through evolving interactions between samples. In Proc. Int. Conf. Mach. Learn.
- [35] Z. Ling, D. Chen, L. Yao, Y. Li, and Y. Shen (2025) Diversity as a reward: fine-tuning llms on a mixture of domain-undetermined data. CoRR abs/2502.04380.
- [36] Z. Liu, Y. Zhang, P. Li, Y. Liu, and D. Yang (2024) A dynamic llm-powered agent network for task-oriented agent collaboration. CoRR abs/2310.02170.
- [37] Z. Liu, S. Huang, T. Guo, M. Hou, and Q. Liang (2025-02) A prompt-driven framework for multi-domain knowledge tracing. Mach. Learn. 114(4). External Links: ISSN 0885-6125, [Document](https://dx.doi.org/10.1007/s10994-024-06660-6).
- [38] Y. Ma, Z. Liang, H. Dai, B. Chen, D. Gao, Z. Ran, Z. Wang, L. Jin, W. Jiang, G. Zhang, X. Cai, and L. Yang (2024) MoDULA: mixture of domain-specific and universal lora for multi-task learning. In Proc. Conf. Empir. Methods Nat. Lang. Process., pp. 2758–2770.
- [39] N. Nazyrova, S. Chahed, T. Chausalet, and M. Dwek (2024) Leveraging large language models for medical text classification: a hospital readmission prediction case. In ICPRS, pp. 1–7. External Links: [Document](https://dx.doi.org/10.1109/ICPRS62101.2024.10677826).
- [40] Y. E. Nesterov (2004) Introductory lectures on convex optimization - A basic course. Applied Optimization, Vol. 87, Springer. External Links: [Document](https://dx.doi.org/10.1007/978-1-4419-8853-9), ISBN 978-1-4613-4691-3.
- [41] M. Nghiem, P. Baylis, A. Freitas, and S. Ananiadou (2022) Text classification and prediction in the legal domain. In LREC, pp. 4717–4722.
- [42] H. Nguyen, Y. Liu, C. Zhang, T. Zhang, and P. S. Yu (2023) CoF-cot: enhancing large language models with coarse-to-fine chain-of-thought prompting for multi-domain NLU tasks. In Proc. Conf. Empir. Methods Nat. Lang. Process., pp. 12109–12119.
- [43] N. Nikolaidis, N. Stefanovitch, P. Silvano, D. I. Dimitrov, R. Yangarber, N. Guimarães, E. Sartori, I. Androutsopoulos, P. Nakov, G. D. S. Martino, and J. Piskorski (2025) PolyNarrative: A multilingual, multilabel, multi-domain dataset for narrative extraction from news articles. In Proc. Annu. Meeting Assoc. Comput. Linguistics, pp. 31323–31345.
- [44] C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, C. Yang, W. Chen, Y. Su, X. Cong, J. Xu, D. Li, Z. Liu, and M. Sun (2024) ChatDev: communicative agents for software development. In Proc. Annu. Meeting Assoc. Comput. Linguistics, pp. 15174–15186.
- [45] P. Schafhalter, S. Liao, Y. Zhou, C. Yeh, A. Kandoor, and J. Laudon (2024) Scalable multi-domain adaptation of language models using modular experts. CoRR abs/2410.10181.
- [46] T. Scialom, T. Chakrabarty, and S. Muresan (2022) Fine-tuned language models are continual learners. In Proc. Conf. Empir. Methods Nat. Lang. Process., pp. 6107–6122.
- [47] A. Sicilia, K. Atwell, M. Alikhani, and S. J. Hwang (2022) PAC-bayesian domain adaptation bounds for multiclass learners. In Proc. Conf. Uncertainty Artif. Intell., pp. 1824–1834.
- [48] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample (2023) LLaMA: open and efficient foundation language models.
- \[49\]L\. van der Maaten and G\. Hinton\(2008\)Visualizing data using t\-sne\.Journal of Machine Learning Research9\(86\),pp\. 2579–2605\.Cited by:[§IV\-G](https://arxiv.org/html/2605.17359#S4.SS7.SSS0.Px1.p1.1)\.
- \[50\]Z\. Wang, Y\. Wang, X\. Liu, L\. Ding, M\. Zhang, J\. Liu, and M\. Zhang\(2025\)AgentDropout: dynamic agent elimination for token\-efficient and high\-performance llm\-based multi\-agent collaboration\.InProc\. Annu\. Meeting Assoc\. Comput\. Linguistics,pp\. 24013–24035\.Cited by:[§II\-B](https://arxiv.org/html/2605.17359#S2.SS2.p2.1),[§IV\-B](https://arxiv.org/html/2605.17359#S4.SS2.p3.1),[§IV\-C](https://arxiv.org/html/2605.17359#S4.SS3.p4.3),[§IV\-C](https://arxiv.org/html/2605.17359#S4.SS3.p5.2),[footnote 1](https://arxiv.org/html/2605.17359#footnote1)\.
- \[51\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InProc\. Adv\. Neural Inf\. Process\. Syst\.,Cited by:[§IV\-B](https://arxiv.org/html/2605.17359#S4.SS2.p1.1)\.
- \[52\]Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu, A\. H\. Awadallah, R\. W\. White, D\. Burger, and C\. Wang\(2024\)AutoGen: enabling next\-gen LLM applications via multi\-agent conversations\.InProc\. Conf\. Lang\. Model\.,Cited by:[§II\-B](https://arxiv.org/html/2605.17359#S2.SS2.p1.1),[§IV\-I1](https://arxiv.org/html/2605.17359#S4.SS9.SSS1.p1.1)\.
- \[53\]A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu\(2024\)Qwen2\.5 technical report\.CoRRabs/2412\.15115\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2412.15115)Cited by:[§IV\-B](https://arxiv.org/html/2605.17359#S4.SS2.p1.1),[§IV\-C](https://arxiv.org/html/2605.17359#S4.SS3.p2.2)\.
- \[54\]G\. Zhang, Y\. Yue, Z\. Li, S\. Yun, G\. Wan, K\. Wang, D\. Cheng, J\. X\. Yu, and T\. Chen\(2025\)Cut the crap: an economical communication pipeline for llm\-based multi\-agent systems\.InProc\. Int\. Conf\. Learn\. Repre\.,Cited by:[§I](https://arxiv.org/html/2605.17359#S1.p2.1),[§II\-B](https://arxiv.org/html/2605.17359#S2.SS2.p2.1),[§IV\-B](https://arxiv.org/html/2605.17359#S4.SS2.p3.1),[§IV\-C](https://arxiv.org/html/2605.17359#S4.SS3.p5.2)\.
- \[55\]G\. Zhang, Y\. Yue, X\. Sun, M\. Yu, K\. Wang, T\. Chen, and D\. Cheng\(2025\)G\-designer: architecting multi\-agent communication topologies via graph neural networks\.InProc\. Int\. Conf\. Mach\. Learn\.,Cited by:[§I](https://arxiv.org/html/2605.17359#S1.p2.1),[§II\-B](https://arxiv.org/html/2605.17359#S2.SS2.p2.1),[§IV\-B](https://arxiv.org/html/2605.17359#S4.SS2.p3.1)\.
- \[56\]J\. Zhang, X\. Xu, N\. Zhang, R\. Liu, B\. Hooi, and S\. Deng\(2024\)Exploring collaboration mechanisms for LLM agents: A social psychology view\.InProc\. Annu\. Meeting Assoc\. Comput\. Linguistics,pp\. 14544–14607\.Cited by:[§II\-B](https://arxiv.org/html/2605.17359#S2.SS2.p1.1)\.
- \[57\]Q\. Zhang, D\. Wang, H\. Qian, Y\. Li, T\. Zhang, M\. Huang, K\. Xu, H\. Li, L\. Yan, and H\. Qiu\(2025\)Understanding the dark side of llms’ intrinsic self\-correction\.InProc\. Annu\. Meeting Assoc\. Comput\. Linguistics,pp\. 27066–27101\.Cited by:[§I](https://arxiv.org/html/2605.17359#S1.p2.1)\.
- \[58\]H\. Zhao, R\. T\. des Combes, K\. Zhang, and G\. J\. Gordon\(2019\)On learning invariant representations for domain adaptation\.InProc\. Int\. Conf\. Mach\. Learn\.,pp\. 7523–7532\.Cited by:[§III\-E1](https://arxiv.org/html/2605.17359#S3.SS5.SSS1.p3.1),[§III\-E1](https://arxiv.org/html/2605.17359#S3.SS5.SSS1.p4.1)\.
- \[59\]Z\. Zhou, B\. Hu, P\. Zhang, C\. Zhao, and B\. Liu\(2023\)Large language model is a good policy teacher for training reinforcement learning agents\.CoRRabs/2311\.13373\.Cited by:[§II\-B](https://arxiv.org/html/2605.17359#S2.SS2.p1.1)\.
- \[60\]M\. Zhu, Z\. Xu, K\. Zeng, K\. Xiao, M\. Wang, W\. Ke, and H\. Huang\(2024\)CMNEE: A large\-scale document\-level event extraction dataset based on open\-source chinese military news\.InProc\. Int\. Conf\. Comput\. Linguist\.,pp\. 3367–3379\.Cited by:[§IV\-G](https://arxiv.org/html/2605.17359#S4.SS7.SSS0.Px2.p1.2)\.
- \[61\]M\. Zhuge, W\. Wang, L\. Kirsch, F\. Faccio, D\. Khizbullin, and J\. Schmidhuber\(2024\)GPTSwarm: language agents as optimizable graphs\.InProc\. Int\. Conf\. Mach\. Learn\.,Cited by:[§II\-B](https://arxiv.org/html/2605.17359#S2.SS2.p2.1)\.

## Appendix A Proofs of Theoretical Results

In this appendix, we provide proofs and supporting statements for the analytical results in Section [III-E](https://arxiv.org/html/2605.17359#S3.SS5). The analysis formalizes two core intuitions underlying TopoPrior: (i) latent-space alignment can support cross-domain generalization, and (ii) improved topology initialization can reduce the number of refinement rounds and the associated token cost.

Let $\mathcal{G}$ denote the space of directed labeled graphs corresponding to agent collaboration topologies. For any query $q$, domain label $d$, and ground-truth answer $y$, let $p^{*}(\cdot \mid q, d)$ denote an unknown high-utility distribution over collaboration graphs for that query and domain. Our learned generator is denoted by $p_{\theta}(\cdot \mid q)$. When executed with a graph $G$, the downstream multi-agent system produces a random output $\hat{y}$ with distribution $\pi_{\mathrm{MAS}}(\cdot \mid q, G)$. The task loss $\ell(\hat{y}, y) \in [0,1]$ measures the discrepancy between the system output and the ground truth.

We define the *accuracy–token utility* as

$$J_{\lambda}(G;q,y) = 1 - \mathbb{E}_{\hat{y}\sim\pi_{\mathrm{MAS}}(\cdot\mid q,G)}\bigl[\ell(\hat{y},y)\bigr] - \lambda\,\widetilde{C}(G;q),$$

where $\widetilde{C}(G;q) = C(G;q)/C_{\max}$ is the normalized token cost and $\lambda \geq 0$ is a trade-off coefficient between correctness and communication cost. Since $\ell(\hat{y},y) \in [0,1]$ and $0 \leq \widetilde{C}(G;q) \leq 1$ (taking $C_{\max}$ to be an upper bound on the token cost), the utility $J_{\lambda}$ takes values in $[-\lambda, 1]$.
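As a rough illustration of the utility defined above, the following sketch estimates $J_{\lambda}$ from a handful of sampled losses. The function name `accuracy_token_utility`, its arguments, and the Monte Carlo averaging over sampled outputs are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from typing import Sequence


def accuracy_token_utility(
    losses: Sequence[float],  # sampled task losses l(y_hat, y), each in [0, 1]
    token_cost: float,        # raw token cost C(G; q)
    c_max: float,             # normalizer C_max (assumed to upper-bound the cost)
    lam: float,               # trade-off coefficient lambda >= 0
) -> float:
    """Estimate J_lambda = 1 - E[loss] - lambda * C / C_max (hypothetical helper)."""
    if not losses:
        raise ValueError("need at least one sampled loss")
    if c_max <= 0 or lam < 0:
        raise ValueError("c_max must be positive and lam non-negative")
    if any(not 0.0 <= loss <= 1.0 for loss in losses):
        raise ValueError("losses must lie in [0, 1]")
    expected_loss = sum(losses) / len(losses)  # Monte Carlo estimate of E[loss]
    normalized_cost = token_cost / c_max       # normalized cost C~(G; q)
    return 1.0 - expected_loss - lam * normalized_cost


# Illustrative numbers only: one wrong answer out of four samples,
# 1200 tokens spent against a 4000-token normalizer, lambda = 0.5.
utility = accuracy_token_utility([0.0, 1.0, 0.0, 0.0], 1200.0, 4000.0, 0.5)
print(f"{utility:.3f}")
```

With a perfect answer and zero cost the estimate reaches the upper end of the range, and with maximal loss and cost it reaches $-\lambda$.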

### A-A Proofs for Section [III-E1](https://arxiv.org/html/2605.17359#S3.SS5.SSS1): Cross-Domain Transfer via Latent Alignment

Let $Z$ be the latent space induced by the variational encoder. For each source domain $k \in \{1,\dots,K\}$, define the marginal latent distribution

$$P_{k}(z) = \mathbb{E}_{(q,y)\sim\mathcal{D}_{k}}\;\mathbb{E}_{G^{*}\sim p^{*}(\cdot\mid q,k)}\bigl[q_{\phi}(z\mid q,G^{*})\bigr],$$

where $q_{\phi}(z\mid q,G)$ denotes the encoder distribution. For a target domain $t$, we analogously define $P_{t}(z)$.

We consider a hypothesis class $\mathcal{H}$ of functions $h: Z \to \mathcal{Y}$. For a hypothesis $h$, the expected error on domain $d$ is

$$\epsilon_{d}(h) = \mathbb{E}_{z\sim P_{d}}\bigl[\ell_{h}(z)\bigr],$$

where $\ell_{h}$ is a bounded loss.

To quantify discrepancy between latent distributions, we use the $\mathcal{H}\Delta\mathcal{H}$-divergence.

###### Definition 3 ($\mathcal{H}\Delta\mathcal{H}$-divergence).

For two distributions $P$ and $Q$ on $Z$,

$$d_{\mathcal{H}\Delta\mathcal{H}}(P,Q) = 2\sup_{h,h'\in\mathcal{H}}\Bigl|\Pr_{z\sim P}[h(z)\neq h'(z)] - \Pr_{z\sim Q}[h(z)\neq h'(z)]\Bigr|.$$

This divergence measures the largest change in pairwise hypothesis disagreement between the two distributions\.
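The definition can be made concrete with a plug-in estimate over a small, finite hypothesis class. The sketch below is only illustrative: the one-dimensional latents, the threshold classifiers, and the helper names (`disagreement_rate`, `empirical_hdh_divergence`, `threshold_class`) are all assumptions, and a finite-sample maximum over a finite class is not the supremum over $\mathcal{H}$ used in the analysis.

```python
from itertools import combinations
from typing import Callable, List, Sequence

Hypothesis = Callable[[float], int]


def disagreement_rate(h: Hypothesis, h_prime: Hypothesis, sample: Sequence[float]) -> float:
    """Fraction of sample points on which the two hypotheses disagree."""
    return sum(h(z) != h_prime(z) for z in sample) / len(sample)


def empirical_hdh_divergence(
    hypotheses: Sequence[Hypothesis],
    sample_p: Sequence[float],
    sample_q: Sequence[float],
) -> float:
    """Plug-in estimate: 2 * max over pairs of |disagreement on P - disagreement on Q|."""
    if not sample_p or not sample_q:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Pairs with h == h' contribute zero, so distinct pairs are enough.
    for h, h_prime in combinations(hypotheses, 2):
        gap = abs(
            disagreement_rate(h, h_prime, sample_p)
            - disagreement_rate(h, h_prime, sample_q)
        )
        largest_gap = max(largest_gap, gap)
    return 2.0 * largest_gap


def threshold_class(thresholds: Sequence[float]) -> List[Hypothesis]:
    """Hypothetical toy class: h_t(z) = 1 if z > t else 0."""
    return [lambda z, t=t: int(z > t) for t in thresholds]


hypotheses = threshold_class([0.0, 1.0, 2.0])
source_latents = [0.5, 0.5, 1.5, 2.5]  # toy stand-in for samples from P
target_latents = [0.5, 1.5, 1.5, 1.5]  # toy stand-in for samples from Q
print(empirical_hdh_divergence(hypotheses, source_latents, target_latents))  # 1.0
```

Identical samples give an estimate of zero, which matches the intuition that aligned latent distributions leave every pairwise disagreement rate unchanged.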

###### Proof of Theorem [1](https://arxiv.org/html/2605.17359#Thmtheorem1).

Let $\alpha = (\alpha_{1},\dots,\alpha_{K})$ be convex weights such that $\alpha_{k} \geq 0$ and $\sum_{k=1}^{K}\alpha_{k} = 1$. Define the source-mixture distribution

$$P_{\alpha} = \sum_{k=1}^{K}\alpha_{k}P_{k}$$

and the corresponding weighted source error

$$\epsilon_{\alpha}(h) = \sum_{k=1}^{K}\alpha_{k}\epsilon_{k}(h).$$

Let

$$h^{*} \in \arg\min_{h\in\mathcal{H}}\bigl(\epsilon_{\alpha}(h) + \epsilon_{t}(h)\bigr),$$

and define

$$\lambda_{\alpha,t}^{*} = \epsilon_{\alpha}(h^{*}) + \epsilon_{t}(h^{*}).$$

Starting from the identity

$$\epsilon_{t}(h) = \epsilon_{\alpha}(h) + \Bigl[\bigl(\epsilon_{t}(h) - \epsilon_{t}(h^{*})\bigr) - \bigl(\epsilon_{\alpha}(h) - \epsilon_{\alpha}(h^{*})\bigr)\Bigr] + \bigl(\epsilon_{t}(h^{*}) - \epsilon_{\alpha}(h^{*})\bigr),$$

bounding the bracketed term by its absolute value, and using $\epsilon_{t}(h^{*}) - \epsilon_{\alpha}(h^{*}) \leq \epsilon_{t}(h^{*}) + \epsilon_{\alpha}(h^{*}) = \lambda_{\alpha,t}^{*}$ (which holds because $\epsilon_{\alpha}(h^{*}) \geq 0$ for a nonnegative loss), we obtain

$$\epsilon_{t}(h) \leq \epsilon_{\alpha}(h) + \Bigl|\epsilon_{t}(h) - \epsilon_{t}(h^{*}) - \bigl(\epsilon_{\alpha}(h) - \epsilon_{\alpha}(h^{*})\bigr)\Bigr| + \lambda_{\alpha,t}^{*}.$$

Using the standard multi-source domain adaptation bound based on $\mathcal{H}\Delta\mathcal{H}$, the absolute term is bounded by

$$\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{\alpha},P_{t}).$$

Therefore,

$$\epsilon_{t}(h) \leq \epsilon_{\alpha}(h) + \frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P_{\alpha},P_{t}) + \lambda_{\alpha,t}^{*}.$$

Substituting $\epsilon_{\alpha}(h) = \sum_{k=1}^{K}\alpha_{k}\epsilon_{k}(h)$ concludes the proof. ∎

##### Interpretation\.

Theorem [1](https://arxiv.org/html/2605.17359#Thmtheorem1) bounds the target-domain error by three terms: the weighted source-domain error, the latent-space divergence between the source mixture and the target domain, and the best joint error achievable by the hypothesis class. In TopoPrior, the adversarial regularizer is intended to reduce latent-space domain discrimination, thereby lowering divergence and supporting transfer to unseen domains.

### A-B Proofs for Section [III-E2](https://arxiv.org/html/2605.17359#S3.SS5.SSS2): Topology Initialization as Search Acceleration

We now analyze how improved initialization can reduce refinement rounds and token cost. Let

$$U_{t} = \mathbb{E}\bigl[J_{\lambda}(G_{t};q,y)\bigr]$$

be the expected utility at evolution step $t$, starting from $G_{0}$, and let $U^{*}$ denote the optimal utility achievable by the evolution procedure.

###### Proof of Theorem [2](https://arxiv.org/html/2605.17359#Thmtheorem2).

Under Assumption [1](https://arxiv.org/html/2605.17359#Thmassumption1), for all $t$,

$$U^{*} - U_{t+1} \leq (1-\eta)(U^{*} - U_{t}).$$

Recursively applying this inequality for $T$ steps gives

$$U^{*} - U_{T} \leq (1-\eta)^{T}(U^{*} - U_{0}).$$

To ensure $U^{*} - U_{T} \leq \epsilon$, it suffices to require

$$(1-\eta)^{T}(U^{*} - U_{0}) \leq \epsilon,$$

which is equivalent to

$$T \geq \frac{\log\bigl((U^{*} - U_{0})/\epsilon\bigr)}{\log\bigl(1/(1-\eta)\bigr)},$$

since $0 < 1-\eta < 1$. This completes the proof. ∎
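The round bound in this proof is easy to evaluate numerically. The sketch below computes the smallest integer $T$ that satisfies it; the function name `rounds_needed` and the example values are hypothetical, and the contraction rate $\eta$ is simply taken as given, as in Assumption 1.

```python
import math


def rounds_needed(u_star: float, u0: float, eps: float, eta: float) -> int:
    """Smallest integer T with (1 - eta)**T * (u_star - u0) <= eps (hypothetical helper)."""
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie strictly between 0 and 1")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    gap = u_star - u0  # initial utility gap U* - U_0
    if gap <= eps:
        return 0  # the initial graph already meets the tolerance
    # T >= log(gap / eps) / log(1 / (1 - eta)), rounded up to an integer.
    return math.ceil(math.log(gap / eps) / math.log(1.0 / (1.0 - eta)))


# Illustrative values only: U* = 0.9, eps = 0.05, eta = 0.3.
print(rounds_needed(0.9, 0.5, 0.05, 0.3))  # 6 (better initialization)
print(rounds_needed(0.9, 0.1, 0.05, 0.3))  # 8 (weaker initialization)
```

Because the gap enters only through a logarithm, halving the initial gap removes a fixed number of rounds rather than halving the round count.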

###### Proof of Corollary [1](https://arxiv.org/html/2605.17359#Thmcorollary1).

Applying Theorem [2](https://arxiv.org/html/2605.17359#Thmtheorem2) to prior-based and scratch initialization gives

$$T_{\mathrm{prior}}(\epsilon) = \frac{\log\bigl((U^{*} - U_{0}^{\mathrm{prior}})/\epsilon\bigr)}{\log\bigl(1/(1-\eta)\bigr)}, \qquad T_{\mathrm{scratch}}(\epsilon) = \frac{\log\bigl((U^{*} - U_{0}^{\mathrm{scratch}})/\epsilon\bigr)}{\log\bigl(1/(1-\eta)\bigr)}.$$

Subtracting yields

$$T_{\mathrm{prior}}(\epsilon) - T_{\mathrm{scratch}}(\epsilon) = \frac{\log\!\left(\dfrac{U^{*} - U_{0}^{\mathrm{prior}}}{U^{*} - U_{0}^{\mathrm{scratch}}}\right)}{\log\bigl(1/(1-\eta)\bigr)}.$$

Since $U_{0}^{\mathrm{prior}} > U_{0}^{\mathrm{scratch}}$, we have $0 < U^{*} - U_{0}^{\mathrm{prior}} < U^{*} - U_{0}^{\mathrm{scratch}}$ whenever the prior-based initialization is not already optimal. The ratio inside the logarithm is then below one, so the numerator is negative while the denominator is positive, and $T_{\mathrm{prior}}(\epsilon) < T_{\mathrm{scratch}}(\epsilon)$. ∎
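To make the corollary concrete, here is a worked instance with hypothetical values $\eta = 0.3$, $U^{*} = 0.9$, $U_{0}^{\mathrm{prior}} = 0.5$, and $U_{0}^{\mathrm{scratch}} = 0.1$. These numbers are chosen only for illustration and are not taken from the experiments.

```latex
\begin{aligned}
T_{\mathrm{prior}}(\epsilon) - T_{\mathrm{scratch}}(\epsilon)
  &= \frac{\log\bigl((0.9 - 0.5)/(0.9 - 0.1)\bigr)}{\log\bigl(1/(1 - 0.3)\bigr)} \\
  &= \frac{\log 0.5}{\log(1/0.7)}
   \approx \frac{-0.693}{0.357}
   \approx -1.94 .
\end{aligned}
```

Under these assumed values the bound on the required rounds drops by about two, and the difference does not depend on the tolerance $\epsilon$.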

Next, we formalize token\-cost savings from fewer refinement rounds and sparser initial graphs\.

###### Assumption 2 (Token cost bounded by graph size).

There exist constants $a_{V}, a_{E}, b \geq 0$ such that for any graph $G$ and query $q$,

$$C(G;q) \leq a_{V}|V(G)| + a_{E}|E(G)| + b.$$

###### Assumption 3 (Bounded graph size during evolution).

For all steps $t$ before termination,

$$\mathbb{E}\bigl[|V(G_{t})| + |E(G_{t})|\bigr] \leq M$$

for some constant $M \geq 1$.

###### Proposition 1 (Total expected token cost over $T$ rounds).

Under Assumptions [2](https://arxiv.org/html/2605.17359#Thmassumption2) and [3](https://arxiv.org/html/2605.17359#Thmassumption3),

$$\mathbb{E}\!\left[\sum_{t=0}^{T-1}C(G_{t};q)\right] \leq T\bigl((a_{V} + a_{E})M + b\bigr).$$

###### Proof.

From Assumption [2](https://arxiv.org/html/2605.17359#Thmassumption2), we have

$$C(G_{t};q) \leq a_{V}|V(G_{t})| + a_{E}|E(G_{t})| + b.$$

Taking expectations on both sides and using Assumption [3](https://arxiv.org/html/2605.17359#Thmassumption3), we obtain

$$\mathbb{E}[C(G_{t};q)] \leq a_{V}\mathbb{E}[|V(G_{t})|] + a_{E}\mathbb{E}[|E(G_{t})|] + b \leq (a_{V} + a_{E})M + b.$$

Summing over $t = 0,\dots,T-1$ establishes the result. ∎

###### Corollary 2 (Token savings from fewer rounds).

If TopoPrior reduces the number of rounds from $T_{\mathrm{scratch}}$ to $T_{\mathrm{prior}}$, then the reduction in the bound of Proposition [1](https://arxiv.org/html/2605.17359#Thmproposition1) is

$$\bigl(T_{\mathrm{scratch}} - T_{\mathrm{prior}}\bigr)\bigl((a_{V} + a_{E})M + b\bigr).$$
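The bound of Proposition 1 and the reduction in Corollary 2 can be evaluated directly once the constants are fixed. In the sketch below, the helper names and every numeric constant (`a_v`, `a_e`, `m`, `b`, and the round counts) are hypothetical placeholders rather than values reported in the paper.

```python
def token_cost_bound(rounds: int, a_v: float, a_e: float, m: float, b: float) -> float:
    """Upper bound T * ((a_V + a_E) * M + b) on expected total token cost."""
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    per_round_bound = (a_v + a_e) * m + b  # bound on the expected cost of one round
    return rounds * per_round_bound


def bound_reduction(
    t_scratch: int, t_prior: int, a_v: float, a_e: float, m: float, b: float
) -> float:
    """Reduction in the bound when rounds drop from t_scratch to t_prior."""
    return (t_scratch - t_prior) * ((a_v + a_e) * m + b)


# Assumed constants: 200 tokens per node, 50 per edge, graph size at most 12,
# and 100 tokens of fixed overhead per round.
scratch_bound = token_cost_bound(8, 200, 50, 12, 100)
prior_bound = token_cost_bound(6, 200, 50, 12, 100)
print(scratch_bound, prior_bound)              # 24800 18600
print(bound_reduction(8, 6, 200, 50, 12, 100))  # 6200
```

The reduction is a difference of upper bounds, so it indicates the scale of possible savings rather than a guaranteed amount of tokens saved.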

###### Assumption 4 (Token cost monotonicity in graph size).

For fixed $q$, $C(G;q)$ is non-decreasing in both $|V(G)|$ and $|E(G)|$.

###### Proposition 2 (Sparser initialization reduces first-round cost).

Under Assumption [4](https://arxiv.org/html/2605.17359#Thmassumption4), if the prior-initialized graph $G_{0}^{\mathrm{prior}}$ satisfies

$$\mathbb{E}\bigl[|V(G_{0}^{\mathrm{prior}})|\bigr] \leq \mathbb{E}\bigl[|V(G_{0}^{\mathrm{scratch}})|\bigr], \qquad \mathbb{E}\bigl[|E(G_{0}^{\mathrm{prior}})|\bigr] \leq \mathbb{E}\bigl[|E(G_{0}^{\mathrm{scratch}})|\bigr],$$

then

$$\mathbb{E}\bigl[C(G_{0}^{\mathrm{prior}};q)\bigr] \leq \mathbb{E}\bigl[C(G_{0}^{\mathrm{scratch}};q)\bigr].$$

###### Proof\.

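By Assumption [4](https://arxiv.org/html/2605.17359#Thmassumption4), whenever the prior-initialized graph has no more nodes and no more edges than the scratch-initialized graph, its token cost is no greater. Taking expectations preserves the inequality. This step uses the size ordering for each realization of the two graphs; an ordering of expected sizes alone carries over to expected cost only under additional structure on $C$, for example when $C$ is affine in the node and edge counts. ∎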

##### Interpretation\.

These results highlight two complementary efficiency effects. First, under Assumption [1](https://arxiv.org/html/2605.17359#Thmassumption1), better initialization reduces the number of refinement rounds needed to achieve a target utility. Second, if the initialized graph is also sparser, the token cost of the first round is no higher than with scratch initialization. Together, these results provide analytical support for the efficiency trends observed in our experiments.
