Taming "Zombie'' Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution

arXiv cs.CL 05/19/26, 04:00 AM Papers
multi-agent-systems llm markov-model state-aware resilience agent-revive graph-evolution
Summary
Introduces AgentRevive, a Markov state-aware framework for resilient multi-agent collaboration that uses soft state transitions (Active, Standby, Terminated) to prevent premature pruning of agents that may recover, reducing token consumption while improving performance on reasoning and domain tasks.
arXiv:2605.17348v1 Announce Type: new Abstract: Recent advancements in LLM-based multi-agent systems have demonstrated remarkable collaborative capabilities across complex tasks. To improve overall efficiency, existing methods often rely on aggressive graph evolution among agents (e.g., node or edge pruning), which risks prematurely discarding valuable agents due to transient issues such as hallucinations or temporary knowledge gaps. However, such hard pruning overlooks the potential for ``zombie'' agents to recover and contribute in subsequent discussion rounds. In this paper, we propose AgentRevive, a Markov state-aware framework for resilient multi-agent evolution. Our approach dynamically manages agent collaboration through soft state transitions, implemented via two key components: (1) State-Aware Policy Learning: Agent states are divided into ``Active'', ``Standby'', and ``Terminated'' states, selectively propagating messages based on agent memory. The policy employs a risk estimator to optimize agent state transitions by assessing hallucination risk, minimizing the influence of unreliable nodes while safeguarding valuable ones. (2) State-Aware Edge Optimization: Subgraph edges are pruned according to states learned from the policy, permanently removing ``Terminated'' nodes and retaining ``Standby'' nodes for subsequent rounds to assess their potential future contributions. Extensive experiments on general reasoning, domain-specific, and hallucination challenge tasks show that our method consistently outperforms strong baselines and significantly reduces token consumption through state-aware agent scheduling.
Original Article
View Cached Full Text
Cached at: 05/19/26, 06:39 AM
# Taming “Zombie” Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution
Source: [https://arxiv.org/html/2605.17348](https://arxiv.org/html/2605.17348)
Taolin Zhang1, Pukun Zhao2, Qizhou Chen4, Jiuheng Wan1, Chen Chen2, Xiaofeng He4, Chengyu Wang3,Richang Hong111footnotemark:1 1School of Computer Science and Information Engineering, Hefei University of Technology 2Guangdong University of Finance and Economics 3Alibaba Group4East China Normal University tlzhang@hfut\.edu\.cn, chengyu\.wcy@alibaba\-inc\.com

###### Abstract

Recent advancements in LLM\-based multi\-agent systems have demonstrated remarkable collaborative capabilities across complex tasks\. To improve overall efficiency, existing methods often rely on aggressive graph evolution among agents \(e\.g\., node or edge pruning\), which risks prematurely discarding valuable agents due to transient issues such as hallucinations or temporary knowledge gaps\. However, such hard pruning overlooks the potential for “zombie” agents to recover and contribute in subsequent discussion rounds\. In this paper, we proposeAgentRevive, a Markov state\-aware framework for resilient multi\-agent evolution\. Our approach dynamically manages agent collaboration through soft state transitions, implemented via two key components: \(1\)State\-Aware Policy Learning: Agent states are divided into “Active”, “Standby”, and “Terminated” states, selectively propagating messages based on agent memory\. The policy employs a risk estimator to optimize agent state transitions by assessing hallucination risk, minimizing the influence of unreliable nodes while safeguarding valuable ones\. \(2\)State\-Aware Edge Optimization: Subgraph edges are pruned according to states learned from the policy, permanently removing “Terminated” nodes and retaining “Standby” nodes for subsequent rounds to assess their potential future contributions\. Extensive experiments on general reasoning, domain\-specific, and hallucination challenge tasks show that our method consistently outperforms strong baselines and significantly reduces token consumption through state\-aware agent scheduling\.

Taming “Zombie” Agents: A Markov State\-Aware Framework for Resilient Multi\-Agent Evolution

Taolin Zhang1, Pukun Zhao2, Qizhou Chen4, Jiuheng Wan1, Chen Chen2, Xiaofeng He4,Chengyu Wang3††thanks:C\. Wang and R\. Hong are co\-corresponding authors\.,Richang Hong111footnotemark:11School of Computer Science and Information Engineering, Hefei University of Technology2Guangdong University of Finance and Economics3Alibaba Group4East China Normal Universitytlzhang@hfut\.edu\.cn, chengyu\.wcy@alibaba\-inc\.com

![Refer to caption](https://arxiv.org/html/2605.17348v1/x1.png)Figure 1:Comparison of agent graph topology evolution between ourAgentReviveframework and strong training paradigms\. \(Best viewed in color\.\)## 1Introduction

LLM\-powered multi\-agent systems \(MAS\) have emerged as a transformative paradigm for tackling complex tasks, demonstrating superior performance over single\-agent methods through collaborative reasoning and planningLinet al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib1)\); Zhanget al\.\([2025d](https://arxiv.org/html/2605.17348#bib.bib2)\)\. The efficacy of MAS depends critically on their inter\-agent communication topologies, which govern how information is exchanged and assimilated among agentsGuoet al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib3)\); Yanet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib4)\)\. Consequently, recent research has focused on optimizing communication structures to enhance both performance and efficiencyZhanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib6)\); Wanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib8),[b](https://arxiv.org/html/2605.17348#bib.bib7)\)\.

Approaches addressing communication redundancy in MAS can be broadly categorized into three paradigms:\(1\) Vanilla MAS\.These systems rely on manually crafted communication templates, such as chains, trees, or fully connected graphsZhanget al\.\([2024b](https://arxiv.org/html/2605.17348#bib.bib9)\); Zhugeet al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib10)\); Ganet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib15)\)\. While straightforward to implement, fixed topologies lack adaptability, resulting in inflexible and often inefficient agent interactions that do not dynamically align with task\-specific demands\.\(2\) Graph\-pruning\-based MAS\.This paradigm models agent interactions as graph structures and applies topology\-aware learning to prune redundant edges or nodesWanget al\.\([2025b](https://arxiv.org/html/2605.17348#bib.bib7)\); Zhanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib6)\); Boyiet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib14)\)\. However, this “hard pruning” strategy irreversibly removes nodes and edges, potentially discarding useful but temporarily inactive “zombie” agents\. As a result, the final topology may suffer performance loss, since pruned elements cannot be reactivated even as task contexts change\.\(3\) Graph\-generation MAS\.Recent efforts explore autoregressive, dynamic agent graph generation, constructing the collaboration graph from scratch by sequentially generating agent roles and connectionsWanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib8)\); Liet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib11)\)\. While this paradigm increases flexibility and avoids initial redundancy, it operates in a purely forward\-generative manner without considering the global topological stateQianet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib12)\)\. Consequently, it may fail to reassess or reintegrate previously excluded agents that could become relevant as task conditions evolve, limiting its ability to optimize the communication structure\. As shown in Fig\.[1](https://arxiv.org/html/2605.17348#S0.F1), unlike the three paradigms above, which make permanent pruning decisions, our approach \(bottom\) allows agent reactivation in later rounds, such as “Round 3”\.

We introduceAgentRevive, a Markov state\-aware framework designed for resilient multi\-agent evolution\. Our core insight is to treat agent collaboration as a soft, state\-aware process rather than relying on hard\-pruning decisions\.AgentRevivefeatures two key components:

- •State\-Aware Policy Learning: Learns optimal state transitions for each agent node across communication rounds\. We model the agent lifecycle with three states: “Active”, “Standby”, and “Terminated”\. State transitions at each round are conditioned on the agent’s previous state, its own response, and messages from neighboring agents\. To stabilize policy learning under this Markov decision process \(MDP\), we augment the conventional reward signal, which jointly considers task performance and token efficiency, with a risk estimator\. It penalizes strategies that retain agents prone to hallucinated or contradictory responsesCemriet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib16)\); Zhanget al\.\([2025c](https://arxiv.org/html/2605.17348#bib.bib17)\), encouraging dynamic suspension of unreliable nodes without permanent removal\.
- •State\-Aware Edge Optimization: Prunes subgraph edges based on agent states learned from the policy, permanently removing “Terminated” nodes and retaining “Standby” nodes in subsequent rounds to observe their potential contribution to current tasks\. Specifically, it constructs a binary node mask based on the survival rates of each agent across multiple inferences, applied to the adjacency matrices of both spatial and temporal edges\. This yields a sparsified yet effective communication graph that balances task performance with token efficiency\.

Experiments across general reasoning, domain\-specific, and hallucination benchmarks demonstrate thatAgentReviveimproves task\-averaged performance by\+2\.33%compared to strong pruning\-based and dynamic autoregressive baselines, while reducing token overhead by15%through adaptive agent state management\.

## 2Related Work

### 2\.1Vanilla Agent Collaboration

Early works demonstrate the effectiveness of single LLM agents in reasoning and planning through structured prompting techniques like chain\-of\-thought \(CoT\)Weiet al\.\([2022](https://arxiv.org/html/2605.17348#bib.bib18)\)and self\-consistency \(SC\)Wanget al\.\([2023](https://arxiv.org/html/2605.17348#bib.bib34)\)\. Subsequent works reveal that MAS can outperform single\-agent systems by leveraging specialized capabilities through techniques ranging from majority votingChenet al\.\([2024a](https://arxiv.org/html/2605.17348#bib.bib35)\)to sophisticated interaction mechanismsChenet al\.\([2023](https://arxiv.org/html/2605.17348#bib.bib36)\)\. Recent studies have investigated various predefined communication topologies: \(1\)Non\-interactive: Independent agent operation without interaction, exemplified by LATMZhanget al\.\([2024a](https://arxiv.org/html/2605.17348#bib.bib37)\), LLM\-BlenderJianget al\.\([2023](https://arxiv.org/html/2605.17348#bib.bib38)\), and LLM\-DebateDuet al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib39)\); \(2\)Chain: Sequential information flow through connected agents, as implemented in ChatDevQianet al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib40)\), MetaGPTHonget al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib41)\), and L2MACHoltet al\.\([2023](https://arxiv.org/html/2605.17348#bib.bib42)\); \(3\)Star: Centralized coordination through a commander agent, demonstrated in AutoGenWuet al\.\([2023](https://arxiv.org/html/2605.17348#bib.bib43)\); \(4\)Tree: Hierarchical organization with root\-level management, such as SoAIshibashi and Nishimura \([2024](https://arxiv.org/html/2605.17348#bib.bib45)\)\. While these predefined templates facilitate effective MAS interaction, they inherently lack flexibility and scalability\.

### 2\.2MAS Topologies as Graphs

To improve adaptability, recent approaches have explored learning dynamic communication graphs for MAS from task data\. GPTSwarmZhugeet al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib10)\)parameterizes agent interactions with DAG topologies optimized via reinforcement learning\. DSPyKhattabet al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib47)\)is a programming model that abstracts LLM pipelines as text transformation graphs\. DyLANLiuet al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib50)\)dynamically selects agent teams for task\-specific collaboration\. EvoMACHuet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib51)\)employs environmental feedback and textual backpropagation for network updates\. However, these models cannot address redundancy in communication graph structures caused by query\-adaptive topology generation\. Graph\-pruning\-based methodsWanget al\.\([2025b](https://arxiv.org/html/2605.17348#bib.bib7)\); Zhanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib6)\); Boyiet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib14)\)remove redundant nodes and edges in the temporal and spatial dimensions of the graph based on query\-specific characteristics during dynamic topology learning, ultimately forming an adaptive sparse topology for answering the query\. Additionally, autoregressive dynamic graph generation methodsJiet al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib53)\); Wanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib8)\); Liet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib11)\)enable multi\-agent pipelines to dynamically generate decision trajectories from scratch, rather than pruning from an initial graph\.

## 3Problem Formulation

In LLM\-based multi\-agent systems \(MAS\), agents may temporarily enter a “zombie” state, i\.e\., a failure mode caused by hallucinations or knowledge gapsLinet al\.\([2025b](https://arxiv.org/html/2605.17348#bib.bib48)\)\. Previous pruning methodsZhanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib6)\); Boyiet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib14)\)treat such agents as redundant and remove them permanently\. However, if these agents recover in subsequent rounds, they can potentially contribute critically at a later stage\.111Due to space limitations, we refer readers to Appendix[A](https://arxiv.org/html/2605.17348#A1)for notations and basic task formulation descriptions\.

To address this, we propose a Markov state\-aware collaboration graph framework that dynamically manages agent states across communication rounds\. Specifically, we model MAS as a state\-aware collaboration graph𝒢=\(𝒱,ℰ𝒯,ℰ𝒮,𝐒\)\\mathcal\{G\}=\(\\mathcal\{V\},\\mathcal\{E\}^\{\\mathcal\{T\}\},\\mathcal\{E\}^\{\\mathcal\{S\}\},\\mathbf\{S\}\), where𝐒\(t\)=\{s1\(t\),s2\(t\),…,sN\(t\)\}\\mathbf\{S\}^\{\(t\)\}=\\\{\\mathit\{s\}\_\{1\}^\{\(t\)\},\\mathit\{s\}\_\{2\}^\{\(t\)\},\\ldots,\\mathit\{s\}\_\{N\}^\{\(t\)\}\\\}denotes the state of each agent at roundtt, andsi\(t\)∈\{‘‘Active’’,‘‘Standby’’,‘‘Terminated’’\}\\mathit\{s\}\_\{i\}^\{\(t\)\}\\in\\\{\\texttt\{\`\`Active''\},\\texttt\{\`\`Standby''\},\\texttt\{\`\`Terminated''\}\\\}\. The state transition for each agent is governed by a stochastic policy:

si\(t\+1\)∼π\(⋅∣si\(t\),h\(t\),m𝒯\(t\+1\),m𝒮\(t\+1\)\)\\mathit\{s\}\_\{i\}^\{\(t\+1\)\}\\sim\\pi\\left\(\\cdot\\mid\\mathit\{s\}\_\{i\}^\{\(t\)\},h^\{\(t\)\},m\_\{\\mathcal\{T\}\}^\{\(t\+1\)\},m\_\{\\mathcal\{S\}\}^\{\(t\+1\)\}\\right\)\(1\)whereh\(t\)h^\{\(t\)\}denotes the interaction history\.

We then define the effective subgraph after policy state changes at roundttas𝒢eff\(t\)=\(𝒱eff\(t\),ℰeff\(t\)\)\\mathcal\{G\}\_\{\\text\{eff\}\}^\{\(t\)\}=\(\\mathcal\{V\}\_\{\\text\{eff\}\}^\{\(t\)\},\\mathcal\{E\}\_\{\\text\{eff\}\}^\{\(t\)\}\)\. The effective agent nodes are:

𝒱eff\(t\)=\{vi∣si\(t\)∈\{‘‘Active’’,‘‘Standby’’\}\}\\mathcal\{V\}\_\{\\text\{eff\}\}^\{\(t\)\}=\\\{\\mathit\{v\}\_\{i\}\\mid\\mathit\{s\}\_\{i\}^\{\(t\)\}\\in\\\{\\texttt\{\`\`Active''\},\\texttt\{\`\`Standby''\}\\\}\\\}\(2\)whereℰeff\(t\)\\mathcal\{E\}\_\{\\text\{eff\}\}^\{\(t\)\}comprises edges between agents in𝒱eff\(t\)\\mathcal\{V\}\_\{\\text\{eff\}\}^\{\(t\)\}\. We next reformulate communication redundancy by incorporating agent statesZhanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib6)\)\.

Definition 1 \(State\-Aware Redundancy\)\.Given a state\-aware collaboration graph𝒢=\(𝒱,ℰ𝒯,ℰ𝒮,𝐒\)\\mathcal\{G\}=\(\\mathcal\{V\},\\mathcal\{E\}^\{\\mathcal\{T\}\},\\mathcal\{E\}^\{\\mathcal\{S\}\},\\mathbf\{S\}\), an agentvi\\mathit\{v\}\_\{i\}is considered redundant at roundttif:

si\(t\)=‘‘Terminated’’andϕ\(𝒢eff\(t\)\)≥ϕ\(𝒢\)\\mathit\{s\}\_\{i\}^\{\(t\)\}=\\texttt\{\`\`Terminated''\}\\quad\\text\{and\}\\quad\\phi\(\\mathcal\{G\}\_\{\\text\{eff\}\}^\{\(t\)\}\)\\geq\\phi\(\\mathcal\{G\}\)\(3\)whereϕ\(⋅\)\\phi\(\\cdot\)is a utility function measuring task performance\. The state\-aware pruning objective is to find a policyπ\\pithat minimizes the effective state\-aware graph size while maintaining performance:

minπ∑t=1T\|𝒢eff\(t\)\|,s\.t\.∀t\|ϕ\(𝒢eff\(t\)\)−ϕ\(𝒢\)\|≤ϵ\.\\min\_\{\\pi\}\\sum\_\{t=1\}^\{T\}\\left\|\\mathcal\{G\}\_\{\\text\{eff\}\}^\{\(t\)\}\\right\|,\\quad\\text\{s\.t\.\}\\;\\forall t\\quad\|\\phi\(\\mathcal\{G\}\_\{\\text\{eff\}\}^\{\(t\)\}\)\-\\phi\(\\mathcal\{G\}\)\|\\leq\\epsilon\.\(4\)
Table[1](https://arxiv.org/html/2605.17348#S3.T1)summarizes how our Markov state\-aware framework offers distinct advantages over conventional graph\-based approaches\.

MethodTaskAdaptiveVariableNode SizeFlexibleStateManual Design✗✗✗APZhanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib6)\)✓✗✗G\-DZhanget al\.\([2025b](https://arxiv.org/html/2605.17348#bib.bib46)\)✓✗✗ADWanget al\.\([2025b](https://arxiv.org/html/2605.17348#bib.bib7)\)✓✓✗ARG\-DLiet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib11)\)✓✓✗AgentRevive\(Ours\)✓✓✓Table 1:Comparison across MAS paradigms\.✓and✗denote full and no support for each capability\.![Refer to caption](https://arxiv.org/html/2605.17348v1/x2.png)Figure 2:Overview ofAgentRevive\. Our framework mainly consists of two stages for iteratively training: \(1\) State\-Aware Policy Learning is used to aggregate messages around nodes and train agent state policy networks\. \(2\) State\-aware Edge Optimization further optimizes the weights of edges around nodes for messages propagation\.
## 4Methodology

### 4\.1Notations

We first convert the state\-aware initial collaboration graph𝒢\\mathcal\{G\}into a trainable weighted graph𝒢~\\tilde\{\\mathcal\{G\}\}, leveraging pre\-defined spatial edgesℰ𝒮\\mathcal\{E\}^\{\\mathcal\{S\}\}and temporal edgesℰ𝒯\\mathcal\{E\}^\{\\mathcal\{T\}\}\. Each edge in the graph is assigned a trainable continuous weight in the range\[0,1\]\[0,1\]\. Let the adjacency matrix set of𝒢~\\tilde\{\\mathcal\{G\}\}be𝒜~=𝒜~𝒮∪𝒜~𝒯\\tilde\{\\mathcal\{A\}\}=\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{S\}\}\\cup\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{T\}\}, where𝒜~𝒮=⋃t𝒜~𝒮\(t\)\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{S\}\}=\\bigcup\_\{t\}\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{S\}\}^\{\(t\)\}is the subset containing same\-round adjacency matrices, where𝒜~𝒮\(t\)∈\[0,1\]N×N\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{S\}\}^\{\(t\)\}\\in\[0,1\]^\{N\\times N\}\.𝒜~𝒯=⋃t𝒜~𝒯\(t\)\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{T\}\}=\\bigcup\_\{t\}\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{T\}\}^\{\(t\)\}is the subset containing temporal\-round adjacency matrices, where𝒜~𝒯\(t\)∈\[0,1\]N×N\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{T\}\}^\{\(t\)\}\\in\[0,1\]^\{N\\times N\}represents connections between rounds\(t−1\)\(t\-1\)andtt\. The final effective inference graph𝒢eff\\mathcal\{G\}\_\{\\text\{eff\}\}is a DAG, obtained through the learned agent state\-aware policyπ\\pi\. Fig\.[2](https://arxiv.org/html/2605.17348#S3.F2)is an overview ofAgentRevive\.

### 4\.2State\-Aware Policy Learning

We introduce a paradigm shift from hard pruning to dynamic state management with two key contributions: \(1\) State\-Aware Message Passing: in contrast to static hard pruning methodsWanget al\.\([2025b](https://arxiv.org/html/2605.17348#bib.bib7)\); Zhanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib6)\), we govern graph message flow using Markov states\. Only agents with the “Active” or “Standby” state are permitted to propagate messages, forming a dynamic topology that prevents irreversible node removing\. \(2\) State\-aware Policy Decision: the policy determines the optimal state for each node at every iteration\. A critical capability is its reactivation of “zombie” agents from “Standby” to “Active” when contextually advantageousLinet al\.\([2025b](https://arxiv.org/html/2605.17348#bib.bib48)\), thereby ensuring resilience against transient failures\.

#### 4\.2\.1State\-aware Message Passing

In ourAgentRevivemodel, the aggregated message𝒵i\(t\)\\mathcal\{Z\}\_\{i\}^\{\(t\)\}for agentvi\(t\)v\_\{i\}^\{\(t\)\}at roundttintegrates both spatial and temporal information from the graph𝒢~\\tilde\{\\mathcal\{G\}\}:

𝒵i\(𝒮,\(t\)\)=∑vj\(t\)∈𝒩in𝒮\(vi\(t\)\)𝒜~𝒮\(t\)\[i,j\]⋅ℱ\(sj\(t\)\)\\displaystyle\\mathcal\{Z\}\_\{i\}^\{\(\\mathcal\{S\},\(t\)\)\}=\\sum\_\{v\_\{j\}^\{\(t\)\}\\in\\mathcal\{N\}\_\{in\}^\{\\mathcal\{S\}\}\(v\_\{i\}^\{\(t\)\}\)\}\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{S\}\}^\{\(t\)\}\[i,j\]\\cdot\\mathcal\{F\}\(\\mathit\{s\}\_\{j\}^\{\(t\)\}\)\(5\)𝒵i\(t\)=\[𝒵i\(𝒮,\(t\)\)∥𝒵i\(𝒯,\(t\)\)\]\\displaystyle\\mathcal\{Z\}\_\{i\}^\{\(t\)\}=\[\\mathcal\{Z\}\_\{i\}^\{\(\\mathcal\{S\},\(t\)\)\}\\parallel\\mathcal\{Z\}\_\{i\}^\{\(\\mathcal\{T\},\(t\)\)\}\]\(6\)where∥\\parallelmeans concatenation\.𝒵i\(𝒮,\(t\)\)\\mathcal\{Z\}\_\{i\}^\{\(\\mathcal\{S\},\(t\)\)\}is spatial neighboring messages, and the temporal messages𝒵i\(𝒯,\(t\)\)\\mathcal\{Z\}\_\{i\}^\{\(\\mathcal\{T\},\(t\)\)\}is also obtained similarly\.

After aggregating messages for each agent node, the state\-aware responseoj\(t\)\\mathit\{o\}\_\{j\}^\{\(t\)\}of nodevj\(t\)v\_\{j\}^\{\(t\)\}at current roundttrewrites Eq\.[22](https://arxiv.org/html/2605.17348#A1.E22)as follows:

oj\(t\)=\{ℱ′\(sj\(t\)\),ifsj\(t\)=“A”fc\(oj\(t−1\)\),ifsj\(t\)=“S”\\displaystyle\\mathit\{o\}\_\{j\}^\{\(t\)\}=\\begin\{cases\}\\mathcal\{F\}^\{\\prime\}\(\\mathit\{s\}\_\{j\}^\{\(t\)\}\),&\\text\{if \}\\mathit\{s\}\_\{j\}^\{\(t\)\}=\\text\{\`\`A''\}\\\\ f\_\{c\}\(\\mathit\{o\}\_\{j\}^\{\(t\-1\)\}\),&\\text\{if \}\\mathit\{s\}\_\{j\}^\{\(t\)\}=\\text\{\`\`S''\}\\end\{cases\}\(7\)ℱ′\(sj\(t\)\)=fpr\(ri\(t\),hi\(t\),𝒬,𝒵i\(t\)\)\\displaystyle\\mathcal\{F\}^\{\\prime\}\(\\mathit\{s\}\_\{j\}^\{\(t\)\}\)=f\_\{\\text\{pr\}\}\\left\(r\_\{i\}^\{\(t\)\},h\_\{i\}^\{\(t\)\},\\mathcal\{Q\},\\mathcal\{Z\}\_\{i\}^\{\(t\)\}\\right\)\(8\)whereℱ′\(sj\(t\)\)\\mathcal\{F\}^\{\\prime\}\(\\mathit\{s\}\_\{j\}^\{\(t\)\}\)is current state response\.oj\(t\)\\mathit\{o\}\_\{j\}^\{\(t\)\}is determined by the agent’s state: if the state is “Active \(A\)”, the current response is used; if the state is “Standby \(S\)”, we employ a summarized previous responsefc\(oj\(t−1\)\)f\_\{c\}\(\\mathit\{o\}\_\{j\}^\{\(t\-1\)\}\)222Since𝒜~\\tilde\{\\mathcal\{A\}\}is an adjacency matrix andℱ′\(sj\(t\)\)\\mathcal\{F\}^\{\{\}^\{\\prime\}\}\(\\mathit\{s\}\_\{j\}^\{\(t\)\}\)is a string, they are combined via weighted prompts \(see Appendix[B\.3](https://arxiv.org/html/2605.17348#A2.SS3)\)\.\. As historical messages have already been transmitted via different agent nodes in previous rounds, we employ LLM to further compress the number of tokens:

fc\(oj\(t−1\)\)=fLLM\(‘‘Summarize:’’\+oj\(t−1\)\)f\_\{c\}\(\\mathit\{o\}\_\{j\}^\{\(t\-1\)\}\)=f\_\{\\text\{LLM\}\}\(\\texttt\{\`\`Summarize:''\}\+\\mathit\{o\}\_\{j\}^\{\(t\-1\)\}\)\(9\)wherefLLMf\_\{\\text\{LLM\}\}means using LLM to limit the number of tokens for summary reasoning directly\.

#### 4\.2\.2State Policy Learning

After obtaining the responseoi\(t\)\\mathit\{o\}\_\{i\}^\{\(t\)\}from state message passing, the model determines the optimal state for agentvi\(t\)v\_\{i\}^\{\(t\)\}using its own agent memory:

si\(t\)∼πθ\(⋅∣fEnc\(si\(t−1\),oi\(t\),hi\(t\),𝒵i\(t\)\)\)s\_\{i\}^\{\(t\)\}\\sim\\pi\_\{\\theta\}\(\\cdot\\mid f\_\{\\text\{Enc\}\}\(s\_\{i\}^\{\(t\-1\)\},\\mathit\{o\}\_\{i\}^\{\(t\)\},h\_\{i\}^\{\(t\)\},\\mathcal\{Z\}\_\{i\}^\{\(t\)\}\)\)\(10\)wheresi\(t\)∈ℝ3s\_\{i\}^\{\(t\)\}\\in\\mathbb\{R\}^\{3\}\. Given the lightweight nature of our framework, we implementπθ\\pi\_\{\\theta\}as a simple MLP for state prediction, andfEnc\(⋅\)f\_\{\\text\{Enc\}\}\(\\cdot\)utilizes an LSTMHochreiter and Schmidhuber \([1997](https://arxiv.org/html/2605.17348#bib.bib52)\)to encode the full memory of agentvi\(t\)v\_\{i\}^\{\(t\)\}, denotedℳvi\(t\)\\mathcal\{M\}\_\{v\_\{i\}\}^\{\(t\)\}\.

To ensure stable policy training for agent states, we consider policy quality as dependent on both task performanceZhanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib6)\)and the level of hallucination present in node responses\. Therefore, the trajectory reward obtained from state transitions of agent nodes is defined as:

R\(τ\)=μ\(𝒢~\(T\)\)\+ηrisk⋅∑t=1Tfrisk\(t\)\\displaystyle R\(\\tau\)=\\mu\(\\tilde\{\\mathcal\{G\}\}^\{\(T\)\}\)\+\\eta\_\{risk\}\\cdot\\sum\_\{t=1\}^\{T\}f\_\{\\text\{risk\}\}^\{\(t\)\}\(11\)frisk\(t\)=−𝔼v∈𝒱A\(t\)\[𝔻KL\(ℳv\(t\)∥ℳ¯𝒱A\(t\)\)\]\\displaystyle f\_\{\\text\{risk\}\}^\{\(t\)\}=\-\\mathbb\{E\}\_\{v\\in\\mathcal\{V\}\_\{\\text\{A\}\}^\{\(t\)\}\}\\left\[\\mathbb\{D\}\_\{\\text\{KL\}\}\\left\(\\mathcal\{M\}\_\{v\}^\{\(t\)\}\\parallel\\bar\{\\mathcal\{M\}\}\_\{\\mathcal\{V\}\_\{\\text\{A\}\}\}^\{\(t\)\}\\right\)\\right\]\(12\)whereτ\\taudenotes the trajectory of agent states over roundstt;μ\(⋅\)\\mu\(\\cdot\)is a utility function measuring the final task score; andηrisk\\eta\_\{risk\}is a balancing coefficient\. Here,𝒱A\(t\)=\{vi∣si\(t\)=“A”\}\\mathcal\{V\}\_\{\\text\{A\}\}^\{\(t\)\}=\\\{v\_\{i\}\\mid s\_\{i\}^\{\(t\)\}=\\text\{\`\`A''\}\\\}denotes the set of agents in “Active” state in roundtt\.frisk\(t\)f\_\{\\text\{risk\}\}^\{\(t\)\}quantifies the hallucinatory contradiction of agent nodevi\(t\)v\_\{i\}^\{\(t\)\}in roundtt, using the KL divergence between each agent’s messageℳvi\(t\)\\mathcal\{M\}\_\{v\_\{i\}\}^\{\(t\)\}and the average messageℳ¯𝒱A\(t\)\\bar\{\\mathcal\{M\}\}\_\{\\mathcal\{V\}\_\{\\text\{A\}\}\}^\{\(t\)\}of all “Active” agents\.

### 4\.3State\-aware Edge Optimization

The strategy trajectory reward focuses on node state transitions\. Here, we further consider how these state changes affect the edge weights𝒜~\\tilde\{\\mathcal\{A\}\}, including both spatial and temporal connections\. Given the learned policyπθ\(𝒬\)\\pi\_\{\\theta\}^\{\(\\mathcal\{Q\}\)\}for query𝒬\\mathcal\{Q\}, we re\-infer the state matrix𝒮\(t\)\\mathcal\{S\}^\{\(t\)\}for each query and calculate the average survival rateωi\\omega\_\{i\}of each nodeviv\_\{i\}, defined as the proportion of non\-“Terminated \(T\)” states acrossLLinference passes\. This yields a binary node mask vector𝐦\\mathbf\{m\}for key node selection:

ωi=1L∑l=1L𝕀\(πθ\(𝒬\)\(vil\)≠“T”\)\\displaystyle\\omega\_\{i\}=\\frac\{1\}\{L\}\\sum\_\{l=1\}^\{L\}\\mathbb\{I\}\(\\pi\_\{\\theta\}^\{\(\\mathcal\{Q\}\)\}\(v\_\{i\}^\{l\}\)\\neq\\text\{\`\`T''\}\)\(13\)𝐦∈\{0,1\}N,mi=\{1ifωi≥γ0otherwise\\displaystyle\\mathbf\{m\}\\in\\\{0,1\\\}^\{N\},\\quad m\_\{i\}=\\begin\{cases\}1&\\text\{if \}\\omega\_\{i\}\\geq\\gamma\\\\ 0&\\text\{otherwise\}\\end\{cases\}\(14\)whereπθ\(𝒬\)\(vil\)\\pi\_\{\\theta\}^\{\(\\mathcal\{Q\}\)\}\(v\_\{i\}^\{l\}\)is the predicted statesils\_\{i\}^\{l\}at thell\-th inference for nodeviv\_\{i\},𝕀\\mathbb\{I\}is the indicator function, andγ\\gammais the survival threshold\.

With this mask, we construct the binary mask matrix𝐌\(𝒬\)=diag\(𝐦\)∈ℝN×N\\mathbf\{M\}^\{\(\\mathcal\{Q\}\)\}=\\text\{diag\}\(\\mathbf\{m\}\)\\in\\mathbb\{R\}^\{N\\times N\}and rewrite the state\-aware adjacency matrix:

𝒜~eff=𝐌\(𝒬\)⊙𝒜~⊙𝐌\(𝒬\)⊤\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}=\\mathbf\{M\}^\{\(\\mathcal\{Q\}\)\}\\odot\\tilde\{\\mathcal\{A\}\}\\odot\\mathbf\{M\}^\{\(\\mathcal\{Q\}\)^\{\\top\}\}\(15\)where𝒜~eff∈ℝNA×NA\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}\\in\\mathbb\{R\}^\{N\_\{A\}\\times N\_\{A\}\}, setting rows and columns for masked nodes to zero vectors\. Here,NA=∑imiN\_\{A\}=\\sum\_\{i\}m\_\{i\}denotes the number of active nodes after removing the “Terminated” nodes, and⊙\\odotis the element\-wise multiplication\. The training objective for edge optimization rewrites Eq\.[4](https://arxiv.org/html/2605.17348#S3.E4), balancing task performance with graph sparsity:

arg⁡max𝒜~eff𝔼𝒢′∼𝔾eff\[μ\(𝒢′\)\]⏟Performance−rank\(𝒜~eff\)⏟Sparsity\\underset\{\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}\}\{\\arg\\max\}\\;\\underbrace\{\\mathbb\{E\}\_\{\\mathcal\{G\}^\{\\prime\}\\sim\\mathbb\{G\}\_\{\\text\{eff\}\}\}\[\\mu\(\\mathcal\{G\}^\{\{\}^\{\\prime\}\}\)\]\}\_\{\\text\{Performance\}\}\-\\underbrace\{\\text\{rank\}\(\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}\)\}\_\{\\text\{Sparsity\}\}\(16\)where𝔾eff\\mathbb\{G\}\_\{\\text\{eff\}\}denotes the feasible domain of graph samples after masking, andμ\(𝒢′\)\\mu\(\\mathcal\{G\}^\{\{\}^\{\\prime\}\}\)is a task performance evaluation \(e\.g\., APIs\), rendering the loss function non\-differentiable\. Hence, we employ unbiased policy gradientWilliams \([1992](https://arxiv.org/html/2605.17348#bib.bib5)\)to approximate this objective, using the weighted average performance ofMMsamplesZhanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib6)\):

∇𝒜~eff𝔼𝒢′∼𝔾eff\[μ\(𝒢′\)\]≈1M∑m=1Mμ\(𝒢m′\)∇𝒜~efflog⁡\(P𝒜~eff\(𝒢m′\)\)\\nabla\_\{\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}\}\\mathbb\{E\}\_\{\\mathcal\{G\}^\{\\prime\}\\sim\\mathbb\{G\}\_\{\\text\{eff\}\}\}\[\\mu\(\\mathcal\{G\}^\{\{\}^\{\\prime\}\}\)\]\\\\ \\approx\\frac\{1\}\{M\}\\sum\_\{m=1\}^\{M\}\\mu\(\\mathcal\{G\}\_\{m\}^\{\{\}^\{\\prime\}\}\)\\nabla\_\{\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}\}\\log\\left\(P\_\{\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}\}\(\\mathcal\{G\}\_\{m\}^\{\{\}^\{\\prime\}\}\)\\right\)\(17\)whereP𝒜~eff\(𝒢m′\)P\_\{\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}\}\(\\mathcal\{G\}\_\{m\}^\{\{\}^\{\\prime\}\}\)is the probability of sampling effective subgraph𝒢m′=\(𝒱m′,ℰm𝒯,ℰm𝒮\)\\mathcal\{G\}\_\{m\}^\{\{\}^\{\\prime\}\}=\(\\mathcal\{V\}\_\{m\}^\{\{\}^\{\\prime\}\},\\mathcal\{E\}\_\{m\}^\{\\mathcal\{T\}\},\\mathcal\{E\}\_\{m\}^\{\\mathcal\{S\}\}\):

P𝒜~eff\(𝒢m′\)=∏t=1T∏\(vi,vj\)∈ℰm\(t\),𝒮𝒜~𝒮eff,\(t\)\[i,j\]×∏t=2T∏\(vi,vj\)∈ℰm\(t\),𝒯𝒜~𝒯eff,\(t\)\[i,j\]P\_\{\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}\}\(\\mathcal\{G\}\_\{m\}^\{\{\}^\{\\prime\}\}\)=\\prod\_\{t=1\}^\{T\}\\prod\_\{\\left\(v\_\{i\},v\_\{j\}\\right\)\\in\\mathcal\{E\}\_\{m\}^\{\(t\),\\mathcal\{S\}\}\}\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{S\}\}^\{\\text\{eff\},\(t\)\}\[i,j\]\\times\\\\ \\prod\_\{t=2\}^\{T\}\\prod\_\{\\left\(v\_\{i\},v\_\{j\}\\right\)\\in\\mathcal\{E\}\_\{m\}^\{\(t\),\\mathcal\{T\}\}\}\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{T\}\}^\{\\text\{eff\},\(t\)\}\[i,j\]\(18\)The second term in the objective constrains the sparsity of the communication graph\. To relax the NP\-hard rank function, we replace it with the nuclear normZhanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib6)\):

arg⁡min𝒜~eff=\{𝒜~𝒮eff,𝒜~𝒯eff\}∑t=1T‖𝒜~𝒮eff,\(t\)‖∗\+∑t=2T‖𝒜~𝒯eff,\(t\)‖∗\\underset\{\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}=\\\{\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{S\}\}^\{\\text\{eff\}\},\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{T\}\}^\{\\text\{eff\}\}\\\}\}\{\\arg\\min\}\\sum\_\{t=1\}^\{T\}\\\|\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{S\}\}^\{\\text\{eff\},\(t\)\}\\\|\_\{\*\}\+\\sum\_\{t=2\}^\{T\}\\\|\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{T\}\}^\{\\text\{eff\},\(t\)\}\\\|\_\{\*\}\(19\)where the nuclear norm enables gradient\-based optimization of graph sparsity viarank\(𝒜~eff\)\\text\{rank\}\(\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}\)\.

Dataset→\\quad\\rightarrowVar\.NSFlex\.StateMMLUGSM8KAQuATruthfulQASVAMPHumanEvalAvg\.Models↓\\quad\\downarrowBase model: Llama3\-8B\-InstructVanilla✗✗53\.5970\.2341\.6757\.5975\.0053\.3358\.57CoT✗✗56\.86\(↑3\.27\)70\.47\(↑0\.24\)43\.75\(↑2\.08\)59\.25\(↑1\.66\)76\.17\(↑1\.17\)54\.17\(↑0\.84\)60\.11\(↑1\.54\)SC \(CoT\)✗✗60\.45\(↑6\.86\)71\.59\(↑1\.36\)46\.21\(↑4\.54\)59\.07\(↑1\.48\)78\.03\(↑3\.03\)55\.46\(↑2\.13\)61\.80\(↑3\.23\)MASround=1\{\}\_\{\\text\{round\}=1\}✗✗56\.21\(↑2\.62\)69\.30\(↓0\.93\)45\.29\(↑3\.62\)59\.88\(↑2\.29\)76\.67\(↑1\.67\)48\.33\(↓5\.00\)59\.28\(↑0\.71\)MASround=T\{\}\_\{\\text\{round\}=T\}✗✗60\.13\(↑6\.54\)71\.48\(↑1\.25\)45\.41\(↑3\.74\)60\.14\(↑2\.55\)77\.56\(↑2\.56\)49\.17\(↓4\.16\)60\.65\(↑2\.08\)G\-Designer✗✗60\.27\(↑6\.68\)70\.59\(↑0\.36\)46\.82\(↑5\.15\)62\.43\(↑4\.84\)80\.03\(↑5\.03\)52\.53\(↓0\.80\)62\.28\(↑3\.71\)AgentPrune✗✗60\.78\(↑7\.19\)71\.02\(↑0\.79\)47\.22\(↑5\.55\)62\.83\(↑5\.24\)78\.34\(↑3\.34\)51\.67\(↓1\.66\)61\.98\(↑3\.41\)ARG\-Designer✓✗61\.49\(↑7\.90\)72\.74\(↑2\.51\)46\.23\(↑4\.56\)61\.78\(↑4\.19\)79\.38\(↑4\.38\)53\.62\(↑0\.29\)62\.54\(↑3\.97\)AgentDropout✓✗62\.75\(↑9\.16\)73\.13\(↑2\.90\)47\.78\(↑6\.11\)63\.62\(↑6\.03\)80\.11\(↑5\.11\)55\.84\(↑2\.51\)63\.87\(↑5\.30\)AgentRevive✓✓64\.30\(↑10\.71\)75\.81\(↑5\.58\)50\.76\(↑9\.09\)65\.49\(↑7\.90\)82\.68\(↑7\.68\)58\.15\(↑4\.82\)66\.20\(↑7\.63\)Base model: Deepseek\-V3\-671B\-InstructVanilla✗✗84\.9794\.6884\.5864\.7093\.6788\.4385\.17CoT✗✗84\.31\(↓0\.66\)95\.15\(↑0\.47\)85\.42\(↑0\.84\)64\.99\(↑0\.29\)93\.94\(↑0\.27\)89\.26\(↑0\.83\)85\.51\(↑0\.34\)SC \(CoT\)✗✗88\.79\(↑3\.82\)95\.17\(↑0\.49\)87\.85\(↑3\.27\)65\.16\(↑0\.46\)94\.55\(↑0\.88\)90\.61\(↑2\.18\)87\.02\(↑1\.85\)AutoGen✗✗88\.03\(↑3\.06\)94\.96\(↑0\.28\)86\.71\(↑2\.13\)66\.63\(↑1\.93\)93\.82\(↑0\.15\)89\.26\(↑0\.83\)86\.57\(↑1\.40\)AgentVerse✗✗87\.65\(↑2\.68\)95\.68\(↑1\.00\)85\.90\(↑1\.32\)65\.89\(↑1\.19\)94\.21\(↑0\.54\)88\.94\(↑0\.51\)86\.38\(↑1\.21\)MASround=1\{\}\_\{\\text\{round\}=1\}✗✗89\.98\(↑5\.01\)95\.54\(↑0\.86\)86\.67\(↑2\.09\)64\.34\(↓0\.36\)93\.50\(↓0\.17\)89\.17\(↑0\.74\)86\.53\(↑1\.36\)MASround=T\{\}\_\{\\text\{round\}=T\}✗✗89\.54\(↑4\.57\)95\.49\(↑0\.81\)87\.50\(↑2\.92\)66\.05\(↑1\.35\)94\.33\(↑0\.66\)89\.26\(↑0\.83\)87\.03\(↑1\.86\)G\-Designer✗✗88\.74\(↑3\.77\)94\.93\(↑0\.25\)87\.61\(↑3\.03\)68\.70\(↑4\.00\)94\.75\(↑1\.08\)90\.20\(↑1\.77\)87\.49\(↑2\.32\)AgentPrune✗✗90\.20\(↑5\.23\)95\.49\(↑0\.81\)87\.92\(↑3\.34\)69\.23\(↑4\.53\)95\.00\(↑1\.33\)90\.91\(↑2\.48\)88\.13\(↑2\.96\)ARG\-Designer✓✗90\.04\(↑5\.07\)95\.71\(↑1\.03\)87\.96\(↑3\.38\)68\.44\(↑3\.74\)94\.98\(↑1\.31\)91\.18\(↑2\.75\)88\.05\(↑2\.88\)AgentDropout✓✗90\.85\(↑5\.88\)95\.63\(↑0\.95\)88\.33\(↑3\.75\)70\.15\(↑5\.45\)95\.79\(↑2\.12\)91\.74\(↑3\.31\)88\.75\(↑3\.58\)AgentRevive✓✓91\.60\(↑6\.63\)96\.48\(↑1\.80\)88\.85\(↑4\.27\)72\.36\(↑7\.66\)97\.07\(↑3\.40\)93\.52\(↑5\.09\)90\.15\(↑4\.98\)Table 2:Results comparison betweenAgentReviveand baselines\.Var\. NSandFlex\. Statedenote the Variable Node Size and Flexible State MAS described in Table[1](https://arxiv.org/html/2605.17348#S3.T1)\. The Qwen2\.5\-72B results are shown in Appendix[D](https://arxiv.org/html/2605.17348#A4)\.
### 4\.4Training

OurAgentRevivemodel is trained iteratively in two stages\. The differentiation process for the edge loss functionℒedge\\mathcal\{L\}\_\{\\text\{edge\}\}inStage 2is described above\. ForStage 1, we compute the state lossℒstate\\mathcal\{L\}\_\{\\text\{state\}\}using the REINFORCE algorithmWilliams \([1992](https://arxiv.org/html/2605.17348#bib.bib5)\):

∇θℒstate\(θ\)≈1M∑m=1M∑t=1T∇θlog⁡πθ\(si\(t\)\|ℳvi\(t\)\)⋅\(R\(τ\)−b\)\\nabla\_\{\\theta\}\\mathcal\{L\}\_\{state\}\(\\theta\)\\approx\\\\ \\frac\{1\}\{M\}\\sum\_\{m=1\}^\{M\}\\sum\_\{t=1\}^\{T\}\\nabla\_\{\\theta\}\\log\\pi\_\{\\theta\}\(s\_\{i\}^\{\(t\)\}\|\\mathcal\{M\}\_\{v\_\{i\}\}^\{\(t\)\}\)\\cdot\(R\(\\tau\)\-b\)\(20\)wherebbis a baseline value for variance reduction\. The overall training procedure forAgentReviveconsists of first optimizingℒstate\\mathcal\{L\}\_\{\\text\{state\}\}, followed by optimization ofℒedge\\mathcal\{L\}\_\{\\text\{edge\}\}\. Refer to Appendix[C](https://arxiv.org/html/2605.17348#A3)for detailed algorithmic training and inference processes\.

## 5Experiments

Due to space limitations, we describe datasets, baselines, and implementation in Appendix[B](https://arxiv.org/html/2605.17348#A2)\.

### 5\.1Main Results

Table[2](https://arxiv.org/html/2605.17348#S4.T2)shows the overall performance in various task settings\. Key observations include:\[1\]Vanilla methods \(CoTWeiet al\.\([2022](https://arxiv.org/html/2605.17348#bib.bib18)\), SCWanget al\.\([2023](https://arxiv.org/html/2605.17348#bib.bib34)\)\) outperform standard prompting but show limited gains on complex tasks due to single\-agent constraints\.\[2\]Fixed\-topology MAS methods \(e\.g\.,MASround=1\\text\{MAS\}\_\{\\text\{round\}=1\},MASround=T\\text\{MAS\}\_\{\\text\{round\}=T\}, AutoGenWuet al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib32)\), AgentVerseChenet al\.\([2024b](https://arxiv.org/html/2605.17348#bib.bib20)\)\) underperform single\-agent prompting on MMLUHendryckset al\.\([2021](https://arxiv.org/html/2605.17348#bib.bib31)\)and HumanEvalChenet al\.\([2021](https://arxiv.org/html/2605.17348#bib.bib30)\), due to inefficient communication overheadXuanet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib27)\); Changet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib25)\)and error propagation in multi\-round setups\.\[3\]Graph\-based dynamic MAS methods improve over vanilla MAS via adaptive topologies, though hard pruning risks the permanent loss of potentially useful agents\.\[4\]AgentReviveachieves SOTA results, with notable gains on TruthfulQALinet al\.\([2022](https://arxiv.org/html/2605.17348#bib.bib21)\), by dynamically managing agent states to mitigate hallucinations while preserving recovery potential\.333The detailed analysis is shown in Appendix[D](https://arxiv.org/html/2605.17348#A4)\.

Dataset→\\quad\\rightarrowMMLUGSM8KTQAAvg\.Models↓\\quad\\downarrowBase model: Llama3\-8B\-InstructAgentRevive64\.3075\.8165\.4968\.53w/o SEO62\.0873\.5962\.3766\.01w/o SPL59\.27\(↓5\.03\)69\.85\(↓5\.96\)59\.28\(↓6\.21\)62\.80\(↓5\.73\)w/o SMP61\.2571\.8160\.7964\.62Base model: Qwen2\.5\-72B\-InstructAgentRevive86\.0994\.8772\.0584\.34w/o SEO83\.8792\.9568\.7681\.86w/o SPL82\.70\(↓3\.39\)92\.11\(↓2\.76\)65\.82\(↓6\.23\)80\.21\(↓4\.13\)w/o SMP83\.2492\.6167\.4381\.09Table 3:Ablation study of key state\-aware learning modules inAgentRevive\. The red down arrow\(↓\)\{\{\\color\[rgb\]\{1,0,0\}\(\\downarrow\)\}\}indicates the greatest performance drop\. “TQA”: TruthfulQA\.
### 5\.2Ablation Study

We conduct an ablation study analyzing each key component in Table[3](https://arxiv.org/html/2605.17348#S5.T3)\. \(1\)w/o SEO: Removing edge masking for “Terminated” agents causes performance drops \(e\.g\., 68\.53→66\.01 for Llama3\-8BDubeyet al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib26)\)\), confirming state\-based sparsity optimization is essential\. \(2\)w/o SPL: We use replaced random state transitions with learned policy severely degrades performance, underscoring that learned state management is core to mitigating unreliable agents, especially on TruthfulQA\. \(3\)w/o SMP: Propagating messages from all states introduces noise and reduces performance, validating SMP as a critical filter for maintaining coherent collaboration\.

![Refer to caption](https://arxiv.org/html/2605.17348v1/x3.png)Figure 3:Visualization of performance and token consumption for different multi\-agent communication topologies across MMLU, GSM8K, and HumanEval\. The number of tokens consumed is represented by the sum of prompt tokens and completion tokens generated by the agents on the horizontal axis\.
### 5\.3Detailed Analysis

#### 5\.3\.1Trade\-off between Performance and Token Cost

As shown in Fig\.[3](https://arxiv.org/html/2605.17348#S5.F3),AgentReviveachieves superior Pareto efficiency on MMLU, GSM8K, and HumanEval with Llama3\-8B, attaining competitive performance with modest token overhead\. This efficiency stems from our state\-aware dynamic management: “Standby” agents are suspended and their compressed historical outputs are reused \(Eq\.[9](https://arxiv.org/html/2605.17348#S4.E9)\), reducing token footprint while preserving contribution potential\. In contrast, fixed\-topology MAS \(e\.g\.,MASround=T\\text\{MAS\}\_\{\\text\{round\}=T\}\) incur high token costs from rigid, fully\-connected templates\. While graph\-pruning methods reduce redundancy, their token savings remain suboptimal as hard pruning occurs only after costly multi\-round discussions\.

TransitionTypeSelf\-Riskfrisk\(t\)f\_\{\\text\{risk\}\}^\{\(t\)\}Message GainG\(𝒵i\(t\)\)G\(\\mathcal\{Z\}\_\{i\}^\{\(t\)\}\)Roundtt“A”→\\rightarrow“S”0\.82\-0\.110\.05“A”→\\rightarrow“T”0\.75\-0\.240\.31“S”→\\rightarrow“A”\-0\.090\.69\-0\.12“S”→\\rightarrow“T”0\.58\-0\.410\.27Table 4:Attribution of state transitions based on feature importance from logistic regression\. Positive values indicate that higher feature values promote the transition\.ModelMMLUAcc\. \(%\)PerformanceImprovementTokensNumber \(M\)TokensSavingAvg\. InferenceTime \(s/sample\)TrainingTime \(GPU hours\)MASround=T\{\}\_\{\\text\{round\}=T\}60\.13–1\.99–2\.21–ARG\-Designer61\.49\+2\.3%1\.71−\-14\.1%3\.8228\.5AgentDropout62\.75\+4\.4%1\.53−\-23\.1%1\.6215\.2AgentRevive64\.30\(Δ=1\.55\\Delta=1\.55\)\+6\.9%1\.32\(Δ=0\.21\\Delta=0\.21M\)−\-33\.7%1\.75\(Δ=0\.13\\Delta=0\.13s\)20\.8\(Δ=5\.6\\Delta=5\.6GB\)Table 5:Comprehensive trade\-off analysis of performance and computational machine resource cost\. Both performance improvement and number of tokens saved are considered relative toMASround=T\{\}\_\{\\text\{round\}=T\}\.
#### 5\.3\.2Attribution Analysis of States Changes

To quantitatively interpret the decision\-making process of our state transition policyπθ\\pi\_\{\\theta\}, we conduct an attribution analysis examining the correlation between state changes and key input features\. We posit that state transitions are primarily driven by two factors: agent’s response risk and the quality of received messages\. Specifically, we collect an extensive log of state transition events\(si\(t−1\)→si\(t\)\)\(s\_\{i\}^\{\(t\-1\)\}\\rightarrow s\_\{i\}^\{\(t\)\}\)across all testing benchmarks\. For each transition, we extract the following feature set:

- •Self\-Risk\(frisk\(t\)f\_\{\\text\{risk\}\}^\{\(t\)\}\): The hallucination risk score \(Eq\.[11](https://arxiv.org/html/2605.17348#S4.E11)\) of the agent’s response at roundtt\.
- •Message Information Gain: The informational quality of received messagesG\(𝒵i\(t\)\)G\(\\mathcal\{Z\}\_\{i\}^\{\(t\)\}\), quantified by the negative entropy of the message set,G\(𝒵i\(t\)\)=−ℍ\(𝒵i\(t\)\)G\(\\mathcal\{Z\}\_\{i\}^\{\(t\)\}\)=\-\\mathbb\{H\}\(\\mathcal\{Z\}\_\{i\}^\{\(t\)\}\)\. Lower entropy \(higher gain\) indicates more consistent and confident information from neighbors\.
- •Round Index\(tt\): The communication round\.

We then train a multi\-class logistic regression classifier to predict the state transition type \(e\.g\., “Active”→\\rightarrow“Standby”\)\. The standardized coefficients reveal which factors drive specific state changes\. As shown in Table[4](https://arxiv.org/html/2605.17348#S5.T4), yields two critical insights: \(1\)Risk\-Driven Standby/Termination: The transition from “Active” to “Standby” or “Terminated” is predominantly and positively correlated with high Self\-Riskfrisk\(t\)f\_\{\\text\{risk\}\}^\{\(t\)\}\. This confirms that our policyπθ\\pi\_\{\\theta\}learns to proactively identify and suspend agents that are generating unreliable or hallucinatory content\. \(2\)Message\-Driven Reactivation: Conversely, the transition from “Standby” back to “Active” is most strongly correlated with high message information gainG\(𝒵i\(t\)\)G\(\\mathcal\{Z\}\_\{i\}^\{\(t\)\}\)\. It indicates that the policy effectively identifies when the collaborative environment has evolved\. Through consistent messages from other neighboring agents, enabling it favorable to re\-engage a previously “Standby” agent\. This demonstrates the system’s capability to opportunistically revive “zombie” agents when the context becomes conducive to their contribution\. However, the round indexttshows a weaker but positive correlation with “Terminated” events, suggesting a tendency to permanently prune agents that remain unreliable over multiple rounds\.

#### 5\.3\.3Computational Resources Analysis

We evaluate the performance and machine resource trade\-off betweenAgentReviveand three typical MAS baselines\. All comparisons are based on identical experimental environments using MMLU samples with Llama3\-8B and a single A100 GPU\.

As shown in Table[5](https://arxiv.org/html/2605.17348#S5.T5),AgentRevivedemonstrates superior overall efficiency compared to three strong MAS baseline types\. While the fixed\-topology MASround=T\{\}\_\{\\text\{round=T\}\}serves as a computationally expensive baseline and the node\-generative ARG\-Designer incurs high inference latency from its complex network,AgentRevivestrikes an optimal balance\. It significantly outperforms efficient hard\-pruning method AgentDropout in both final accuracy and token savings, achieving this with only a modest increase in inference time \(\+0\.13\+0\.13s\) due to the state prediction using lightweight policy network\. Since high\-precision tasks targeting real\-world scenarios only require one model training before deployment to users, the training GPU cost can be negligible\. For application scenarios requiring long\-term operation and sensitive to inference costs \(especially token consumption\), the substantial savings achieved by AgentRevive can quickly offset its initial training cost\.

#### 5\.3\.4Robustness Verification

Due to the space limitation, we present the detailed robustness verification analysis in Appendix[D\.2](https://arxiv.org/html/2605.17348#A4.SS2)\.

In summary, we evaluate theAgentRevive’s and strong baselines through two aspects: \(1\) Prompt Attack, where input and response prompts are adversarially manipulated, and \(2\) Different Graph Structure Initialization, testing performance under varied topologies \(e\.g\., Layered and Random\)\. Results showAgentRevivesustains minimal performance degradation under attacks and maintains stable efficiency across graph structures, demonstrating superior resilience compared to pruning\-based and fixed\-topology baselines\.

## 6Conclusion

In this paper, we proposeAgentRevive, a Markov state\-aware framework for resilient multi\-agent evolution\. By modeling agent collaboration through soft state transitions, our approach avoids the irreversibility of hard pruning\. It integrates state\-aware policy learning, which dynamically manages agent states using a risk\-aware policy\. Additionally, state\-aware edge optimization sparsifies the communication graph by masking terminated agents and reusing compressed outputs for “Standby” agents\. Extensive evaluations show thatAgentReviveachieves superior task performance while maintaining competitive token efficiency\.

## Limitations

Despite the promising results achieved byAgentRevive, our work has several limitations that warrant further investigation\. Due to constraints in computational resources, our empirical evaluation was primarily conducted with a maximum of 5 agents\. We anticipate that scaling to scenarios involving a larger number of agents would provide a more comprehensive stress test of our state transition policy’s scalability and efficiency\. Additionally, the configuration of our Markov state\-aware framework involves several key hyperparameters, such as the survival thresholdγ\\gammaand the risk balance coefficientηrisk\\eta\_\{\\text\{risk\}\}\. While we foundγ=0\.6\\gamma=0\.6andηrisk=0\.5\\eta\_\{\\text\{risk\}\}=0\.5to work well in our experiments, these values were not extensively explored across all possible task types and agent compositions\. The optimal configuration may vary across applications, suggesting a need for more adaptive or automated hyperparameter tuning strategies in future work\. Finally, the current implementation of the state policy networkπθ\\pi\_\{\\theta\}uses a relatively simple MLP architecture for stable and sample\-efficient training\. Exploring more powerful sequence models or graph\-aware architectures for state encoding could potentially capture more complex, long\-range dependencies in agent interaction histories, possibly leading to more refined state management\.

## Acknowledgments

This work was supported by the National Natural Science Foundation of China \(Grant No\. 62506110\)\. It was also supported by the Natural Science Foundation of Anhui Province, China \(Grant No\. 2508085QF227\) and the Hefei University of Technology Scientific Research Innovation Start\-up Special Project Type A \(Grant No\. JZ2025HGQA0137\)\.

## References

- L\. Boyi, Z\. Zhao, D\. Lee, and G\. Wang \(2025\)Adaptive graph pruning for multi\-agent communication\.CoRRabs/2506\.02951\.External Links:[Link](https://doi.org/10.48550/arXiv.2506.02951)Cited by:[§1](https://arxiv.org/html/2605.17348#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.17348#S2.SS2.p1.1),[§3](https://arxiv.org/html/2605.17348#S3.p1.1)\.
- M\. Cemri, M\. Z\. Pan, S\. Yang, L\. A\. Agrawal, B\. Chopra, R\. Tiwari, K\. Keutzer, A\. G\. Parameswaran, D\. Klein, K\. Ramchandran, M\. Zaharia, J\. E\. Gonzalez, and I\. Stoica \(2025\)Why do multi\-agent LLM systems fail?\.CoRRabs/2503\.13657\.External Links:[Link](https://doi.org/10.48550/arXiv.2503.13657)Cited by:[1st item](https://arxiv.org/html/2605.17348#S1.I1.i1.p1.1)\.
- C\. Chang, Z\. Jiang, V\. Rakesh, M\. Pan, C\. M\. Yeh, G\. Wang, M\. Hu, Z\. Xu, Y\. Zheng, M\. Das, and N\. Zou \(2025\)MAIN\-RAG: multi\-agent filtering retrieval\-augmented generation\.InACL,pp\. 2607–2622\.External Links:[Link](https://aclanthology.org/2025.acl-long.131/)Cited by:[§D\.1](https://arxiv.org/html/2605.17348#A4.SS1.p2.3),[§5\.1](https://arxiv.org/html/2605.17348#S5.SS1.p1.2)\.
- L\. Chen, J\. Q\. Davis, B\. Hanin, P\. Bailis, I\. Stoica, M\. Zaharia, and J\. Zou \(2024a\)Are more LLM calls all you need? towards scaling laws of compound inference systems\.CoRRabs/2403\.02419\.External Links:[Link](https://doi.org/10.48550/arXiv.2403.02419)Cited by:[§2\.1](https://arxiv.org/html/2605.17348#S2.SS1.p1.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. de Oliveira Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, A\. Ray, R\. Puri, G\. Krueger, M\. Petrov, H\. Khlaaf, G\. Sastry, P\. Mishkin, B\. Chan, S\. Gray, N\. Ryder, M\. Pavlov, A\. Power, L\. Kaiser, M\. Bavarian, C\. Winter, P\. Tillet, F\. P\. Such, D\. Cummings, M\. Plappert, F\. Chantzis, E\. Barnes, A\. Herbert\-Voss, W\. H\. Guss, A\. Nichol, A\. Paino, N\. Tezak, J\. Tang, I\. Babuschkin, S\. Balaji, S\. Jain, W\. Saunders, C\. Hesse, A\. N\. Carr, J\. Leike, J\. Achiam, V\. Misra, E\. Morikawa, A\. Radford, M\. Knight, M\. Brundage, M\. Murati, K\. Mayer, P\. Welinder, B\. McGrew, D\. Amodei, S\. McCandlish, I\. Sutskever, and W\. Zaremba \(2021\)Evaluating large language models trained on code\.CoRRabs/2107\.03374\.External Links:[Link](https://arxiv.org/abs/2107.03374)Cited by:[2nd item](https://arxiv.org/html/2605.17348#A2.I1.i2.p1.1),[§D\.1](https://arxiv.org/html/2605.17348#A4.SS1.p2.3),[§5\.1](https://arxiv.org/html/2605.17348#S5.SS1.p1.2)\.
- W\. Chen, Y\. Su, J\. Zuo, C\. Yang, C\. Yuan, C\. Chan, H\. Yu, Y\. Lu, Y\. Hung, C\. Qian, Y\. Qin, X\. Cong, R\. Xie, Z\. Liu, M\. Sun, and J\. Zhou \(2024b\)AgentVerse: facilitating multi\-agent collaboration and exploring emergent behaviors\.InICLR,External Links:[Link](https://openreview.net/forum?id=EHg5GDnyq1)Cited by:[2nd item](https://arxiv.org/html/2605.17348#A2.I2.i2.p1.2),[§D\.1](https://arxiv.org/html/2605.17348#A4.SS1.p2.3),[§5\.1](https://arxiv.org/html/2605.17348#S5.SS1.p1.2)\.
- W\. Chen, Y\. Su, J\. Zuo, C\. Yang, C\. Yuan, C\. Qian, C\. Chan, Y\. Qin, Y\. Lu, R\. Xie, Z\. Liu, M\. Sun, and J\. Zhou \(2023\)AgentVerse: facilitating multi\-agent collaboration and exploring emergent behaviors in agents\.CoRRabs/2308\.10848\.External Links:[Link](https://doi.org/10.48550/arXiv.2308.10848)Cited by:[§2\.1](https://arxiv.org/html/2605.17348#S2.SS1.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.CoRRabs/2110\.14168\.External Links:[Link](https://arxiv.org/abs/2110.14168)Cited by:[1st item](https://arxiv.org/html/2605.17348#A2.I1.i1.p1.1)\.
- DeepSeek\-AI, A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan, D\. Dai, D\. Guo, D\. Yang, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Bao, H\. Xu, H\. Wang, H\. Zhang, H\. Ding, H\. Xin, H\. Gao, H\. Li, H\. Qu, J\. L\. Cai, J\. Liang, J\. Guo, J\. Ni, J\. Li, J\. Wang, J\. Chen, J\. Chen, J\. Yuan, J\. Qiu, J\. Li, J\. Song, K\. Dong, K\. Hu, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, L\. Zhao, L\. Wang, L\. Zhang, M\. Li, M\. Wang, M\. Zhang, M\. Zhang, M\. Tang, M\. Li, N\. Tian, P\. Huang, P\. Wang, P\. Zhang, Q\. Wang, Q\. Zhu, Q\. Chen, Q\. Du, R\. J\. Chen, R\. L\. Jin, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. Xu, R\. Zhang, R\. Chen, S\. S\. Li, S\. Lu, S\. Zhou, S\. Chen, S\. Wu, S\. Ye, S\. Ye, S\. Ma, S\. Wang, S\. Zhou, S\. Yu, S\. Zhou, S\. Pan, T\. Wang, T\. Yun, T\. Pei, T\. Sun, W\. L\. Xiao, W\. Zeng, W\. Zhao, W\. An, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, X\. Q\. Li, X\. Jin, X\. Wang, X\. Bi, X\. Liu, X\. Wang, X\. Shen, X\. Chen, X\. Zhang, X\. Chen, X\. Nie, X\. Sun, X\. Wang, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yu, X\. Song, X\. Shan, X\. Zhou, X\. Yang, X\. Li, X\. Su, X\. Lin, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. X\. Zhu, Y\. Zhang, Y\. Xu, Y\. Xu, Y\. Huang, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Li, Y\. Wang, Y\. Yu, Y\. Zheng, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Tang, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Wu, Y\. Ou, Y\. Zhu, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Zha, Y\. Xiong, Y\. Ma, Y\. Yan, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Z\. F\. Wu, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Huang, Z\. Zhang, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Gou, Z\. Ma, Z\. Yan, Z\. Shao, Z\. Xu, Z\. Wu, Z\. Zhang, Z\. Li, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Gao, and Z\. Pan \(2025\)DeepSeek\-v3 technical report\.External Links:2412\.19437,[Link](https://arxiv.org/abs/2412.19437)Cited by:[1st item](https://arxiv.org/html/2605.17348#A2.I2.i1.p1.1)\.
- Y\. Du, S\. Li, A\. Torralba, J\. B\. Tenenbaum, and I\. Mordatch \(2024\)Improving factuality and reasoning in language models through multiagent debate\.InICML,External Links:[Link](https://openreview.net/forum?id=zj7YuTE4t8)Cited by:[§2\.1](https://arxiv.org/html/2605.17348#S2.SS1.p1.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Rozière, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. M\. Kloumann, I\. Misra, I\. Evtimov, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, and et al\. \(2024\)The llama 3 herd of models\.CoRRabs/2407\.21783\.External Links:[Link](https://doi.org/10.48550/arXiv.2407.21783)Cited by:[1st item](https://arxiv.org/html/2605.17348#A2.I2.i1.p1.1),[§5\.2](https://arxiv.org/html/2605.17348#S5.SS2.p1.1)\.
- B\. Gan, Y\. Zhao, T\. Zhang, J\. Huang, Y\. Li, S\. X\. Teo, C\. Zhang, and W\. Shi \(2025\)MASTER: A multi\-agent system with LLM specialized MCTS\.InNAACL,pp\. 9409–9426\.External Links:[Link](https://doi.org/10.18653/v1/2025.naacl-long.476)Cited by:[§1](https://arxiv.org/html/2605.17348#S1.p2.1)\.
- T\. Guo, X\. Chen, Y\. Wang, R\. Chang, S\. Pei, N\. V\. Chawla, O\. Wiest, and X\. Zhang \(2024\)Large language model based multi\-agents: A survey of progress and challenges\.InIJCAI,pp\. 8048–8057\.External Links:[Link](https://www.ijcai.org/proceedings/2024/890)Cited by:[§1](https://arxiv.org/html/2605.17348#S1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.InICLR,External Links:[Link](https://openreview.net/forum?id=d7KBjmI3GmQ)Cited by:[1st item](https://arxiv.org/html/2605.17348#A2.I1.i1.p1.1),[§D\.1](https://arxiv.org/html/2605.17348#A4.SS1.p2.3),[§D\.3](https://arxiv.org/html/2605.17348#A4.SS3.p1.1),[§5\.1](https://arxiv.org/html/2605.17348#S5.SS1.p1.2)\.
- S\. Hochreiter and J\. Schmidhuber \(1997\)Long short\-term memory\.Neural Computation9\(8\),pp\. 1735–1780\.External Links:ISSN 0899\-7667,[Link](https://doi.org/10.1162/neco.1997.9.8.1735),https://direct\.mit\.edu/neco/article\-pdf/9/8/1735/813796/neco\.1997\.9\.8\.1735\.pdfCited by:[§4\.2\.2](https://arxiv.org/html/2605.17348#S4.SS2.SSS2.p1.7)\.
- S\. Holt, M\. R\. Luyten, and M\. van der Schaar \(2023\)L2MAC: large language model automatic computer for unbounded code generation\.CoRRabs/2310\.02003\.External Links:[Link](https://doi.org/10.48550/arXiv.2310.02003)Cited by:[§2\.1](https://arxiv.org/html/2605.17348#S2.SS1.p1.1)\.
- S\. Hong, M\. Zhuge, J\. Chen, X\. Zheng, Y\. Cheng, J\. Wang, C\. Zhang, Z\. Wang, S\. K\. S\. Yau, Z\. Lin, L\. Zhou, C\. Ran, L\. Xiao, C\. Wu, and J\. Schmidhuber \(2024\)MetaGPT: meta programming for A multi\-agent collaborative framework\.InICLR,External Links:[Link](https://openreview.net/forum?id=VtmBAGCN7o)Cited by:[§2\.1](https://arxiv.org/html/2605.17348#S2.SS1.p1.1)\.
- Y\. Hu, Y\. Cai, Y\. Du, X\. Zhu, X\. Liu, Z\. Yu, Y\. Hou, S\. Tang, and S\. Chen \(2025\)Self\-evolving multi\-agent collaboration networks for software development\.InICLR,External Links:[Link](https://openreview.net/forum?id=4R71pdPBZp)Cited by:[§2\.2](https://arxiv.org/html/2605.17348#S2.SS2.p1.1)\.
- Y\. Ishibashi and Y\. Nishimura \(2024\)Self\-organized agents: A LLM multi\-agent framework toward ultra large\-scale code generation and optimization\.CoRRabs/2404\.02183\.External Links:[Link](https://doi.org/10.48550/arXiv.2404.02183)Cited by:[§2\.1](https://arxiv.org/html/2605.17348#S2.SS1.p1.1)\.
- J\. Ji, R\. Lei, J\. Bi, Z\. Wei, Y\. Lin, X\. Pan, Y\. Li, and B\. Ding \(2024\)Dynamic and textual graph generation via large\-scale llm\-based agent simulation\.CoRRabs/2410\.09824\.External Links:[Link](https://doi.org/10.48550/arXiv.2410.09824)Cited by:[§2\.2](https://arxiv.org/html/2605.17348#S2.SS2.p1.1)\.
- D\. Jiang, X\. Ren, and B\. Y\. Lin \(2023\)LLM\-blender: ensembling large language models with pairwise ranking and generative fusion\.InACL,pp\. 14165–14178\.External Links:[Link](https://doi.org/10.18653/v1/2023.acl-long.792)Cited by:[§2\.1](https://arxiv.org/html/2605.17348#S2.SS1.p1.1)\.
- O\. Khattab, A\. Singhvi, P\. Maheshwari, Z\. Zhang, K\. Santhanam, S\. Vardhamanan, S\. Haq, A\. Sharma, T\. T\. Joshi, H\. Moazam, H\. Miller, M\. Zaharia, and C\. Potts \(2024\)DSPy: compiling declarative language model calls into state\-of\-the\-art pipelines\.InICLR,External Links:[Link](https://openreview.net/forum?id=sY5N0zY5Od)Cited by:[§2\.2](https://arxiv.org/html/2605.17348#S2.SS2.p1.1)\.
- S\. Li, Y\. Liu, Q\. Wen, C\. Zhang, and S\. Pan \(2025\)Assemble your crew: automatic multi\-agent communication topology design via autoregressive graph generation\.CoRRabs/2507\.18224\.External Links:[Link](https://doi.org/10.48550/arXiv.2507.18224)Cited by:[3rd item](https://arxiv.org/html/2605.17348#A2.I2.i3.p1.1),[2nd item](https://arxiv.org/html/2605.17348#A4.I2.i2.p1.1),[§1](https://arxiv.org/html/2605.17348#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.17348#S2.SS2.p1.1),[Table 1](https://arxiv.org/html/2605.17348#S3.T1.1.6.1)\.
- H\. Lin, Y\. Deng, Y\. Gu, W\. Zhang, J\. Ma, S\. Ng, and T\. Chua \(2025a\)FACT\-AUDIT: an adaptive multi\-agent framework for dynamic fact\-checking evaluation of large language models\.InACL,pp\. 360–381\.External Links:[Link](https://aclanthology.org/2025.acl-long.17/)Cited by:[§1](https://arxiv.org/html/2605.17348#S1.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\)TruthfulQA: measuring how models mimic human falsehoods\.InACL,pp\. 3214–3252\.External Links:[Link](https://doi.org/10.18653/v1/2022.acl-long.229)Cited by:[3rd item](https://arxiv.org/html/2605.17348#A2.I1.i3.p1.1),[§D\.1](https://arxiv.org/html/2605.17348#A4.SS1.p2.3),[§5\.1](https://arxiv.org/html/2605.17348#S5.SS1.p1.2)\.
- X\. Lin, Y\. Ning, J\. Zhang, Y\. Dong, Y\. Liu, Y\. Wu, X\. Qi, N\. Sun, Y\. Shang, P\. Cao, L\. Zou, X\. Chen, C\. Zhou, J\. Wu, S\. Pan, B\. Wang, Y\. Cao, K\. Chen, S\. Hu, and L\. Guo \(2025b\)LLM\-based agents suffer from hallucinations: a survey of taxonomy, methods, and directions\.External Links:[Link](https://arxiv.org/abs/2509.18970)Cited by:[§3](https://arxiv.org/html/2605.17348#S3.p1.1),[§4\.2](https://arxiv.org/html/2605.17348#S4.SS2.p1.1)\.
- W\. Ling, D\. Yogatama, C\. Dyer, and P\. Blunsom \(2017\)Program induction by rationale generation: learning to solve and explain algebraic word problems\.InACL,pp\. 158–167\.External Links:[Link](https://doi.org/10.18653/v1/P17-1015)Cited by:[2nd item](https://arxiv.org/html/2605.17348#A2.I1.i2.p1.1)\.
- Z\. Liu, Y\. Zhang, P\. Li, Y\. Liu, and D\. Yang \(2024\)A dynamic llm\-powered agent network for task\-oriented agent collaboration\.External Links:[Link](https://arxiv.org/abs/2310.02170)Cited by:[§2\.2](https://arxiv.org/html/2605.17348#S2.SS2.p1.1)\.
- A\. Patel, S\. Bhattamishra, and N\. Goyal \(2021\)Are NLP models really able to solve simple math word problems?\.InNAACL,pp\. 2080–2094\.External Links:[Link](https://doi.org/10.18653/v1/2021.naacl-main.168)Cited by:[2nd item](https://arxiv.org/html/2605.17348#A2.I1.i2.p1.1)\.
- C\. Qian, W\. Liu, H\. Liu, N\. Chen, Y\. Dang, J\. Li, C\. Yang, W\. Chen, Y\. Su, X\. Cong, J\. Xu, D\. Li, Z\. Liu, and M\. Sun \(2024\)ChatDev: communicative agents for software development\.InACL,pp\. 15174–15186\.External Links:[Link](https://doi.org/10.18653/v1/2024.acl-long.810)Cited by:[§2\.1](https://arxiv.org/html/2605.17348#S2.SS1.p1.1)\.
- C\. Qian, Z\. Xie, Y\. Wang, W\. Liu, K\. Zhu, H\. Xia, Y\. Dang, Z\. Du, W\. Chen, C\. Yang, Z\. Liu, and M\. Sun \(2025\)Scaling large language model\-based multi\-agent collaboration\.InICLR,External Links:[Link](https://openreview.net/forum?id=K3n5jPkrU6)Cited by:[§1](https://arxiv.org/html/2605.17348#S1.p2.1)\.
- S\. Wang, Z\. Tan, Z\. Chen, S\. Zhou, T\. Chen, and J\. Li \(2025a\)AnyMAC: cascading flexible multi\-agent collaboration via next\-agent prediction\.CoRRabs/2506\.17784\.External Links:[Link](https://doi.org/10.48550/arXiv.2506.17784)Cited by:[§1](https://arxiv.org/html/2605.17348#S1.p1.1),[§1](https://arxiv.org/html/2605.17348#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.17348#S2.SS2.p1.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. V\. Le, E\. H\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\)Self\-consistency improves chain of thought reasoning in language models\.InICLR,External Links:[Link](https://openreview.net/forum?id=1PL1NIMMrw)Cited by:[1st item](https://arxiv.org/html/2605.17348#A2.I2.i1.p1.1),[§D\.1](https://arxiv.org/html/2605.17348#A4.SS1.p2.3),[§2\.1](https://arxiv.org/html/2605.17348#S2.SS1.p1.1),[§5\.1](https://arxiv.org/html/2605.17348#S5.SS1.p1.2)\.
- Z\. Wang, Y\. Wang, X\. Liu, L\. Ding, M\. Zhang, J\. Liu, and M\. Zhang \(2025b\)AgentDropout: dynamic agent elimination for token\-efficient and high\-performance llm\-based multi\-agent collaboration\.InACL,pp\. 24013–24035\.External Links:[Link](https://aclanthology.org/2025.acl-long.1170/)Cited by:[Appendix A](https://arxiv.org/html/2605.17348#A1.p2.20),[Appendix A](https://arxiv.org/html/2605.17348#A1.p3.7),[3rd item](https://arxiv.org/html/2605.17348#A2.I2.i3.p1.1),[2nd item](https://arxiv.org/html/2605.17348#A4.I2.i2.p1.1),[§1](https://arxiv.org/html/2605.17348#S1.p1.1),[§1](https://arxiv.org/html/2605.17348#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.17348#S2.SS2.p1.1),[Table 1](https://arxiv.org/html/2605.17348#S3.T1.1.5.1),[§4\.2](https://arxiv.org/html/2605.17348#S4.SS2.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InNeurIPS,External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html)Cited by:[1st item](https://arxiv.org/html/2605.17348#A2.I2.i1.p1.1),[§D\.1](https://arxiv.org/html/2605.17348#A4.SS1.p2.3),[§2\.1](https://arxiv.org/html/2605.17348#S2.SS1.p1.1),[§5\.1](https://arxiv.org/html/2605.17348#S5.SS1.p1.2)\.
- R\. J\. Williams \(1992\)Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.Mach\. Learn\.,pp\. 229–256\.Cited by:[§4\.3](https://arxiv.org/html/2605.17348#S4.SS3.p2.7),[§4\.4](https://arxiv.org/html/2605.17348#S4.SS4.p1.2)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu, A\. H\. Awadallah, R\. W\. White, D\. Burger, and C\. Wang \(2024\)AutoGen: enabling next\-gen LLM applications via multi\-agent conversations\.InCOLM,External Links:[Link](https://openreview.net/forum?id=BAakY1hNKS)Cited by:[§D\.1](https://arxiv.org/html/2605.17348#A4.SS1.p2.3),[§5\.1](https://arxiv.org/html/2605.17348#S5.SS1.p1.2)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, S\. Zhang, E\. Zhu, B\. Li, L\. Jiang, X\. Zhang, and C\. Wang \(2023\)AutoGen: enabling next\-gen LLM applications via multi\-agent conversation framework\.CoRRabs/2308\.08155\.External Links:[Link](https://doi.org/10.48550/arXiv.2308.08155)Cited by:[2nd item](https://arxiv.org/html/2605.17348#A2.I2.i2.p1.2),[§2\.1](https://arxiv.org/html/2605.17348#S2.SS1.p1.1)\.
- V\. D\. Xuan, H\. Vo, D\. Murphy, and H\. D\. Nguyen \(2025\)AgentSGEN: multi\-agent LLM in the loop for semantic collaboration and generation of synthetic data\.CoRRabs/2505\.13466\.External Links:[Link](https://doi.org/10.48550/arXiv.2505.13466)Cited by:[§D\.1](https://arxiv.org/html/2605.17348#A4.SS1.p2.3),[§5\.1](https://arxiv.org/html/2605.17348#S5.SS1.p1.2)\.
- B\. Yan, X\. Zhang, L\. Zhang, L\. Zhang, Z\. Zhou, D\. Miao, and C\. Li \(2025\)Beyond self\-talk: A communication\-centric survey of llm\-based multi\-agent systems\.CoRRabs/2502\.14321\.External Links:[Link](https://doi.org/10.48550/arXiv.2502.14321)Cited by:[§1](https://arxiv.org/html/2605.17348#S1.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2024\)Qwen2\.5 technical report\.CoRRabs/2412\.15115\.External Links:[Link](https://doi.org/10.48550/arXiv.2412.15115),[Document](https://dx.doi.org/10.48550/ARXIV.2412.15115)Cited by:[1st item](https://arxiv.org/html/2605.17348#A2.I2.i1.p1.1)\.
- G\. Zhang, Y\. Yue, Z\. Li, S\. Yun, G\. Wan, K\. Wang, D\. Cheng, J\. X\. Yu, and T\. Chen \(2025a\)Cut the crap: an economical communication pipeline for llm\-based multi\-agent systems\.InICLR,External Links:[Link](https://openreview.net/forum?id=LkzuPorQ5L)Cited by:[Appendix A](https://arxiv.org/html/2605.17348#A1.p2.20),[3rd item](https://arxiv.org/html/2605.17348#A2.I2.i3.p1.1),[2nd item](https://arxiv.org/html/2605.17348#A4.I2.i2.p1.1),[§1](https://arxiv.org/html/2605.17348#S1.p1.1),[§1](https://arxiv.org/html/2605.17348#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.17348#S2.SS2.p1.1),[Table 1](https://arxiv.org/html/2605.17348#S3.T1.1.3.1),[§3](https://arxiv.org/html/2605.17348#S3.p1.1),[§3](https://arxiv.org/html/2605.17348#S3.p3.4),[§4\.2\.2](https://arxiv.org/html/2605.17348#S4.SS2.SSS2.p2.12),[§4\.2](https://arxiv.org/html/2605.17348#S4.SS2.p1.1),[§4\.3](https://arxiv.org/html/2605.17348#S4.SS3.p2.11),[§4\.3](https://arxiv.org/html/2605.17348#S4.SS3.p2.7)\.
- G\. Zhang, Y\. Yue, X\. Sun, G\. Wan, M\. Yu, J\. Fang, K\. Wang, and D\. Cheng \(2025b\)G\-designer: architecting multi\-agent communication topologies via graph neural networks\.External Links:[Link](https://openreview.net/pdf?id=LpE54NUnmO)Cited by:[3rd item](https://arxiv.org/html/2605.17348#A2.I2.i3.p1.1),[Table 1](https://arxiv.org/html/2605.17348#S3.T1.1.4.1)\.
- J\. Zhang, X\. Xu, N\. Zhang, R\. Liu, B\. Hooi, and S\. Deng \(2024a\)Exploring collaboration mechanisms for LLM agents: A social psychology view\.InACL,pp\. 14544–14607\.External Links:[Link](https://doi.org/10.18653/v1/2024.acl-long.782)Cited by:[§2\.1](https://arxiv.org/html/2605.17348#S2.SS1.p1.1)\.
- S\. Zhang, M\. Yin, J\. Zhang, J\. Liu, Z\. Han, J\. Zhang, B\. Li, C\. Wang, H\. Wang, Y\. Chen, and Q\. Wu \(2025c\)Which agent causes task failures and when? on automated failure attribution of LLM multi\-agent systems\.CoRRabs/2505\.00212\.External Links:[Link](https://doi.org/10.48550/arXiv.2505.00212)Cited by:[1st item](https://arxiv.org/html/2605.17348#S1.I1.i1.p1.1)\.
- T\. Zhang, D\. Li, Q\. Chen, C\. Wang, and X\. He \(2025d\)BELLE: A bi\-level multi\-agent reasoning framework for multi\-hop question answering\.InACL,pp\. 4184–4202\.External Links:[Link](https://aclanthology.org/2025.acl-long.211/)Cited by:[§1](https://arxiv.org/html/2605.17348#S1.p1.1)\.
- Y\. Zhang, R\. Sun, Y\. Chen, T\. Pfister, R\. Zhang, and S\. Ö\. Arik \(2024b\)Chain of agents: large language models collaborating on long\-context tasks\.InNeurIPS,External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/ee71a4b14ec26710b39ee6be113d7750-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2605.17348#S1.p2.1)\.
- M\. Zhuge, W\. Wang, L\. Kirsch, F\. Faccio, D\. Khizbullin, and J\. Schmidhuber \(2024\)GPTSwarm: language agents as optimizable graphs\.InICML,External Links:[Link](https://openreview.net/forum?id=uTC9AFXIhg)Cited by:[§1](https://arxiv.org/html/2605.17348#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.17348#S2.SS2.p1.1)\.

NotationDescription𝒢\\mathcal\{G\}state\-aware initial graph𝒢~\\tilde\{\\mathcal\{G\}\}trainable weighted graph𝒱\\mathcal\{V\}set of agent nodesℰ𝒯\\mathcal\{E\}^\{\\mathcal\{T\}\}temporal edgesℰ𝒮\\mathcal\{E\}^\{\\mathcal\{S\}\}spatial edges𝒮\(t\)\\mathcal\{S\}^\{\(t\)\}all agents statessi\(t\)s\_\{i\}^\{\(t\)\}state of agentviv\_\{i\}𝒢eff\(t\)\\mathcal\{G\}\_\{\\text\{eff\}\}^\{\(t\)\}learned effective subgraph𝒱eff\(t\)\\mathcal\{V\}\_\{\\text\{eff\}\}^\{\(t\)\}effective agent nodes𝒵i\(t\)\\mathcal\{Z\}\_\{i\}^\{\(t\)\}aggregated messages𝒜~𝒮\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{S\}\}spatial adjacency matrices𝒜~𝒯\\tilde\{\\mathcal\{A\}\}\_\{\\mathcal\{T\}\}temporal adjacency matricesoi\(t\)o\_\{i\}^\{\(t\)\}response of agentviv\_\{i\}ℳvi\(t\)\\mathcal\{M\}\_\{v\_\{i\}\}^\{\(t\)\}memory state of agentviv\_\{i\}R\(τ\)R\(\\tau\)trajectory rewardfrisk\(t\)f\_\{\\text\{risk\}\}^\{\(t\)\}hallucination risk estimatorωi\\omega\_\{i\}survival rate of agentviv\_\{i\}𝐦\\mathbf\{m\}binary node mask𝒜~eff\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}effective adjacency matrixϕ\(⋅\)\\phi\(\\cdot\)utility function𝒩in𝒮\(vi\)\\mathcal\{N\}\_\{in\}^\{\\mathcal\{S\}\}\(v\_\{i\}\)spatial in\-neighbors of agentviv\_\{i\}𝒩in𝒯\(vi\)\\mathcal\{N\}\_\{in\}^\{\\mathcal\{T\}\}\(v\_\{i\}\)temporal in\-neighbors of agentviv\_\{i\}πθ\\pi\_\{\\theta\}state policy networkfEncf\_\{\\text\{Enc\}\}encoder for agent memory encodingTable 6:All used mathematical notations inAgentReviveframework\.## Appendix ANotations and Task Description

Notations\.We have organized all the mathematical notations and descriptions in this paper in Table[6](https://arxiv.org/html/2605.17348#A0.T6)\.

MAS as Collaboration Graph\.We model a multi\-agent system \(MAS\) as a collaboration graph, represented by a directed acyclic graph \(DAG\)𝒢=\(𝒱,ℰ𝒯,ℰ𝒮\)\\mathcal\{G\}=\(\\mathcal\{V\},\\mathcal\{E\}^\{\\mathcal\{T\}\},\\mathcal\{E\}^\{\\mathcal\{S\}\}\)\. The node set𝒱=\{v1,v2,…,vN\}\\mathcal\{V\}=\\\{\\mathit\{v\}\_\{1\},\\mathit\{v\}\_\{2\},\\dots,\\mathit\{v\}\_\{N\}\\\}corresponds to the set of agents, where each agentvi\\mathit\{v\}\_\{i\}is an LLM instance assigned a specific roleri∈ℛr\_\{i\}\\in\\mathcal\{R\}that defines its function and expertise\. Here,NNdenotes the total number of agents in the graph\. Each agent also maintains an internal statehih\_\{i\}, which records its past actions and interactions\. The directed communication pathways are defined by temporal edgesℰ𝒯⊆𝒱\(t−1\)×𝒱\(t\)\\mathcal\{E\}^\{\\mathcal\{T\}\}\\subseteq\\mathcal\{V\}^\{\(t\-1\)\}\\times\\mathcal\{V\}^\{\(t\)\}and spatial edgesℰ𝒮⊆𝒱\(t\)×𝒱\(t\)\\mathcal\{E\}^\{\\mathcal\{S\}\}\\subseteq\\mathcal\{V\}^\{\(t\)\}\\times\\mathcal\{V\}^\{\(t\)\}Zhanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib6)\); Wanget al\.\([2025b](https://arxiv.org/html/2605.17348#bib.bib7)\)\. A temporal edgeeji𝒯=\(vj,vi\)e\_\{ji\}^\{\\mathcal\{T\}\}=\(\\mathit\{v\}\_\{j\},\\mathit\{v\}\_\{i\}\)indicates that messages received by agentvi\\mathit\{v\}\_\{i\}in roundttare aggregated from agentvj\\mathit\{v\}\_\{j\}in the previous round\(t−1\)\(t\-1\)\. Similarly, a spatial edgeeji𝒮=\(vj,vi\)e\_\{ji\}^\{\\mathcal\{S\}\}=\(\\mathit\{v\}\_\{j\},\\mathit\{v\}\_\{i\}\)denotes that information for agentvi\\mathit\{v\}\_\{i\}in roundttoriginates from neighboring agentvj\\mathit\{v\}\_\{j\}within the same round\. Consequently, the sets of direct predecessor neighbors for agentvi\\mathit\{v\}\_\{i\}via temporal and spatial edges are respectively defined as𝒩in𝒯\(vi\)=\{vj∣\(vj,vi\)∈ℰ𝒯\}\\mathcal\{N\}\_\{in\}^\{\\mathcal\{T\}\}\(\\mathit\{v\}\_\{i\}\)=\\\{\\mathit\{v\}\_\{j\}\\mid\(\\mathit\{v\}\_\{j\},\\mathit\{v\}i\)\\in\\mathcal\{E\}^\{\\mathcal\{T\}\}\\\}and𝒩in𝒮\(vi\)=\{vj∣\(vj,vi\)∈ℰ𝒮\}\\mathcal\{N\}\_\{in\}^\{\\mathcal\{S\}\}\(\\mathit\{v\}\_\{i\}\)=\\\{\\mathit\{v\}\_\{j\}\\mid\(\\mathit\{v\}\_\{j\},\\mathit\{v\}\_\{i\}\)\\in\\mathcal\{E\}^\{\\mathcal\{S\}\}\\\}\.

MAS Collaboration Protocol\.Given an initial collaboration graph𝒢\\mathcal\{G\}, the MAS processes a user query𝒬\\mathcal\{Q\}through a multi\-step collaboration protocol\. This protocol dictates the processing and exchange of information among agents across multiple communication rounds\. The execution order of agents within each round is determined by a topological sort of the nodesWanget al\.\([2025b](https://arxiv.org/html/2605.17348#bib.bib7)\), to ensure agents are activated only after their predecessors have completed\. This process iterates forTTrounds, enabling progressive refinement\. At each roundtt, agentvi\\mathit\{v\}\_\{i\}produces its response𝐌i\(t\)\\mathbf\{M\}\_\{i\}^\{\(t\)\}by querying its LLM using a dynamically constructed prompt𝒫i\(t\)\\mathcal\{P\}\_\{i\}^\{\(t\)\}:

𝐌i\(t\)=LLMi\(𝒫i\(t\)\)\\mathbf\{M\}\_\{i\}^\{\(t\)\}=\\text\{LLM\}\_\{i\}\(\\mathcal\{P\}\_\{i\}^\{\(t\)\}\)\(21\)where the prompt integrates the intrinsic properties of the agent with the temporal and spatial outputs of its predecessors:

𝒫i\(t\)=fpr\(ri\(t\),hi\(t\)⏟System,𝒬,\{m𝒯\(t\),m𝒮\(t\)\}⏟User\)\\mathcal\{P\}\_\{i\}^\{\(t\)\}=f\_\{\\text\{pr\}\}\\left\(\\underbrace\{r\_\{i\}^\{\(t\)\},h\_\{i\}^\{\(t\)\}\}\_\{\\text\{System\}\},\\underbrace\{\\mathcal\{Q\},\\left\\\{m\_\{\\mathcal\{T\}\}^\{\(t\)\},m\_\{\\mathcal\{S\}\}^\{\(t\)\}\\right\\\}\}\_\{\\text\{User\}\}\\right\)\(22\)wherefprf\_\{\\text\{pr\}\}denotes the prompt construction process\. Here,m𝒯\(t\)m\_\{\\mathcal\{T\}\}^\{\(t\)\}represents messages collected from the temporal neighbors𝒩in𝒯\(vi\)\\mathcal\{N\}\_\{in\}^\{\\mathcal\{T\}\}\(v\_\{i\}\)of agentvi\\mathit\{v\}\_\{i\}, whilem𝒮\(t\)m\_\{\\mathcal\{S\}\}^\{\(t\)\}represents those from the spatial neighbors𝒩in𝒮\(vi\)\\mathcal\{N\}\_\{in\}^\{\\mathcal\{S\}\}\(v\_\{i\}\)\. AfterTTrounds, the final output𝒪\\mathcal\{O\}is obtained by aggregating the final\-round responses:

𝒪=fAgg\(\{𝐌i\(T\)∣vi∈𝒱\}\)\\mathcal\{O\}=f\_\{\\text\{Agg\}\}\(\\\{\\mathbf\{M\}\_\{i\}^\{\(T\)\}\\mid\\mathit\{v\}\_\{i\}\\in\\mathcal\{V\}\\\}\)\(23\)where the aggregation strategyfAggf\_\{\\text\{Agg\}\}may vary across implementations such as majority voting\.

## Appendix BImplementation Details

### B\.1Datasets

We evaluateAgentReviveon a diverse set of benchmarks to assess general reasoning, domain\-specific, and hallucination\-challenging datasets\.

- •General Reasoning:MMLUHendryckset al\.\([2021](https://arxiv.org/html/2605.17348#bib.bib31)\)is a multi\-task benchmark covering 57 subjects across STEM, humanities, and social sciences\.GSM8KCobbeet al\.\([2021](https://arxiv.org/html/2605.17348#bib.bib24)\)consists of grade\-school math word problems requiring multi\-step reasoning\.
- •Domain\-Specific Tasks:AQuAPatelet al\.\([2021](https://arxiv.org/html/2605.17348#bib.bib23)\)includes around 100,000 algebraic word problems with natural language rationales\.SVAMPLinget al\.\([2017](https://arxiv.org/html/2605.17348#bib.bib22)\)contains simple arithmetic problems with varying structures to test robustness\.HumanEvalChenet al\.\([2021](https://arxiv.org/html/2605.17348#bib.bib30)\)comprises programming problems evaluating functional correctness in code generation\.
- •Hallucination Challenge:TruthfulQALinet al\.\([2022](https://arxiv.org/html/2605.17348#bib.bib21)\)is designed to measure a model’s tendency to produce plausible\-sounding but incorrect answers \(hallucinations\)\.

### B\.2Baselines

We compareAgentReviveagainst a broad range of strong baselines:

- •Single\-Agent Methods:Vanilla LLMs utilize standard prompting without structured reasoning, using Llama3\-8BDubeyet al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib26)\), Qwen2\.5\-72BYanget al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib29)\), and Deepseek\-V3\-671B\-InstructDeepSeek\-AIet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib28)\)as backbone models\. Chain\-of\-Thought \(CoT\)Weiet al\.\([2022](https://arxiv.org/html/2605.17348#bib.bib18)\)enhances reasoning capability through step\-by\-step solutions with a single agent, demonstrating the upper bound of individual agent performance\. Self\-Consistency \(SC\)Wanget al\.\([2023](https://arxiv.org/html/2605.17348#bib.bib34)\)samples multiple reasoning paths from a single agent and selects the most consistent answer, providing advanced single\-agent reasoning\.
- •Fixed\-Topology Multi\-Agent Systems:MASround=1\\text\{MAS\}\_\{\\text\{round\}=1\}implements single\-round multi\-agent collaboration with a fixed communication graph structure, testing basic collaborative capability without iterative refinement\.MASround=T\\text\{MAS\}\_\{\\text\{round\}=T\}extends this to multiple communication rounds in a fixed topology, evaluating the effect of extended but rigid agent interactions\. AutoGenWuet al\.\([2023](https://arxiv.org/html/2605.17348#bib.bib43)\)is a programmable multi\-agent conversation framework with customizable agent roles and interaction patterns, representing structured but static collaboration\. AgentVerseChenet al\.\([2024b](https://arxiv.org/html/2605.17348#bib.bib20)\)supports diverse multi\-agent interaction protocols using predefined templates, testing limits of manually designed topologies\.
- •Graph\-Based Dynamic MAS:G\-DesignerZhanget al\.\([2025b](https://arxiv.org/html/2605.17348#bib.bib46)\)optimizes communication via graph neural networks, learning edge weights based on task characteristics\. AgentPruneZhanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib6)\)prunes redundant edges in spatial and temporal dimensions, representing hard pruning methods\. ARG\-DesignerLiet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib11)\)generates collaboration graphs autoregressively from scratch, exploring dynamic topology construction rather than pruning\. AgentDropoutWanget al\.\([2025b](https://arxiv.org/html/2605.17348#bib.bib7)\)dynamically eliminates underperforming agents and edges across communication rounds, combining node and edge dropout strategies for efficient collaboration\.

![Refer to caption](https://arxiv.org/html/2605.17348v1/x4.png)Figure 4:Prompt design for aggregating edge weights and neighboring agent responses \(Example from GSM8K\)\.
### B\.3Implementation Details

We implementAgentRevivein PyTorch, conducting experiments on 2 NVIDIA A800 GPUs\. For larger models \(Qwen2\.5\-72B and Deepseek\-V3\-671B\), we utilize official APIs for inference\.

We set the number of communication roundsT=2T=2for reasoning and math tasks, andT=4T=4for code generation\. The number of policy training steps isK=50K=50, and the number of graph samplesM=20M=20\. Other hyperparameters include learning rateη=0\.1\\eta=0\.1and survival thresholdγ=0\.6\\gamma=0\.6\. The balance coefficientηrisk\\eta\_\{\\text\{risk\}\}in the reward is set to0\.50\.5\.

The REINFORCE algorithm is used for policy gradient updates\. The state encoderfEncf\_\{\\text\{Enc\}\}is a single\-layer LSTM, and the state policy networkπθ\\pi\_\{\\theta\}is a 2\-layer MLP\.

Model training proceeds in two stages: State\-Aware Policy Learning, followed by State\-Aware Edge Optimization\. Each stage uses 40 training instances sampled from the corresponding dataset’s training or validation split\.

We report accuracy for all benchmarks\. Token consumption is measured as the sum of prompt and completion tokens across all agents and rounds\.

For weighted adjacency matrices and response variables of string data type, we use the prompt illustrated in Fig\.[4](https://arxiv.org/html/2605.17348#A2.F4)to combine them for propagation within the communication graph\.

0:MAS topology graph

𝒢=\(𝒱,ℰ𝒯,ℰS\)\\mathcal\{G\}=\(\\mathcal\{V\},\\mathcal\{E\}^\{\\mathcal\{T\}\},\\mathcal\{E\}^\{S\}\), state policy parameters

θ\\theta, weighted adjacency matrices

𝒜~=𝒜~S∪𝒜~T\\tilde\{\\mathcal\{A\}\}=\\tilde\{\\mathcal\{A\}\}\_\{S\}\\cup\\tilde\{\\mathcal\{A\}\}\_\{T\}, training steps

K1,K2K\_\{1\},K\_\{2\}, sampling times

MM, learning rate

η\\eta, survival threshold

γ\\gamma, balance coefficient

ηrisk\\eta\_\{risk\}
0:Trained state policy parameters

θ∗\\theta^\{\*\}, optimized adjacency matrices

𝒜~eff\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}
1:\# Stage 1: State\-Aware Policy Learning

2:for

k=1k=1to

K1K\_\{1\}do

3:Sample

MMgraphs

\{𝒢m\}m=1M\\\{\\mathcal\{G\}\_\{m\}\\\}\_\{m=1\}^\{M\}from

𝒜~\\tilde\{\\mathcal\{A\}\}using DAG sampling

4:foreach sampled graph

𝒢m\\mathcal\{G\}\_\{m\}do

5:foragent

viv\_\{i\}at each round

ttto

TTdo

6:

𝒵i\(t\)=\[𝒵i\(S,\(t\)\)∥𝒵i\(T,\(t\)\)\]\\mathcal\{Z\}\_\{i\}^\{\(t\)\}=\[\\mathcal\{Z\}\_\{i\}^\{\(S,\(t\)\)\}\\\|\\mathcal\{Z\}\_\{i\}^\{\(T,\(t\)\)\}\]
7:

oi\(t\)=𝕀\(si\(t\)=“A”\)ℱ′\(si\(t\)\)\+𝕀\(si\(t\)=“S”\)fc\(oi\(t−1\)\)o\_\{i\}^\{\(t\)\}=\\mathbb\{I\}\(s\_\{i\}^\{\(t\)\}=\\text\{\`\`A''\}\)\\mathcal\{F\}^\{\\prime\}\(s\_\{i\}^\{\(t\)\}\)\+\\mathbb\{I\}\(s\_\{i\}^\{\(t\)\}=\\text\{\`\`S''\}\)f\_\{c\}\(o\_\{i\}^\{\(t\-1\)\}\)
8:

si\(t\)∼πθ\(⋅\|fEnc\(si\(t−1\),oi\(t\),𝒵i\(t\)\)\)s\_\{i\}^\{\(t\)\}\\sim\\pi\_\{\\theta\}\(\\cdot\|f\_\{\\text\{Enc\}\}\(s\_\{i\}^\{\(t\-1\)\},o\_\{i\}^\{\(t\)\},\\mathcal\{Z\}\_\{i\}^\{\(t\)\}\)\)
9:

frisk\(t\)=−𝔼v∈VA\(t\)\[KL\(Mv\(t\)∥M¯VA\(t\)\)\]f\_\{risk\}^\{\(t\)\}=\-\\mathbb\{E\}\_\{v\\in V\_\{A\}^\{\(t\)\}\}\[\\text\{KL\}\(M\_\{v\}^\{\(t\)\}\\\|\\bar\{M\}\_\{V\_\{A\}\}^\{\(t\)\}\)\]
10:endfor

11:endfor

12:Compute trajectory rewards:

R\(τ\)=μ\(𝒢\(T\)\)\+ηrisk⋅∑t=1Tfrisk\(t\)R\(\\tau\)=\\mu\(\\mathcal\{G\}^\{\(T\)\}\)\+\\eta\_\{risk\}\\cdot\\sum\_\{t=1\}^\{T\}f\_\{risk\}^\{\(t\)\}
13:Update policy parameters:

θ←θ\+η⋅1M∑m=1M∑t=1T∇θlog⁡πθ\(si\(t\)\)ℳvi\(t\)⋅\(R\(τ\)−b\)\\theta\\leftarrow\\theta\+\\eta\\cdot\\frac\{1\}\{M\}\\sum\_\{m=1\}^\{M\}\\sum\_\{t=1\}^\{T\}\\nabla\_\{\\theta\}\\log\\pi\_\{\\theta\}\(s\_\{i\}^\{\(t\)\}\)\\mathcal\{M\}\_\{v\_\{i\}\}^\{\(t\)\}\\cdot\(R\(\\tau\)\-b\)
14:endfor

15:\# Stage 2: State\-aware Edge Optimization

16:foreach agent

viv\_\{i\}do

17:\# Compute node survival rates

18:

ωi=1L∑l=1L𝕀\(πθ\(Q\)\(vil\)≠“T”\)\\omega\_\{i\}=\\frac\{1\}\{L\}\\sum\_\{l=1\}^\{L\}\\mathbb\{I\}\(\\pi\_\{\\theta\}^\{\(Q\)\}\(v\_\{i\}^\{l\}\)\\neq\\text\{\`\`T''\}\)
19:endfor

20:\# Apply binary mask to adjacency matrices

21:

𝐦∈\{0,1\}N,mi=1\\mathbf\{m\}\\in\\\{0,1\\\}^\{N\},m\_\{i\}=1if

ωi≥γ\\omega\_\{i\}\\geq\\gamma, else

0
22:

𝒜~eff=𝐌\(Q\)⊙𝒜~⊙𝐌\(Q\)T\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}=\\mathbf\{M\}^\{\(Q\)\}\\odot\\tilde\{\\mathcal\{A\}\}\\odot\\mathbf\{M\}^\{\(Q\)^\{T\}\}
23:for

k=1k=1to

K2K\_\{2\}do

24:Sample

MMgraphs

\{𝒢m′\}m=1M\\\{\\mathcal\{G\}\_\{m\}^\{\\prime\}\\\}\_\{m=1\}^\{M\}from

𝒜~eff\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}using DAG sampling

25:\#Compute edge optimization objective

26:

J=1M∑m=1Mμ\(𝒢m′\)−\[∑t=1T‖𝒜~Seff,\(t\)‖∗\+∑t=2T‖𝒜~Teff,\(t\)‖∗\]J=\\frac\{1\}\{M\}\\sum\_\{m=1\}^\{M\}\\mu\(\\mathcal\{G\}\_\{m\}^\{\\prime\}\)\-\\left\[\\sum\_\{t=1\}^\{T\}\\\|\\tilde\{\\mathcal\{A\}\}\_\{S\}^\{\\text\{eff\},\(t\)\}\\\|\_\{\*\}\+\\sum\_\{t=2\}^\{T\}\\\|\\tilde\{\\mathcal\{A\}\}\_\{T\}^\{\\text\{eff\},\(t\)\}\\\|\_\{\*\}\\right\]
27:\# Update adjacency matrices

28:

𝒜~eff←𝒜~eff\+η⋅∇𝒜~effJ\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}\\leftarrow\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}\+\\eta\\cdot\\nabla\_\{\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}\}J
29:endfor

30:return

θ∗,𝒜~eff\\theta^\{\*\},\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}

Algorithm 1AgentRevive Training Algorithm0:Trained state policy parameters

θ∗\\theta^\{\*\}, optimized adjacency matrices

𝒜~eff\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}, user query

𝒬\\mathcal\{Q\}, initial agent states

𝐬\(0\)=\{s1\(0\),s2\(0\),…,sN\(0\)\}\\mathbf\{s\}^\{\(0\)\}=\\\{s\_\{1\}^\{\(0\)\},s\_\{2\}^\{\(0\)\},\\ldots,s\_\{N\}^\{\(0\)\}\\\}, maximum communication rounds

TT
0:Final answer

𝒪\\mathcal\{O\}
1:Initialize agent memories

hi\(0\)h\_\{i\}^\{\(0\)\}for all

vi∈𝒱v\_\{i\}\\in\\mathcal\{V\}
2:Initialize agent responses

oi\(0\)o\_\{i\}^\{\(0\)\}for all

vi∈𝒱v\_\{i\}\\in\\mathcal\{V\}
3:for

t=1t=1to

TTdo

4:foreach agent

vi∈𝒱v\_\{i\}\\in\\mathcal\{V\}do

5:if

si\(t−1\)≠“Terminated”s\_\{i\}^\{\(t\-1\)\}\\neq\\text\{\`\`Terminated''\}then

6:

𝒵i\(t\)=\[𝒵i\(S,\(t\)\)∥𝒵i\(T,\(t\)\)\]\\mathcal\{Z\}\_\{i\}^\{\(t\)\}=\\left\[\\mathcal\{Z\}\_\{i\}^\{\(S,\(t\)\)\}~\\\|~\\mathcal\{Z\}\_\{i\}^\{\(T,\(t\)\)\}\\right\]
7:

si\(t\)∼πθ∗\(⋅\|fEnc\(ℳvi\(t\)\)\)s\_\{i\}^\{\(t\)\}\\sim\\pi\_\{\\theta^\{\*\}\}\(\\cdot~\|~f\_\{\\text\{Enc\}\}\(\\mathcal\{M\}\_\{v\_\{i\}\}^\{\(t\)\}\)\)
8:else

9:

si\(t\)←“Terminated”s\_\{i\}^\{\(t\)\}\\leftarrow\\text\{\`\`Terminated''\}
10:endif

11:endfor

12:foreach agent

vi∈𝒱v\_\{i\}\\in\\mathcal\{V\}do

13:if

si\(t\)=“Active”s\_\{i\}^\{\(t\)\}=\\text\{\`\`Active''\}then

14:

oi\(t\)=fpr\(ri\(t\),hi\(t−1\),q,𝒵i\(t\)\)o\_\{i\}^\{\(t\)\}=f\_\{\\text\{pr\}\}\\left\(r\_\{i\}^\{\(t\)\},h\_\{i\}^\{\(t\-1\)\},q,\\mathcal\{Z\}\_\{i\}^\{\(t\)\}\\right\)
15:\# Update Node Memory

16:

hi\(t\)←f\(hi\(t−1\),oi\(t\)\)h\_\{i\}^\{\(t\)\}\\leftarrow f\(h\_\{i\}^\{\(t\-1\)\},o\_\{i\}^\{\(t\)\}\)
17:elseif

si\(t\)=“Standby”s\_\{i\}^\{\(t\)\}=\\text\{\`\`Standby''\}then

18:

oi\(t\)←fc\(oi\(t−1\)\)=LLM\("Summarize:"\+oi\(t−1\)\)o\_\{i\}^\{\(t\)\}\\leftarrow f\_\{c\}\(o\_\{i\}^\{\(t\-1\)\}\)=\\text\{LLM\}\(\\text\{"Summarize:"\}\+o\_\{i\}^\{\(t\-1\)\}\)
19:

hi\(t\)←hi\(t−1\)h\_\{i\}^\{\(t\)\}\\leftarrow h\_\{i\}^\{\(t\-1\)\}
20:else

21:\#si\(t\)=“Terminated”s\_\{i\}^\{\(t\)\}=\\text\{\`\`Terminated''\}

22:

oi\(t\)←∅o\_\{i\}^\{\(t\)\}\\leftarrow\\emptyset
23:

hi\(t\)←hi\(t−1\)h\_\{i\}^\{\(t\)\}\\leftarrow h\_\{i\}^\{\(t\-1\)\}
24:endif

25:endfor

26:foreach agent

vi∈𝒱eff\(t\)v\_\{i\}\\in\\mathcal\{V\}\_\{\\text\{eff\}\}^\{\(t\)\}do

27:Propagate

oi\(t\)o\_\{i\}^\{\(t\)\}to spatial and temporal neighbors according to

𝒜~eff\\tilde\{\\mathcal\{A\}\}\_\{\\text\{eff\}\}
28:endfor

29:if

∀vi∈𝒱,si\(t\)∈\{“Standby”,“Terminated”\}\\forall v\_\{i\}\\in\\mathcal\{V\},s\_\{i\}^\{\(t\)\}\\in\\\{\\text\{\`\`Standby''\},\\text\{\`\`Terminated''\}\\\}then

30:break\{No active agents remain\}

31:endif

32:endfor

33:\# Answer Generation

34:

𝒱active\(T\)←\{vi∣si\(T\)=“Active”\}\\mathcal\{V\}\_\{\\text\{active\}\}^\{\(T\)\}\\leftarrow\\\{v\_\{i\}\\mid s\_\{i\}^\{\(T\)\}=\\text\{\`\`Active''\}\\\}
35:if

𝒱active\(T\)=∅\\mathcal\{V\}\_\{\\text\{active\}\}^\{\(T\)\}=\\emptysetthen

36:

𝒱active\(T\)←\{vi∣si\(T\)=“Standby”\}\\mathcal\{V\}\_\{\\text\{active\}\}^\{\(T\)\}\\leftarrow\\\{v\_\{i\}\\mid s\_\{i\}^\{\(T\)\}=\\text\{\`\`Standby''\}\\\}
37:endif

38:

𝒪=fagg\(\{oi\(T\)∣vi∈𝒱active\(T\)\}\)\\mathcal\{O\}=f\_\{\\text\{agg\}\}\(\\\{o\_\{i\}^\{\(T\)\}\\mid v\_\{i\}\\in\\mathcal\{V\}\_\{\\text\{active\}\}^\{\(T\)\}\\\}\)\(Eq\. 3\)

39:return

𝒪\\mathcal\{O\}

Algorithm 2AgentRevive Inference Algorithm

## Appendix CAlgorithm Description

The training algorithm, summarized in Algorithm[1](https://arxiv.org/html/2605.17348#alg1), captures the core two\-stage process ofAgentRevive:

- •Stage 1: State\-Aware Policy Learning—Learns optimal state transitions for agents using a risk\-aware reward signal\.
- •Stage 2: State\-aware Edge Optimization—Permanently prunes “Terminated” nodes and optimizes the remaining graph structure for both performance and sparsity\.

The inference algorithm, described in Algorithm[2](https://arxiv.org/html/2605.17348#alg2), captures the key phases of forward propagation inAgentRevive:

- •State Evolution:For each round, agents transition amongActive,Standby, andTerminatedstates based on the trained policy\.
- •Response Generation:Activeagents generate responses using current context;Standbyagents reuse compressed historical outputs;Terminatedagents produce no output\.
- •Message Propagation:OnlyActiveandStandbyagents propagate messages through the optimized graph\.
- •Early Stopping:The process halts if no agents remain in theActivestate\.
- •Answer Generation:The final output is aggregated fromActiveagents; if none remain, a fallback toStandbyagents is used\.

![Refer to caption](https://arxiv.org/html/2605.17348v1/x5.png)Figure 5:The description of prompt attack instructions \(Example for GSM8K\)\.![Refer to caption](https://arxiv.org/html/2605.17348v1/x6.png)Figure 6:Performance Comparison of prompt attack in GSK8K datasets\.Dataset→\\quad\\rightarrowVar\.NSFlex\.StateMMLUGSM8KAQuATruthfulQASVAMPHumanEvalAvg\.Models↓\\quad\\downarrowBase model: Qwen2\.5\-72B\-InstructVanilla✗✗82\.3591\.0283\.7562\.9892\.6785\.2883\.01CoT✗✗83\.66\(↑1\.31\)92\.19\(↑1\.17\)84\.58\(↑0\.83\)64\.61\(↑1\.63\)93\.35\(↑0\.68\)86\.67\(↑1\.39\)84\.18\(↑1\.17\)SC \(CoT\)✗✗83\.70\(↑1\.35\)93\.67\(↑2\.65\)86\.25\(↑2\.50\)64\.85\(↑1\.87\)93\.79\(↑1\.12\)86\.83\(↑1\.55\)84\.85\(↑1\.84\)Autogen✗✗82\.34\(↓0\.01\)92\.17\(↑1\.15\)85\.73\(↑1\.98\)65\.89\(↑2\.91\)93\.86\(↑1\.19\)87\.36\(↑2\.08\)84\.56\(↑1\.55\)AgentVerse✗✗81\.57\(↓0\.78\)91\.59\(↑0\.57\)84\.35\(↑0\.60\)66\.64\(↑3\.66\)92\.45\(↓0\.22\)87\.49\(↑2\.21\)84\.02\(↑1\.01\)MASround=1\{\}\_\{\\text\{round\}=1\}✗✗82\.35\(↑0\.00\)93\.52\(↑2\.50\)84\.58\(↑0\.83\)63\.98\(↑1\.00\)92\.36\(↓0\.31\)84\.17\(↓1\.11\)83\.49\(↑0\.48\)MASround=T\{\}\_\{\\text\{round\}=T\}✗✗84\.31\(↑1\.96\)93\.28\(↑2\.26\)85\.83\(↑2\.08\)65\.76\(↑2\.78\)94\.07\(↑1\.40\)87\.08\(↑1\.80\)85\.06\(↑2\.05\)G\-Designer✗✗81\.02\(↓1\.33\)92\.50\(↑1\.48\)86\.24\(↑2\.49\)67\.55\(↑4\.57\)92\.81\(↑0\.14\)86\.42\(↑1\.14\)84\.42\(↑1\.41\)AgentPrune✗✗83\.66\(↑1\.31\)93\.67\(↑2\.65\)87\.08\(↑3\.33\)67\.41\(↑4\.43\)94\.33\(↑1\.66\)86\.67\(↑1\.39\)85\.47\(↑2\.46\)ARG\-Designer✓✗84\.15\(↑1\.80\)92\.98\(↑1\.96\)87\.23\(↑3\.48\)68\.02\(↑5\.04\)94\.63\(↑1\.96\)86\.58\(↑1\.30\)85\.77\(↑2\.76\)AgentDropout✓✗84\.97\(↑2\.62\)93\.75\(↑2\.73\)87\.50\(↑3\.75\)69\.11\(↑6\.13\)95\.34\(↑2\.67\)87\.92\(↑2\.64\)86\.60\(↑3\.59\)AgentRevive✓✓86\.09\(↑3\.74\)94\.87\(↑3\.85\)89\.45\(↑5\.70\)72\.05\(↑9\.07\)96\.94\(↑4\.27\)90\.27\(↑4\.99\)88\.28\(↑5\.27\)Table 7:Performance comparison betweenAgentReviveand other baselines\.Var\. NSandFlex\. Statedenote the Variable Node Size and Flexible State MAS types described in Table[1](https://arxiv.org/html/2605.17348#S3.T1)\. The orange up arrow\(↑\)\{\{\\color\[rgb\]\{1,\.5,0\}\(\\uparrow\)\}\}and green down arrow\(↓\)\{\{\\color\[rgb\]\{0,0\.58984375,0\}\(\\downarrow\)\}\}respectively indicate the degree of performance improvement and decrease compared to the Vanilla base model\. Underlining indicates the second highest performance\.
## Appendix DExtra Experiments

### D\.1Main Results

Due to the space limitation, we present the general performance ofAgentReviveand other baselines across a suite of general reasoning, domain\-specific, and hallucination\-challenged benchmarks\. of Qwen2\.5\-72B in Table[7](https://arxiv.org/html/2605.17348#A3.T7)\.

We observe several key findings from the experimental results:\[1\]Regarding vanilla methods \(i\.e\., CoTWeiet al\.\([2022](https://arxiv.org/html/2605.17348#bib.bib18)\)and SCWanget al\.\([2023](https://arxiv.org/html/2605.17348#bib.bib34)\)\), they consistently outperform standard prompting across most tasks and model scales, demonstrating the importance of structured reasoning\. However, their performance gains are often limited, particularly on more complex tasks, due to reliance on a single agent’s knowledge and reasoning capacity\.\[2\]For MAS methods with fixed interaction patterns \(e\.g\.,MASround=1\\text\{MAS\}\_\{\\text\{round\}=1\},MASround=T\\text\{MAS\}\_\{\\text\{round\}=T\}, AutoGenWuet al\.\([2024](https://arxiv.org/html/2605.17348#bib.bib32)\), AgentVerseChenet al\.\([2024b](https://arxiv.org/html/2605.17348#bib.bib20)\)\), we observed that the performance of several MAS methods is sometimes inferior to single\-agent prompting, particularly on MMLUHendryckset al\.\([2021](https://arxiv.org/html/2605.17348#bib.bib31)\)and HumanEvalChenet al\.\([2021](https://arxiv.org/html/2605.17348#bib.bib30)\)\. We speculate that this is due to \(1\)Inefficient Communication Overhead: Fixed communication topologies may introduce noise or redundant information, distracting agents from finding optimal solutions on simple tasks that do not require broad knowledge integrationXuanet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib27)\); Changet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib25)\); and \(2\)Accumulation of Errors: In multi\-round systems \(MASround=T\\text\{MAS\}\_\{\\text\{round\}=T\}\), errors or hallucinations from one agent can propagate and be amplified in subsequent interactions, potentially leading to worse outcomes than single, careful reasoning chains\.\[3\]Graph\-based dynamic MAS methods generally surpass vanilla MAS approaches by optimizing the communication structure\. They demonstrate the benefits of adaptive topology in reducing redundancy and enhancing collaborative efficiency\. However, their “hard pruning” strategies risk permanently discarding agents who could be valuable in later stages, thereby limiting their potential for recovery and ultimate performance\.\[4\]Our proposedAgentReviveconsistently achieves state\-of\-the\-art performance across all baselines, with the most substantial improvements observed on the challenging TruthfulQA benchmarkLinet al\.\([2022](https://arxiv.org/html/2605.17348#bib.bib21)\), which is specifically designed to test a model’s propensity for hallucinations\. By proactively suspending “zombie” agents without permanent removal via dynamic agent state management, our system effectively mitigates hallucinations while retaining the potential for agent recovery\.

GraphModelMMLUGSM8KAQuATruthfulQASVAMPHumanEvalAvg\.Ptok\.Ctok\.LayeredMASround=T58\.2970\.5445\.9261\.0375\.3149\.0760\.034\.3M1\.1MARG\-Designer60\.8772\.3545\.6662\.3078\.1853\.8462\.203\.5M928KAgentDropout62\.1472\.7647\.8661\.9980\.0354\.7163\.252\.8M797KAgentRevive64\.7875\.2048\.8164\.7283\.3957\.6465\.762\.2M602KRandomMASround=T59\.0169\.8844\.9361\.2277\.9650\.3460\.554\.2M1\.0MARG\-Designer61\.2873\.1644\.7260\.9380\.0553\.1162\.213\.7M945KAgentDropout61\.2473\.6547\.1264\.3278\.6555\.3663\.392\.7M834KAgentRevive63\.1974\.3751\.2265\.1681\.4858\.3365\.622\.1M713KTable 8:Performance and average token consumption achieved with different initial communication graph topologies\. “Ptok\.” and “Ctok\.” indicate prompting tokens and completion tokens of the LLMs\.\(M\)and\(K\)represent the number of tokens at the million and thousand scale, respectively\.
### D\.2Robustness Verification

#### D\.2\.1Prompt Attack

To comprehensively evaluate the resilience of multi\-agent systems against adversarial perturbations, we design a systematic robustness verification experiment featuring two distinct prompt attack strategies, as illustrated in Fig\.[5](https://arxiv.org/html/2605.17348#A3.F5)\.

We implement two targeted attack instructions to expose vulnerabilities in MAS collaboration:

1. 1\.Input Prompt Attack:This attack corrupts a specific agent node by instructing it to stubbornly rely only on its previous judgments, disregarding information from neighbors\.
2. 2\.Response Prompt Attack:This attack manipulates the output generation process by forcing a mathematics expert agent to output a random number on the first line while providing plausible explanations to maintain credibility\.

In each round, only one agent is compromised, while all other agents operate normally\.

As shown in Fig\.[6](https://arxiv.org/html/2605.17348#A3.F6), we present the average performance degradation of various models under these adversarial scenarios\. The results reveal several insights into MAS robustness:

- •Rigid Multi\-round MAS:The simpleMASround=T\\text\{MAS\}\_\{\\text\{round\}=T\}model \(i\.e\., MAS\-T\) exhibits the most severe performance degradation under both attack types\. Its vulnerability arises from a rigid node design and fixed communication patterns, which lack mechanisms for adaptation or containment of compromised agents\. Consequently, adversarial outputs propagate freely through the network\.
- •Graph Pruning Methods:All graph\-pruning\-based methods suffer performance drops, with the autoregressive ARG\-DesignerLiet al\.\([2025](https://arxiv.org/html/2605.17348#bib.bib11)\)experiencing the largest decline\. We hypothesize this is due to the lack of established neighbor relationships during graph generation, limiting the potential for collaborative correction\. In contrast, AgentDropoutWanget al\.\([2025b](https://arxiv.org/html/2605.17348#bib.bib7)\)and AgentPruneZhanget al\.\([2025a](https://arxiv.org/html/2605.17348#bib.bib6)\)show relatively better resilience, as preserved connections enable limited cross\-validation and error correction\.
- •AgentRevive:Our framework exhibits the smallest performance degradation, retaining significantly higher accuracy in both attack scenarios\. This robustness derives from our Markov state\-aware mechanism’s inherent fault tolerance: instead of permanently removing compromised nodes, the system transitions them to the “Standby” state, isolating their influence but preserving the potential for future contribution\. If collaboration context changes, the agent may be reactivated, providing dynamic recovery that hard\-pruning methods lack\.

#### D\.2\.2Graph Structure Robustness

To further validate the robustness ofAgentReviveagainst variations in initial communication topology, we conduct experiments using different graph initialization schemes, specificallyLayeredandRandomstructures with Llama3\-8B\. The primary objective of this analysis is to demonstrate that our method maintains stable task performance and token efficiency regardless of the initial graph configuration\. As shown in Table[8](https://arxiv.org/html/2605.17348#A4.T8), while these sparser structures yield slightly lower overall performance than the fully connected graph in Table[2](https://arxiv.org/html/2605.17348#S4.T2), they consistently reduce token usage due to decreased communication redundancy\. Notably, on simpler tasks \(e\.g\., AQuA\), the Layered or Random graphs sometimes surpass the fully connected baseline, likely because dense topologies become unnecessarily redundant\. These findings confirm thatAgentRevivemaintains stable performance and token efficiency across diverse graph initializations, highlighting its strong adaptability and robustness\.

![[Uncaptioned image]](https://arxiv.org/html/2605.17348v1/x7.png)![[Uncaptioned image]](https://arxiv.org/html/2605.17348v1/x8.png)![[Uncaptioned image]](https://arxiv.org/html/2605.17348v1/x9.png)![[Uncaptioned image]](https://arxiv.org/html/2605.17348v1/x10.png)![[Uncaptioned image]](https://arxiv.org/html/2605.17348v1/x11.png)![[Uncaptioned image]](https://arxiv.org/html/2605.17348v1/x12.png)![[Uncaptioned image]](https://arxiv.org/html/2605.17348v1/x13.png)

### D\.3Case Study

We present a case study using the MMLU datasetHendryckset al\.\([2021](https://arxiv.org/html/2605.17348#bib.bib31)\)to illustrate the state transition process inAgentRevive\. To conserve token computation, our framework terminates nodes based on the credibility of the current answer\.

This case empirically demonstrates that the state\-aware framework inAgentReviveenhances multi\-agent systems through flexible agent management, leading to more accurate and resilient outcomes in complex tasks such as historical text analysis\. TheStandbystate acts as a buffer that distinguishes between temporary failures and permanent incompetence, ensuring that valuable agents are retained for future contributions and ultimately improving the system’s overall performance and reliability\.
Taming "Zombie'' Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution

Similar Articles

Recursive Multi-Agent Systems

How are you handling agent memory without turning it into a junk drawer?

MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution

MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

MetaAgent-X : Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning

Submit Feedback

Similar Articles

How are you handling agent memory without turning it into a junk drawer?
MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution
MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent
MetaAgent-X : Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning