TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices
Summary
TopoEvo is a topology-aware self-evolving multi-agent framework for root cause analysis in microservices that couples graph representation learning with structured, topology-constrained reasoning. It achieves absolute improvements of up to 3.44% in root cause localization accuracy and boosts fault-type classification performance by 4.39% to 16.81% across diverse datasets.
Cached at: 05/18/26, 06:33 AM
# A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices
Source: [https://arxiv.org/html/2605.15611](https://arxiv.org/html/2605.15611)
###### Abstract
Root cause analysis (RCA) in microservices is challenging due to (i) noisy and heterogeneous multimodal observability (metrics, logs, traces), (ii) cascading failure propagation that amplifies downstream symptoms, and (iii) non-stationary topology drift induced by autoscaling and rolling updates. Recent LLM-based RCA agents can generate tool-grounded explanations, yet they often remain topology-agnostic and suffer from *symptom-amplification bias*, misattributing the root cause to salient downstream victims. We propose TopoEvo, a topology-aware self-evolving multi-agent framework that couples graph representation learning with structured, topology-constrained reasoning. TopoEvo first introduces *Metric-orthogonal Multimodal Alignment* (MOMA), which decomposes metric embeddings into complementary subspaces and contrastively aligns logs and traces to reduce modality redundancy and sparsity, yielding stable node representations for graph encoding. It then applies *Vector Quantization* (VQ) to discretize topology-enhanced states into auditable *symptom tokens* with a symptom lexicon, enabling reliable retrieval and token-level evidence grounding. On top of these discrete topology cues, TopoEvo performs a multi-agent *Hypothesis–Evidence–Test* (HET) workflow to explicitly verify propagation-consistent explanations and separate initiating anomalies from amplified downstream symptoms. Finally, a *Self-Evolving Mechanism* refreshes hierarchical incident memory and performs conservative test-time adaptation with high-confidence pseudo-labels to maintain robustness under drift.
We evaluate TopoEvo on both a public AIOps benchmark and a real\-world production incident dataset\. Compared to the state\-of\-the\-art baselines, TopoEvo achieves absolute improvements of up to 3\.44% in root cause localization accuracy, while remarkably boosting fault\-type classification performance by 4\.39% to 16\.81% across diverse datasets\.
## I Introduction
Microservice architectures have become a dominant paradigm for building large-scale cloud applications due to their flexibility, support for independent deployment, and rapid iteration \[[1](https://arxiv.org/html/2605.15611#bib.bib1),[2](https://arxiv.org/html/2605.15611#bib.bib2),[3](https://arxiv.org/html/2605.15611#bib.bib3)\]. However, this architectural shift also makes modern systems increasingly complex to operate: failures rarely stay local, and subtle issues can quickly propagate along service dependencies, producing amplified and misleading symptoms downstream \[[4](https://arxiv.org/html/2605.15611#bib.bib4),[5](https://arxiv.org/html/2605.15611#bib.bib5)\]. As a result, *root cause analysis* (RCA), i.e., identifying the initiating faulty entity and its fault type, has become a critical capability for maintaining service reliability and meeting strict QoS/SLA requirements.
Despite substantial progress, existing RCA solutions still face three practical limitations in real microservice deployments. First, microservices generate *multimodal observability* (metrics, logs, traces), yet many approaches underutilize this richness or rely on simplistic fusion \[[6](https://arxiv.org/html/2605.15611#bib.bib6),[7](https://arxiv.org/html/2605.15611#bib.bib7),[8](https://arxiv.org/html/2605.15611#bib.bib8)\]. In particular, heterogeneous modalities differ in frequency, sparsity, noise patterns, and missingness, making *effective cross-modal alignment* non-trivial; naive concatenation often yields unstable representations and spurious correlations. Second, with the rise of LLMs, multi-agent diagnosis has emerged as a promising direction for RCA, enabling tool-grounded inspection and human-readable explanations. However, most agent-based pipelines remain *topology-agnostic*: they struggle to internalize microservice dependency constraints and therefore tend to over-trust the most salient downstream symptoms (symptom amplification), leading to inefficient investigation and frequent misattribution \[[9](https://arxiv.org/html/2605.15611#bib.bib9),[10](https://arxiv.org/html/2605.15611#bib.bib10),[11](https://arxiv.org/html/2605.15611#bib.bib11),[12](https://arxiv.org/html/2605.15611#bib.bib12)\]. Third, microservice environments are inherently non-stationary: autoscaling, rolling updates, and evolving call graphs introduce distribution shifts and out-of-distribution (OOD) behaviors. Static RCA models and fixed reasoning heuristics often degrade sharply under such drift, lacking a reliable mechanism to refresh knowledge and adapt \[[13](https://arxiv.org/html/2605.15611#bib.bib13),[14](https://arxiv.org/html/2605.15611#bib.bib14)\].
To address these challenges, we propose TopoEvo, a topology-aware, reasoning-enhanced, and self-evolving framework for joint *microservice root cause localization* and *fault type classification*. TopoEvo systematically connects representation learning with topology-constrained reasoning: (1) a *metric-anchored orthogonal multimodal alignment* module constructs stable shared subspaces for heterogeneous signals, mitigating modality sparsity and redundancy; (2) a *topology-aware enhancement* module discretizes topology-aware states via vector quantization (VQ) and builds a *symptom vocabulary*, turning opaque vectors into compact, retrievable, and auditable evidence units for reasoning; (3) a *reasoning-enhanced multi-agent* workflow follows an explicit hypothesis–evidence–test loop under topology constraints, reducing symptom-amplification errors and improving diagnostic efficiency; and (4) a *self-evolving mechanism* continuously refreshes incident knowledge and cautiously adapts the encoder under drift, improving robustness to OOD configurations.
Our main contributions are summarized as follows:
- Metric-anchored orthogonal multimodal alignment. We propose a metric-centric alignment scheme that decomposes metric embeddings into orthogonal subspaces and aligns logs and traces to complementary components via contrastive learning, mitigating modality sparsity and reducing redundant correlations in multimodal observability.
- Topology-aware symptom tokenization for agent reasoning. We introduce a VQ-based discretization of topology-aware graph states and construct a *symptom vocabulary* that maps discrete codes to compact, auditable symptom tokens, enabling topology-constrained reasoning and reducing symptom-amplification misattribution in LLM-based RCA.
- Reasoning-enhanced multi-agent RCA. We propose a hypothesis–evidence–test (HET) multi-agent workflow that explicitly models fault propagation paths and performs tool-grounded evidence verification, yielding more reliable and explainable RCA under noisy multimodal telemetry.
- Self-evolving adaptation under drift. We introduce a self-evolving mechanism that refreshes incident knowledge and continuously adapts the graph encoder and alignment objectives under distribution shift, improving robustness to dynamic microservice configurations and out-of-distribution (OOD) incidents.
## II Preliminaries and Motivation
### II-A Observability of Microservices
Microservice observability refers to the capability of inferring internal system states and diagnosing runtime issues from external signals emitted by distributed services\. In practice, observability data are commonly summarized as three pillars:*metrics*,*logs*, and*traces*\.
Metrics are time-stamped numerical measurements (e.g., QPS, latency percentiles, error rates, CPU/memory usage) that provide compact and continuous views of service health and resource consumption.
Logs are discrete event records produced by services and infrastructure components, typically containing semi-structured messages, levels, and contextual fields; they capture rich semantic clues about failures but are often noisy and heterogeneous.
Traces record end-to-end request executions across services, usually organized as a trace graph of *spans* with parent–child and causal relationships; they expose cross-service propagation paths and fine-grained latency breakdowns. Given an incident time window, we denote the multimodal observability of node $v$ as $\mathcal{O}_v = \{\mathbf{x}_v^{\text{metric}}, \mathbf{x}_v^{\text{log}}, \mathbf{x}_v^{\text{trace}}\}$, where $\mathbf{x}_v^{\text{metric}}$ is a metric time-series segment, $\mathbf{x}_v^{\text{log}}$ is a log snippet/set, and $\mathbf{x}_v^{\text{trace}}$ is the trace-derived feature set \[[15](https://arxiv.org/html/2605.15611#bib.bib15),[16](https://arxiv.org/html/2605.15611#bib.bib16)\].
### II-B Vector Quantization & Codebook
Vector Quantization (VQ) maps continuous embeddings into a finite set of discrete prototypes, enabling compact and symbolic representations \[[18](https://arxiv.org/html/2605.15611#bib.bib18)\]. Given a feature vector $\mathbf{z} \in \mathbb{R}^{d}$, VQ assigns it to the nearest entry in a learnable codebook $\mathcal{C} = \{\mathbf{c}_k\}_{k=1}^{K}$ by
$$k^{*} = \arg\min_{k \in [K]} \|\mathbf{z} - \mathbf{c}_k\|_2, \qquad \mathrm{VQ}(\mathbf{z}) = \mathbf{c}_{k^{*}}. \tag{1}$$
The codebook can be viewed as a dictionary of representative patterns that summarize frequently occurring structures in the embedding space. During training, VQ modules are typically optimized with a reconstruction-style objective (e.g., VQ-VAE), which pulls codebook entries toward the assigned vectors while encouraging encoder outputs to commit to selected prototypes. A common formulation is
$$\mathcal{L}_{\mathrm{VQ}} = \underbrace{\|\mathrm{sg}[\mathbf{z}] - \mathbf{c}_{k^{*}}\|_2^2}_{\text{codebook loss}} + \beta \underbrace{\|\mathbf{z} - \mathrm{sg}[\mathbf{c}_{k^{*}}]\|_2^2}_{\text{commitment loss}}, \tag{2}$$
where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator and $\beta$ controls the strength of the commitment term. By discretizing dense vectors into code indices, VQ yields interpretable “tokens” and can reduce sensitivity to noise, which is useful when exposing learned latent patterns to downstream reasoning modules.
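The nearest-code assignment of Eq. (1) and the value of the loss in Eq. (2) can be illustrated with a minimal sketch. The toy codebook, the input vector, and the choice `beta=0.25` are illustrative assumptions, not values from the paper. Because the stop-gradient operator only changes how gradients flow, both terms of Eq. (2) evaluate to the same squared distance in a forward pass.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z, codebook):
    """Eq. (1): return the index of the nearest code and the code itself.

    Minimizing the squared distance selects the same index as minimizing
    the (unsquared) L2 norm.
    """
    k_star = min(range(len(codebook)),
                 key=lambda k: squared_distance(z, codebook[k]))
    return k_star, codebook[k_star]


def vq_loss_value(z, code, beta=0.25):
    """Forward value of Eq. (2): codebook term + beta * commitment term.

    beta=0.25 is an assumed value chosen for illustration only.
    """
    distance = squared_distance(z, code)
    return distance + beta * distance


# Hypothetical toy codebook with K = 3 prototypes in d = 2 dimensions.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
z = [0.9, 1.2]
k_star, code = quantize(z, codebook)
loss = vq_loss_value(z, code)
```

In a trainable implementation the two terms would differ in which operand receives gradients; this sketch only reproduces the forward computation.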
### II-C Motivation
Microservice incidents are rarely isolated\. A fault triggered at an upstream component often propagates along the dependency graph, gradually amplifying observable symptoms \(e\.g\., retries, queueing, and timeouts\) at downstream services\. This propagation nature makes RCA fundamentally a*topology\-conditioned*inference problem: the most anomalous node is not necessarily the initiating cause, and correct diagnosis requires reasoning over multi\-level entities \(node–service–pod\) and their interactions\.
#### II-C1 Motivation 1
Topology\-unaware LLM\-based RCA suffers from symptom\-amplification bias\.
Figure 1: Illustration of symptom-amplification bias.
Recent LLM-based RCA systems have shown promising capabilities in tool use and explanation generation, yet most of them remain largely *topology-agnostic*. They typically ingest multimodal observations and produce conclusions by pattern matching or narrative reasoning, without enforcing structural constraints from the microservice dependency graph. This omission leads to a systematic failure mode: *symptom-amplification bias* (shown in Fig. [1](https://arxiv.org/html/2605.15611#S2.F1)). When latency and error signals accumulate along a call chain, downstream services may exhibit the strongest symptoms, so an unconstrained reasoning process tends to select the most salient downstream node as the root cause, even though it is merely a victim of upstream propagation.
This challenge is exacerbated by the*multi\-level*nature of microservice systems\. A root cause may originate from a pod\-level resource contention, a service\-level misconfiguration, or a node\-level failure, while the observable “peak anomaly” may appear at a different level\. Therefore, simply providing an LLM with raw logs/traces/metrics is insufficient: the model needs an explicit mechanism to*perceive*and*operate on*hierarchical topology evidence\. This motivates TopoEvo to couple a GAT\-based localizer with topology\-aware vector quantization \(VQ\) and a symptom lexicon, transforming dense topology\-aware states into discrete, retrievable symptom tokens\. These tokens serve as compact evidence units that allow the reasoning layer to stay anchored to propagation structure, explicitly separating*initiating anomalies*from*amplified downstream symptoms*\.
#### II-C2 Motivation 2
Dynamic topology requires reasoning\-enhanced adaptation under OOD drift\. Microservice environments are highly non\-stationary\. Autoscaling continuously adds/removes pods, rolling updates change service versions, and configuration changes alter call patterns and dependency edges\. Such dynamics induce distribution shift in both graph topology and observability, causing static RCA models to degrade—even when the fault semantics remain similar\. In practice, the same fault type may manifest differently after a topology change, and previously reliable patterns can become out\-of\-distribution \(OOD\)\.
This motivates an RCA design that is robust to topology drift\. On one hand, LLM\-based reasoning provides a natural advantage: it can test hypotheses, reconcile incomplete evidence, and generalize beyond exact pattern matches\. On the other hand, reasoning alone is not enough for sustained performance: the system must*retain*validated diagnostic knowledge and*adapt*to new topologies quickly\. Therefore, TopoEvo introduces a reasoning\-enhanced multi\-agent workflow \(hypothesis–evidence–test\) to explicitly verify causal explanations under topology constraints, and a self\-evolving mechanism that \(1\) refreshes hierarchical incident memory and \(2\) performs conservative test\-time adaptation using high\-confidence pseudo\-labels\. Together, these mechanisms enable TopoEvo to leverage strong OOD reasoning while continuously aligning its encoder and knowledge base with evolving service dependencies\.
## III Methodology
Figure 2: The overview of TopoEvo.
As shown in Fig. [2](https://arxiv.org/html/2605.15611#S3.F2), TopoEvo consists of five components that support joint microservice root cause localization and fault type classification: (1) data processing and dependency graph construction, (2) metric-orthogonal multimodal alignment on a fine-grained dependency graph, (3) topology-aware enhancement via vector quantization and symptom vocabulary, (4) reasoning-enhanced multi-agent diagnosis via hypothesis–evidence–test, and (5) a self-evolving mechanism using hierarchical memory and test-time adaptation.
### III-A Data Processing and Dependency Graph Construction
#### III-A1 Multimodal data preprocessing
For each entity $v \in V$, metrics, logs, and traces are encoded into modality embeddings.
- Metric signals are represented as a normalized multivariate time series $\mathbf{M} \in \mathbb{R}^{L \times D}$ and segmented into overlapping temporal patches (width $w$, stride $s$) to capture local dynamics. Each patch is encoded by a TCN and projected by an MLP to obtain the metric embedding $\mathbf{x}_v^{\text{metric}} \in \mathbb{R}^{E_m}$.
- Traces are parsed into span relations and aggregated into window-level statistics for each entity (e.g., latency, error rate, call frequency) after normalization. A 1D dilated CNN followed by an MLP produces the trace embedding $\mathbf{x}_v^{\text{trace}}$.
- Logs are parsed into templates using Drain3, and each window is represented by a PF-IDF vector that reweights template frequency by inverse document frequency across windows. This PF-IDF representation is passed through a lightweight projector to obtain the log embedding $\mathbf{x}_v^{\text{log}}$.
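Two of the preprocessing steps above can be sketched in a few lines: cutting a metric series into overlapping patches, and reweighting per-window template counts by inverse document frequency. The exact weighting formula is not given in the text, so the `log(N / df)` form below is an assumption, as are the function names and the toy data. The encoders (TCN, dilated CNN, projector) are omitted.

```python
import math


def segment_patches(series, width, stride):
    """Split an L x D series (list of rows) into overlapping windows."""
    if width <= 0 or stride <= 0:
        raise ValueError("width and stride must be positive")
    return [series[start:start + width]
            for start in range(0, len(series) - width + 1, stride)]


def idf_weighted_vectors(window_counts):
    """Reweight per-window template counts by inverse document frequency.

    window_counts: list of dicts mapping template ID -> count in that window.
    Assumed weighting: count * log(N / df), where df is the number of
    windows in which the template appears.
    """
    vocab = sorted({t for counts in window_counts for t in counts})
    n_windows = len(window_counts)
    doc_freq = {t: sum(1 for counts in window_counts if t in counts)
                for t in vocab}
    idf = {t: math.log(n_windows / doc_freq[t]) for t in vocab}
    vectors = [[counts.get(t, 0) * idf[t] for t in vocab]
               for counts in window_counts]
    return vocab, vectors


# Hypothetical metric series with L = 6 time steps and D = 2 KPIs.
metric_series = [[0.1, 1.0], [0.2, 1.1], [0.4, 0.9],
                 [0.3, 1.2], [0.9, 2.5], [1.0, 2.7]]
patches = segment_patches(metric_series, width=4, stride=2)

# Hypothetical template counts for three log windows.
log_windows = [{"T1": 3, "T2": 1}, {"T1": 2}, {"T1": 1, "T3": 4}]
vocab, log_vectors = idf_weighted_vectors(log_windows)
```

Template `T1` appears in every window, so its weight collapses to zero, which is the intended effect of the reweighting: ubiquitous templates carry little diagnostic signal.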
#### III-A2 Construction of a fine-grained service dependency graph
Although the microservice system intrinsically contains entities at three granularities (node, service, pod), we flatten it and construct a *homogeneous* directed graph $G = (V, E)$ for unified representation learning. A type function $\tau(v) \in \{\text{Node}, \text{Service}, \text{Pod}\}$ is retained merely as a categorical feature to record the semantic role of each node. The edge set $E$ encodes connectivity derived from both interaction/propagation relations (e.g., service-to-service or pod-to-pod calls) and structural relations (e.g., pod-to-service membership and pod-to-node placement). In this homogeneous formulation, the graph provides unified *structural constraints* for message passing across all entity levels, while multimodal observability and type information are injected through node features.
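A minimal sketch of this flattening is given below, assuming hypothetical entity names and assuming that structural edges point from pod to service and from pod to node, as the phrasing "pod-to-service" and "pod-to-node" suggests. The entity type survives only as a one-hot node feature.

```python
from collections import defaultdict

# Hypothetical entities at the three granularities.
node_type = {
    "node-1": "Node",
    "svc-cart": "Service",
    "svc-db": "Service",
    "pod-cart-0": "Pod",
    "pod-db-0": "Pod",
}

# Interaction edges (calls) and structural edges (membership, placement).
call_edges = [("svc-cart", "svc-db"), ("pod-cart-0", "pod-db-0")]
membership_edges = [("pod-cart-0", "svc-cart"), ("pod-db-0", "svc-db")]
placement_edges = [("pod-cart-0", "node-1"), ("pod-db-0", "node-1")]

TYPE_ORDER = ("Node", "Service", "Pod")


def build_graph(node_type, *edge_groups):
    """Merge all edge groups into one homogeneous directed edge set."""
    edges = set()
    for group in edge_groups:
        for src, dst in group:
            if src not in node_type or dst not in node_type:
                raise KeyError(f"unknown entity in edge ({src}, {dst})")
            edges.add((src, dst))
    in_neighbors = defaultdict(list)
    for src, dst in sorted(edges):
        in_neighbors[dst].append(src)
    return sorted(edges), dict(in_neighbors)


def type_one_hot(entity):
    """Encode the entity type as a categorical node feature."""
    return [1 if node_type[entity] == t else 0 for t in TYPE_ORDER]


edges, in_neighbors = build_graph(
    node_type, call_edges, membership_edges, placement_edges)
```

The `in_neighbors` map is the structure a message-passing encoder would iterate over; all edge kinds are treated identically once merged.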
### III-B Metric-Orthogonal Multimodal Alignment on a Fine-grained Dependency Graph
#### III-B1 Orthogonal regularization and contrastive alignment
TopoEvo anchors alignment on metrics, which are continuous, high\-frequency, and consistently available, providing a stable reference under noisy or missing observability from logs and traces\. This metric\-centric alignment reduces ambiguity from sparse modalities and yields a well\-posed shared space for cross\-entity comparison\.
To prevent collapse where logs and traces align to the same metric factors, we introduce an orthogonal decomposition of metric features into two complementary components, dedicated to log\-consistent and trace\-consistent alignment, respectively, thereby preserving information while reducing redundancy in downstream graph reasoning\.
Given metric embeddings $\mathbf{x}_v^{\text{metric}}$, two components are produced as
$$\mathbf{u}_v = \mathbf{W}_u \mathbf{x}_v^{\text{metric}}, \qquad \mathbf{v}_v = \mathbf{W}_v \mathbf{x}_v^{\text{metric}}, \tag{3}$$
where $\mathbf{u}_v$ is intended to capture metric factors most consistent with log evidence, while $\mathbf{v}_v$ captures complementary factors most consistent with trace evidence. To encourage these components to span different subspaces, an orthogonality regularizer is imposed:
$$\mathcal{L}_{\perp} = \sum_{v \in \mathcal{B}} \left( \frac{\mathbf{u}_v^{\top} \mathbf{v}_v}{\|\mathbf{u}_v\|_2 \, \|\mathbf{v}_v\|_2 + \epsilon} \right)^{2}. \tag{4}$$
Modality alignment is performed with InfoNCE over a mini-batch $\mathcal{B}$:
$$\mathcal{L}_{\mathrm{nce}}(\mathbf{a}, \mathbf{b}) = -\sum_{v \in \mathcal{B}} \log \frac{\exp(\mathrm{sim}(\mathbf{a}_v, \mathbf{b}_v)/\tau)}{\sum_{v' \in \mathcal{B}} \exp(\mathrm{sim}(\mathbf{a}_v, \mathbf{b}_{v'})/\tau)}, \tag{5}$$
yielding $\mathcal{L}_{\text{log} \leftrightarrow u} = \mathcal{L}_{\mathrm{nce}}(\mathbf{x}^{\text{log}}, \mathbf{u})$ and $\mathcal{L}_{\text{trace} \leftrightarrow v} = \mathcal{L}_{\mathrm{nce}}(\mathbf{x}^{\text{trace}}, \mathbf{v})$. Overall, the alignment objective combines contrastive alignment and orthogonality:
$$\mathcal{L}_{\mathrm{align}} = \mathcal{L}_{\text{log} \leftrightarrow u} + \mathcal{L}_{\text{trace} \leftrightarrow v} + \lambda_{\perp} \mathcal{L}_{\perp}. \tag{6}$$
To improve training stability and mitigate oscillations, we first pretrain the alignment module using the alignment loss $\mathcal{L}_{\mathrm{align}}$ before proceeding to end-to-end optimization.
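The forward values of Eqs. (4) to (6) can be reproduced with a small sketch. Here $\mathrm{sim}$ is assumed to be cosine similarity, and `tau=0.1` and `lambda_perp=0.1` are assumed values; the text does not fix any of these at this point. The toy batch is hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b, eps=1e-8):
    """Cosine similarity with the epsilon-stabilized denominator of Eq. (4)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)


def orthogonality_loss(U, V, eps=1e-8):
    """Eq. (4): sum of squared cosines between paired components."""
    return sum(cosine(u, v, eps) ** 2 for u, v in zip(U, V))


def info_nce(A, B, tau=0.1):
    """Eq. (5): each a_v is contrasted against every b_v' in the batch."""
    loss = 0.0
    for v, a in enumerate(A):
        logits = [cosine(a, b) / tau for b in B]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(
            sum(math.exp(l - peak) for l in logits))
        loss -= logits[v] - log_denominator
    return loss


def alignment_loss(x_log, x_trace, U, V, lambda_perp=0.1, tau=0.1):
    """Eq. (6): two contrastive terms plus the weighted regularizer."""
    return (info_nce(x_log, U, tau) + info_nce(x_trace, V, tau)
            + lambda_perp * orthogonality_loss(U, V))


# Hypothetical batch of two entities with 2-dimensional embeddings.
U = [[1.0, 0.0], [0.0, 2.0]]        # log-consistent metric components
V = [[0.0, 3.0], [1.0, 0.0]]        # trace-consistent metric components
x_log = [[0.9, 0.1], [0.1, 0.8]]
x_trace = [[0.1, 1.0], [0.9, 0.0]]
total = alignment_loss(x_log, x_trace, U, V)
```

In this toy batch each `u_v` is exactly orthogonal to its `v_v`, so the regularizer vanishes and the total is driven by the two contrastive terms.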
#### III-B2 GAT-based topology-aware representation learning for root cause analysis
The node input feature is obtained by fusing modality embeddings:
$$\mathbf{x}_v = \psi_n\big([\mathbf{x}_v^{\text{metric}};\ \mathbf{x}_v^{\text{log}};\ \mathbf{x}_v^{\text{trace}}]\big). \tag{7}$$
A GAT encoder is applied on $G$. Let $\mathbf{h}_v^{(0)} = \mathbf{x}_v$. For each layer $\ell$ and edge $(j \rightarrow i)$,
$$e_{ij}^{(\ell)} = \mathrm{LeakyReLU}\!\left(\mathbf{a}^{(\ell)\top}\left[\mathbf{W}^{(\ell)} \mathbf{h}_i^{(\ell)} \,\|\, \mathbf{W}^{(\ell)} \mathbf{h}_j^{(\ell)}\right]\right), \tag{8}$$
$$\alpha_{ij}^{(\ell)} = \frac{\exp(e_{ij}^{(\ell)})}{\sum_{j' \in \mathcal{N}(i)} \exp(e_{ij'}^{(\ell)})}, \tag{9}$$
$$\mathbf{h}_i^{(\ell+1)} = \sigma\!\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(\ell)} \mathbf{W}^{(\ell)} \mathbf{h}_j^{(\ell)}\right), \tag{10}$$
where $\|$ denotes concatenation, $\mathbf{a}^{(\ell)}$ is the learnable attention vector of standard GAT that reduces the concatenated features to a scalar score, $\mathcal{N}(i)$ is the neighborhood of node $i$, and $\sigma$ is a nonlinearity. The final topology-aware representation is $\mathbf{h}_v = \mathbf{h}_v^{(L)}$.
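A single-head attention layer in the spirit of Eqs. (8) to (10) can be sketched as follows. Three assumptions are made here: the scalar score uses the attention vector `a` of the standard GAT formulation, $\sigma$ is taken to be ReLU, and every node to be updated has at least one neighbor (for example via a self-loop). The weights and features are toy values.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(h, neighbors, W, a):
    """One single-head attention layer over the listed neighbors.

    h: dict node -> feature list; neighbors: dict node -> neighbor list;
    W: weight matrix; a: attention vector over [W h_i || W h_j].
    Returns (new_states, attention), both keyed by target node.
    """
    new_states, attention = {}, {}
    for i, neigh in neighbors.items():
        wh_i = matvec(W, h[i])
        wh = {j: matvec(W, h[j]) for j in neigh}
        # Eq. (8): scalar score from the concatenated projections.
        scores = {j: leaky_relu(sum(ak * x for ak, x in zip(a, wh_i + wh[j])))
                  for j in neigh}
        # Eq. (9): softmax over the neighborhood (max-shifted for stability).
        peak = max(scores.values())
        exps = {j: math.exp(s - peak) for j, s in scores.items()}
        total = sum(exps.values())
        alpha = {j: e / total for j, e in exps.items()}
        # Eq. (10): attention-weighted aggregation followed by ReLU.
        agg = [sum(alpha[j] * wh[j][d] for j in neigh)
               for d in range(len(wh_i))]
        new_states[i] = [max(0.0, x) for x in agg]
        attention[i] = alpha
    return new_states, attention


# Hypothetical 3-node example: node "c" attends to "a", "b", and itself.
h = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"c": ["a", "b", "c"]}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0, 1.0, 0.0]
new_states, attention = gat_layer(h, neighbors, W, a)
```

With this attention vector the score depends only on the first projected feature of the neighbor, so `a` and `c` receive equal weight and `b` receives less.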
For *root-cause localization*, we apply an MLP to produce per-entity probabilities:
$$\mathbf{p}^{\mathrm{rcl}} = \mathrm{softmax}\!\left(\mathrm{MLP}(\mathbf{h})\right), \tag{11}$$
where $\mathbf{h}$ stacks $\mathbf{h}_v$ for all $v \in V$ and $p_v^{\mathrm{rcl}}$ is the predicted root-cause probability of entity $v$. The objective of root cause localization is
$$\mathcal{L}_{\mathrm{rcl}} = -\frac{1}{|\mathcal{D}|} \sum_{(G, y) \in \mathcal{D}} \sum_{v \in V} \mathbb{I}[v = y] \log p_v^{\mathrm{rcl}}(G), \tag{12}$$
where $|\mathcal{D}|$ is the dataset size and $y$ is the ground-truth root-cause entity of incident graph $G$.
For *fault-type classification*, we first derive a graph-level representation via attentive pooling and then predict the fault-type distribution by an MLP followed by softmax:
$$\mathbf{g} = \sum_{v \in V} \alpha_v \mathbf{h}_v, \qquad \hat{\mathbf{y}} = \mathrm{softmax}(\mathrm{MLP}(\mathbf{g})), \tag{13}$$
where $\alpha_v$ is produced by a shallow scoring function on $\mathbf{h}_v$ and normalized over $V$.
Let $y^{\mathrm{cls}} \in \{1, \ldots, C\}$ be the ground-truth fault type. The objective of fault-type classification is
$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{|\mathcal{D}|} \sum_{(G, y^{\mathrm{cls}}) \in \mathcal{D}} \sum_{c=1}^{C} \mathbb{I}[c = y^{\mathrm{cls}}] \log \hat{y}_c. \tag{14}$$
The joint objective is
$$\mathcal{L}_{\mathrm{joint}} = \mathcal{L}_{\mathrm{rcl}} + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}}. \tag{15}$$
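For a single incident, the two heads and the joint objective reduce to a pair of cross-entropy terms. The sketch below replaces the MLPs and the shallow scoring function with precomputed scores, which is an assumption made only to keep the example short. `lambda_cls=0.5` is likewise an assumed value, and class indices start at 0 rather than 1.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def localization_loss(node_scores, root_index):
    """Per-incident term of Eq. (12): -log p of the true root-cause entity."""
    return -math.log(softmax(node_scores)[root_index])


def attentive_pool(node_states, pooling_scores):
    """Eq. (13), left part: weights normalized over all nodes."""
    weights = softmax(pooling_scores)
    dim = len(node_states[0])
    return [sum(w * h[d] for w, h in zip(weights, node_states))
            for d in range(dim)]


def classification_loss(class_logits, true_class):
    """Per-incident term of Eq. (14)."""
    return -math.log(softmax(class_logits)[true_class])


# Hypothetical incident with three entities and two fault types.
node_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
node_scores = [2.0, 0.5, -1.0]      # stand-in for MLP(h) outputs
pooling_scores = [0.0, 0.0, 0.0]    # stand-in for the scoring function
class_logits = [0.1, 1.5]           # stand-in for MLP(g) outputs

p_rcl = softmax(node_scores)
g = attentive_pool(node_states, pooling_scores)
l_rcl = localization_loss(node_scores, root_index=0)
l_cls = classification_loss(class_logits, true_class=1)
lambda_cls = 0.5
l_joint = l_rcl + lambda_cls * l_cls  # Eq. (15)
```

Averaging `l_rcl` and `l_cls` over all incidents in the dataset recovers the $1/|\mathcal{D}|$ factors of Eqs. (12) and (14).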
### III-C Topology-Aware Enhancement via Vector Quantization and Symptom Vocabulary
As shown in Fig. [3](https://arxiv.org/html/2605.15611#S3.F3), this component is central to TopoEvo: it converts dense graph representations into discrete, retrievable, and auditable symptom tokens, enabling the reasoning process to operate on compact evidence rather than opaque vectors.
#### III-C1 Vector quantization of topology-aware states
Vector quantization maps topology-aware node representations $\mathbf{h}_v$ to a finite codebook $\mathcal{C} = \{\mathbf{c}_k\}_{k=1}^{K}$. For each node $v$, the nearest code index and its quantized representation are
$$q_v = \arg\min_{k} \|\mathbf{h}_v - \mathbf{c}_k\|_2^2, \qquad \hat{\mathbf{h}}_v = \mathbf{c}_{q_v}. \tag{16}$$
The codebook is learned with the standard VQ objective using stop-gradient $\mathrm{sg}[\cdot]$:
$$\mathcal{L}_{\mathrm{vq}} = \sum_{v} \big\|\mathrm{sg}[\mathbf{h}_v] - \hat{\mathbf{h}}_v\big\|_2^2 + \beta \sum_{v} \big\|\mathbf{h}_v - \mathrm{sg}[\hat{\mathbf{h}}_v]\big\|_2^2. \tag{17}$$
Quantization acts as a bottleneck that suppresses incident-specific noise, encourages clustering of recurring failure manifestations, and provides stable discrete indices for retrieval.
#### III-C2 Symptom vocabulary construction
A symptom vocabulary translates each discrete code into an interpretable evidence unit. After quantization, each node is assigned to a code $q_v$, and nodes mapped to a code $k$ form a cluster $\mathcal{V}_k = \{v \mid q_v = k\}$. Each code is treated as a reusable *symptom prototype*; instead of storing opaque vectors, TopoEvo attaches a compact descriptor that summarizes what tends to be abnormal when nodes fall into this cluster and how such abnormalities manifest in the system topology.
Figure 3: Illustration of vector quantization and symptom vocabulary construction.
For each code $k$, a cluster-conditioned signature is estimated by aggregating observability evidence of nodes in $\mathcal{V}_k$ across training incidents. On the metric side, dominant KPI patterns are identified by ranking KPI dimensions using deviation statistics within $\mathcal{V}_k$ (e.g., typical z-score percentiles and frequency of crossing anomaly thresholds); the vocabulary keeps only a small set of representative KPIs together with typical magnitude bands to remain concise and robust. On the log side, raw messages are compressed into templates (e.g., Drain-style parsing) and the descriptor records templates that frequently co-occur with the code together with burstiness indicators, yielding stable textual anchors despite surface-form variability. On the trace side, propagation-relevant evidence is captured by recording recurring abnormal interaction patterns associated with the cluster, allowing the vocabulary to reflect whether anomalies are typically “incoming” (victim-like) or “outgoing” (source-like) along propagation. Finally, lightweight topology context such as neighbor-type distributions and degree profiles is attached to ground the token in the node’s structural role.
Each code $k$ is mapped to a structured symptom token,
$$\text{SymptomToken}(k) = \{\text{ID} = k,\ \text{Summary}(k),\ \text{EvidenceSignature}(k)\}, \tag{18}$$
where Summary is a short natural-language descriptor distilled from dominant cluster-conditioned patterns, and EvidenceSignature stores compact, verifiable pointers to the underlying statistics (KPI identifiers and value bands, template identifiers and burst scores, and interaction-pattern descriptors). This symptom vocabulary makes latent evidence auditable and provides stable retrieval keys across incidents, enabling topology-scoped token querying without exposing high-dimensional embeddings.
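One possible shape for the token of Eq. (18) is sketched below, covering only the metric side. The ranking rule (mean absolute z-score within the cluster), the field names, the entity names, and the hand-written summary string are all illustrative assumptions; the text does not specify how the summary is distilled or which deviation statistic is used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SymptomToken:
    code_id: int              # ID = k
    summary: str              # short natural-language descriptor
    evidence_signature: dict  # pointers to the underlying statistics


def build_clusters(code_of_node):
    """Group nodes by VQ code: V_k = {v | q_v = k}."""
    clusters = defaultdict(list)
    for node in sorted(code_of_node):
        clusters[code_of_node[node]].append(node)
    return dict(clusters)


def dominant_kpis(cluster_nodes, zscores, keep=2):
    """Rank KPIs by mean absolute z-score within the cluster (assumed rule)."""
    totals = defaultdict(float)
    for node in cluster_nodes:
        for kpi, z in zscores[node].items():
            totals[kpi] += abs(z)
    means = {kpi: total / len(cluster_nodes) for kpi, total in totals.items()}
    return sorted(means, key=lambda kpi: (-means[kpi], kpi))[:keep]


# Hypothetical code assignments and per-node KPI z-scores.
code_of_node = {"pod-a": 3, "pod-b": 3, "svc-c": 7}
zscores = {
    "pod-a": {"cpu": 4.0, "latency": 1.0, "mem": 0.5},
    "pod-b": {"cpu": 3.0, "latency": 2.0, "mem": 0.1},
    "svc-c": {"cpu": 0.2, "latency": 5.0, "mem": 0.3},
}

clusters = build_clusters(code_of_node)
token = SymptomToken(
    code_id=3,
    summary="CPU saturation with mild latency increase",  # hand-written placeholder
    evidence_signature={"kpis": dominant_kpis(clusters[3], zscores)},
)
```

A fuller signature would add template identifiers with burst scores, interaction-pattern descriptors, and topology context alongside the `kpis` entry.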
### III-D Reasoning-Enhanced Multi-Agent Diagnosis via Hypothesis–Evidence–Test
Instead of asking a single model to read all signals and directly output a root cause, TopoEvo decomposes diagnosis into five specialized roles that communicate through structured artifacts: a hypothesis planner (main prompt is shown in Fig. [4](https://arxiv.org/html/2605.15611#S3.F4)) proposes candidate explanations under topology constraints, three modality-specific agents (metric agent, log agent, trace agent) collect verifiable evidence, and a judge agent performs constraint-based verification and explicitly eliminates strong alternatives. Discrete symptom tokens from the lexicon provide compact, consistent context for these agents, while topology saliency focuses attention on propagation-relevant regions.
#### III-D1 Shared context
Candidate-centric subgraph and tokenized evidence. A candidate-centric subgraph is constructed to keep the workflow focused. TopoEvo selects the top-$K$ candidate nodes according to the root-cause score $s(v)$ and extracts an $H$-hop induced subgraph around them. All subsequent reasoning and tool queries are restricted to this subgraph, while nodes outside it are masked to prevent distraction by weakly related services. Within this region, each node is mapped to a small set of symptom tokens by querying the symptom vocabulary using its VQ code $q_v$. Together with the graph structure, the tokenized subgraph forms a shared, compact context that is passed to all agents.
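The subgraph extraction can be sketched as a top-$K$ selection followed by a breadth-first expansion. The sketch assumes that hops are counted over edges in either direction, which the text does not state, and uses hypothetical service names and scores.

```python
from collections import deque


def top_k(scores, k):
    """Select the k highest-scoring candidates (ties broken by name)."""
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]


def h_hop_nodes(edges, seeds, hops):
    """Nodes within `hops` steps of any seed, ignoring edge direction."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    depth = {v: 0 for v in seeds}
    queue = deque(seeds)
    while queue:
        v = queue.popleft()
        if depth[v] == hops:
            continue  # do not expand beyond the hop budget
        for n in adjacency.get(v, ()):
            if n not in depth:
                depth[n] = depth[v] + 1
                queue.append(n)
    return set(depth)


def induced_subgraph(edges, nodes):
    """Keep only edges whose endpoints both lie inside the node set."""
    return [(s, d) for s, d in edges if s in nodes and d in nodes]


# Hypothetical dependency edges and root-cause scores s(v).
edges = [("gateway", "cart"), ("cart", "db"),
         ("db", "disk"), ("gateway", "auth")]
scores = {"db": 0.70, "cart": 0.20, "gateway": 0.05,
          "auth": 0.03, "disk": 0.02}

candidates = top_k(scores, k=1)
kept_nodes = h_hop_nodes(edges, candidates, hops=1)
sub_edges = induced_subgraph(edges, kept_nodes)
masked = set(scores) - kept_nodes
```

Nodes in `masked` are the ones hidden from the agents; raising `hops` or `k` trades focus for coverage.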
#### III-D2 Hypothesis generation, evidence acquisition, and verification
Hypothesis generation is *structure-prior-driven* rather than free-form. Starting from the top-$K$ candidates, the planner uses saliency to highlight likely propagation routes, preferring paths that connect a candidate to downstream symptomatic nodes through high-saliency neighborhoods. Symptom tokens provide a compact description of what is abnormal at each node, allowing the planner to form hypotheses that are concrete enough to verify (e.g., “candidate $c$ exhibits a saturation-like token; downstream nodes exhibit timeout-like tokens along a salient chain”). To keep the hypothesis set small and diverse, near-duplicate hypotheses are merged by sharing route prefixes or dominant token patterns.
Figure 4: Main prompt of Hypothesis Planner.
Evidence acquisition follows the planner’s query plan and is intentionally *tool-grounded*. Instead of returning narrative explanations, modality agents return only structured evidence tied to entities in the candidate-centric subgraph, including timestamped onset estimates, magnitude bands, and template IDs or span summaries that can be checked later. This design prevents the workflow from being dominated by persuasive but unverifiable text: every claim used by the judge must be backed by cached evidence.
Verification makes the reasoning step explicit\. For each hypothesis, the judge checks temporal precedence \(whether the proposed root cause becomes abnormal no later than downstream symptoms within a slack\), path consistency \(whether evidence is located along the hypothesized route and consistent with the dependency graph\), and template consistency \(whether the collected evidence matches the intended failure template under strict/relaxed criteria to handle partial observability\)\. The outcome is summarized with supporting evidence, conflicting evidence, and missing\-but\-expected evidence, making the result auditable\.
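Two of the three checks, temporal precedence and path consistency, lend themselves to a short sketch; template consistency is omitted because the failure templates are not specified in the text. The edge list is assumed to be oriented in the direction of fault propagation, and the onset times, slack value, and result fields are hypothetical.

```python
def temporal_precedence(onset, root, downstream, slack):
    """Root must become abnormal no later than each observed downstream
    node, allowing `slack` time units of tolerance."""
    return all(onset[root] <= onset[n] + slack
               for n in downstream if n in onset)


def path_consistent(route, edges):
    """Every consecutive pair on the route must be an edge of the graph."""
    edge_set = set(edges)
    return all((a, b) in edge_set for a, b in zip(route, route[1:]))


def verify(route, onset, edges, slack=0.0):
    """Summarize conflicts and missing evidence for one hypothesis.

    route[0] is the proposed root cause; the rest is the propagation path.
    """
    root, downstream = route[0], route[1:]
    missing = [n for n in route if n not in onset]
    conflicts = []
    if not path_consistent(route, edges):
        conflicts.append("path")
    if root in onset and not temporal_precedence(onset, root, downstream, slack):
        conflicts.append("temporal")
    return {
        "route": route,
        "conflicts": conflicts,
        "missing": missing,
        "accepted": not conflicts and root in onset,
    }


# Hypothetical propagation edges and onset estimates (seconds).
edges = [("db", "cart"), ("cart", "gateway")]
onset = {"db": 100.0, "cart": 105.0}

supported = verify(["db", "cart", "gateway"], onset, edges, slack=2.0)
rejected = verify(["cart", "db"], onset, edges, slack=2.0)
```

The `missing` list corresponds to the missing-but-expected evidence in the judge's summary: the first hypothesis is accepted even though no onset was observed for `gateway`, while the second is rejected on both structural and temporal grounds.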
#### III-D3 Decision with explicit alternatives
The final output is not only a top\-1 root cause, but also an explicit elimination of strong alternatives\. TopoEvo reports the 4 most competitive alternative hypotheses and explains why they are rejected, attributing rejection to either*evidence conflict*\(clear contradictions under the above constraints\) or*missing evidence*\(key signatures required by the route/template are absent\)\. This “decision with alternatives” directly exposes the benefit of topology grounding and symptom\-token querying: conclusions are tied to verifiable evidence under explicit structural constraints, rather than produced as unchecked narratives\.
### III\-E Self\-Evolving Mechanism: Hierarchical Memory and Test\-Time Adaptation
This component supports continual effectiveness under non\-stationary systems by refreshing incident knowledge and cautiously adapting the encoder when reliable new supervision emerges\.
#### III\-E1 Hierarchical incident memory with stochastic forgetting
A persistent memory $\mathcal{M}$ stores solved incidents as compact records consisting of candidate\-centric topology fingerprints, symptom\-token sets, validated hypotheses with key evidence, and mitigation outcomes\. Updates follow a hierarchical policy: pod\-level patterns are refreshed more aggressively than service\- and node\-level patterns, stabilizing slow\-changing infrastructure knowledge while tracking fast\-changing deployment dynamics\.
To prevent redundancy collapse, similarity\-aware stochastic forgetting is applied when inserting a new incident representation $\mathbf{u}_{\mathrm{new}}$\. Let $\mathrm{sim}(\mathbf{u}_{\mathrm{new}},\mathbf{u}_{i})$ denote cosine similarity and $\mathcal{N}_{\tau}$ be the set of memories whose similarity exceeds the threshold $\tau$\. If $\mathcal{N}_{\tau}\neq\emptyset$, one similar item is forgotten, with item $i$ chosen according to a softmax over similarities scaled by $\gamma$:

$$p(i\mid\mathcal{N}_{\tau})=\frac{\exp\left(\mathrm{sim}(\mathbf{u}_{\mathrm{new}},\mathbf{u}_{i})/\gamma\right)}{\sum_{j\in\mathcal{N}_{\tau}}\exp\left(\mathrm{sim}(\mathbf{u}_{\mathrm{new}},\mathbf{u}_{j})/\gamma\right)}. \tag{19}$$
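A minimal sketch of Eq\. \(19\) follows, assuming incident representations are plain lists of floats\. The function names, the values $\tau=0.8$ and $\gamma=0.1$, and the example vectors are hypothetical; the paper does not report these settings in this section\.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def forgetting_probs(u_new, memory, tau=0.8, gamma=0.1):
    """Eq. (19): softmax over similarities of memory items above threshold tau."""
    neigh = {}
    for i, u in enumerate(memory):
        s = cosine(u_new, u)
        if s > tau:
            neigh[i] = s
    if not neigh:
        return {}
    top = max(neigh.values())  # subtract the max for numerical stability
    weights = {i: math.exp((s - top) / gamma) for i, s in neigh.items()}
    z = sum(weights.values())
    return {i: w / z for i, w in weights.items()}


def insert_with_forgetting(u_new, memory, tau=0.8, gamma=0.1, rng=random):
    """Drop one similar item (if any) sampled from Eq. (19), then insert u_new."""
    probs = forgetting_probs(u_new, memory, tau, gamma)
    if probs:
        idx = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        memory.pop(idx)
    memory.append(u_new)
    return memory


memory = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
u_new = [1.0, 0.05]
probs = forgetting_probs(u_new, memory)
memory = insert_with_forgetting(u_new, memory, rng=random.Random(0))
```

Only the two items close to `u_new` are eligible for forgetting; the dissimilar item `[0.0, 1.0]` is never removed, so the memory size stays constant while redundancy is reduced\.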
#### III\-E2 Test\-time adaptation with high\-confidence pseudo\-labels
During deployment, high\-confidence diagnoses \(strong support, low missingness, and consistent topology constraints\) are treated as pseudo\-labeled samples and accumulated in a buffer $\mathcal{B}_{\mathrm{tta}}$\. Once the buffer reaches a batch size, the encoder is updated with a conservative objective:

$$\mathcal{L}_{\mathrm{tta}}=\mathcal{L}_{\mathrm{rca}}^{\mathrm{pseudo}}+\lambda_{\mathrm{reg}}\|\theta-\theta_{0}\|_{2}^{2}, \tag{20}$$

where $\theta_{0}$ are the pre\-deployment parameters and the regularizer mitigates catastrophic drift\.
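As a worked illustration of Eq\. \(20\), the sketch below evaluates the objective on a toy buffer\. It assumes that $\mathcal{L}_{\mathrm{rca}}^{\mathrm{pseudo}}$ is a mean cross\-entropy over pseudo\-labels, which the text does not specify; the function name, $\lambda_{\mathrm{reg}}=0.5$, and all numbers are hypothetical\.

```python
import math


def tta_loss(probs, pseudo_labels, theta, theta0, lam_reg=0.5):
    """Eq. (20): pseudo-label loss plus an L2 anchor to pre-deployment parameters."""
    # Assumed form of the pseudo-label term: mean cross-entropy.
    ce = -sum(math.log(p[y]) for p, y in zip(probs, pseudo_labels)) / len(probs)
    # Squared L2 distance between current and pre-deployment parameters.
    reg = sum((a - b) ** 2 for a, b in zip(theta, theta0))
    return ce + lam_reg * reg


buffer_probs = [[0.8, 0.2], [0.1, 0.9]]  # predicted distributions for buffered incidents
buffer_labels = [0, 1]                   # high-confidence pseudo-labels
theta0 = [1.0, 1.0]

loss_anchored = tta_loss(buffer_probs, buffer_labels, theta0, theta0)
loss_drifted = tta_loss(buffer_probs, buffer_labels, [1.0, 2.0], theta0)
```

The same predictions incur a larger loss once the parameters move away from $\theta_{0}$, which is what keeps the update conservative\.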
At inference time, the same encoder produces topology\-aware representations and symptom tokens, while memory refresh and test\-time adaptation enable continual robustness under evolving microservice behavior\.
TABLE I: Experimental results of different approaches on root cause localization and fault classification\. Best results are bolded, and second\-best results are underlined\.
## IV Experiments
### IV\-A Experiment Setup
#### IV\-A1 Dataset and Preprocessing
Dataset A is a large\-scale public benchmark released by the AIOps Challenge\[[3](https://arxiv.org/html/2605.15611#bib.bib3)\]\. It is collected via controlled fault injection on a real\-world deployed microservice system, HipsterShop2\. The dataset provides multimodal observability signals, including *metrics, logs, and traces*\. HipsterShop2 is deployed on a dynamic Kubernetes \(K8s\) cluster with 10 services, each replicated by 4 pods \(40 pods in total\), and the pods are dynamically scheduled across 6 nodes\. The benchmark covers 15 fault types in total: 9 service/pod\-level faults \(in the K8s container context\) and 6 node\-level faults, including sudden memory pressure, disk space exhaustion, disk I/O anomalies, CPU pressure, and gradual CPU slowdown\.
Dataset B is a real\-world dataset collected from a production Project Management Platform operated by an Electric Power Information enterprise\. Unlike Dataset A, the incidents in Dataset B are captured under *real operating conditions* rather than injected failures\. The platform contains 12 microservices and 48 pods, and the dataset records multimodal observability \(metrics, logs, and traces\) during real incidents\. Faults in Dataset B span 5 categories: CPU hog, memory leak, network delay, packet loss, and disk payload overload\.
#### IV\-A2 Baselines
To comprehensively evaluate TopoEvo, we compare against representative RCA approaches covering multimodal learning, graph\-based localization, and LLM/agent\-based diagnosis\. Nezha\[[37](https://arxiv.org/html/2605.15611#bib.bib37)\] jointly encodes metrics, logs, and traces into a shared space with contrastive alignment and performs graph\-based reasoning to localize causally relevant services under noisy/partial observability\. Eadro\[[38](https://arxiv.org/html/2605.15611#bib.bib38)\] is an end\-to\-end multi\-task framework that jointly learns anomaly detection and root\-cause localization by modeling intra\-service behaviors and inter\-service dependencies from KPIs, logs, and traces\. HolisticRCA\[[39](https://arxiv.org/html/2605.15611#bib.bib39)\] performs holistic RCA in cloud\-native systems by standardizing heterogeneous observability data and reasoning over service dependencies to identify root causes across diverse failure scenarios\. TAMO\[[40](https://arxiv.org/html/2605.15611#bib.bib40)\] is a tool\-assisted LLM\-agent framework for fine\-grained RCA that integrates multimodal alignment and model tools \(e\.g\., localization/classification tools\) to support root\-cause analysis in cloud\-native systems\. RCAgent\[[9](https://arxiv.org/html/2605.15611#bib.bib9)\] is an autonomous, tool\-augmented LLM agent for practical cloud RCA, which iteratively collects evidence from observability tools and synthesizes a diagnosis\. mABC\[[11](https://arxiv.org/html/2605.15611#bib.bib11)\] is a blockchain\-inspired multi\-agent collaboration framework that reduces hallucination via decentralized voting and prevents non\-terminating loops with a step\-bounded standardized workflow\.
#### IV\-A3 Metrics
We evaluate *root cause localization* and *fault type classification*\. For localization, we report Top\-$K$ accuracy \(Acc@1/3/5\) and mean reciprocal rank \(MRR\)\. For fault\-type classification, we report micro\-/macro\-precision, micro\-/macro\-recall, and micro\-/macro\-F1 \(MiPr/MaPr, MiRe/MaRe, MiF1/MaF1\)\. For efficiency, we measure end\-to\-end diagnosis latency \(wall\-clock time per incident, including agent/tool calls when applicable\)\. For explanation quality, we report a topology faithfulness score, which checks whether the cited propagation paths are valid in the dependency graph $G$ and whether they overlap with topology\-salient \(high\-weight\) edges used by TopoEvo\.
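For reference, the localization metrics can be computed as follows\. This is a sketch using the standard definitions of Top\-$K$ accuracy and MRR; the function names and the two toy rankings are illustrative and do not come from the paper’s evaluation code\.

```python
def acc_at_k(rankings, truths, k):
    """Fraction of incidents whose true root cause is among the top-k candidates."""
    return sum(t in r[:k] for r, t in zip(rankings, truths)) / len(truths)


def mrr(rankings, truths):
    """Mean reciprocal rank of the true root cause (0 if it is not ranked)."""
    total = 0.0
    for r, t in zip(rankings, truths):
        if t in r:
            total += 1.0 / (r.index(t) + 1)
    return total / len(truths)


# Two toy incidents; the true root cause is P0 in both.
rankings = [["U2", "P0", "G1"], ["P0", "U2", "G1"]]
truths = ["P0", "P0"]

acc1 = acc_at_k(rankings, truths, 1)
acc3 = acc_at_k(rankings, truths, 3)
score = mrr(rankings, truths)
```

In the first incident the true root cause is ranked second, so it counts toward Acc@3 but not Acc@1 and contributes a reciprocal rank of 1/2\.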
### IV\-B Implementation
All experiments are conducted on a server equipped with an NVIDIA A100 80GB GPU and 256GB RAM\. The graph encoder adopts a two\-layer GAT architecture with 8 attention heads\. Training is performed using the Adam optimizer with a learning rate of 0\.001\. For the vector quantization module, the codebook size $K$ is set to 128\.
For all LLM\-based components, GPT\-4o\-2024\-11\-20 is used as the backbone model to ensure consistent reasoning ability across different settings\.
### IV\-C RQ1: Effectiveness of root cause localization and fault type classification
#### IV\-C1 Results
Table [I](https://arxiv.org/html/2605.15611#S3.T1) summarizes the results on root cause localization and fault type classification\. Overall, TopoEvo achieves the most competitive and consistent performance across datasets, obtaining the best results on most metrics and remaining close to the top performer elsewhere\.
For root cause localization, TopoEvo performs best on the service level, outperforming TAMO by 1\.23/0\.91/0\.37 percentage points on Acc@1/3/5\. On the pod level, it achieves the best Acc@1 and Acc@3 \(66\.10%/81\.20%\) and ranks second on Acc@5 \(87\.30%\), only slightly below TAMO\. On the node level and Dataset B, although TAMO is stronger on Acc@1/3, TopoEvo consistently ranks second and achieves the best Acc@5, indicating stronger *top\-$k$ coverage* of true root causes\.
For fault type classification, TopoEvo shows clear advantages in balanced prediction quality, especially on F1\. On $A_{s}$, it achieves the best MaPr, MiRe, MaRe, MiF1, and MaF1; on $A_{p}$ and $A_{n}$, it again leads on most recall and F1 metrics while remaining competitive on precision\. On $B$, although TopoEvo is not the best on individual precision or recall metrics, it achieves the highest MiF1 and MaF1, demonstrating stronger overall class discrimination\.
These results suggest that TopoEvo not only improves localization quality, but also provides more reliable downstream diagnosis\. This advantage comes from topology\-aware multimodal representation learning and the HET\-based multi\-agent reasoning process, which together improve candidate quality and reduce spurious decisions\.
TABLE II: Ablations on TopoEvo components\. We report root\-cause localization accuracy \(AC@1/3/5\)\. Best results are bolded, and second\-best are underlined\. Abbrev: MOMA = Metric\-Orthogonal Multimodal Alignment; VQ = Vector Quantization; HET = Hypothesis–Evidence–Test; SEM = Self\-Evolving Mechanism\.
### IV\-D RQ2: Ablation Study
Table [II](https://arxiv.org/html/2605.15611#S4.T2) reports ablations of TopoEvo on root\-cause localization \(AC@1/3/5\) across four subsets \($A_{s}$, $A_{p}$, $A_{n}$, and $B$\)\. Overall, removing any key component consistently degrades performance\.
Impact of Hypothesis–Evidence–Test \(HET\)\. Disabling the HET reasoning loop yields the largest drop, confirming that TopoEvo is not merely a one\-shot aggregation of signals but a verification\-driven diagnosis pipeline\. In particular, w/o HET reduces AC@1 by 15–16 points on all subsets, with similar degradation on AC@3/AC@5\. This suggests that explicit hypothesis testing and alternative exclusion are essential to avoid symptom\-root confusion when anomalies propagate\.
Impact of Metric\-Orthogonal Multimodal Alignment \(MOMA\)\. Removing MOMA also causes substantial performance loss \(typically 10 points on AC@1\), indicating that metric\-centered orthogonal alignment effectively reduces modality mismatch and stabilizes node representations before GAT encoding\. For example, on $A_{p}$ the AC@1 drops from 66\.10% to 56\.10%, and on $A_{n}$ from 84\.10% to 74\.10%, demonstrating its importance under correlated and noisy observability\.
Impact of Vector Quantization \(VQ\)\. Disabling VQ leads to a clear performance decrease across all subsets \(about 12 points on AC@1 in most cases\), showing that discretizing node states into symptom tokens provides robust, low\-entropy evidence for downstream reasoning\. Besides improving ranking accuracy \(e\.g\., 73\.10% $\rightarrow$ 61\.10% on $A_{s}$\), VQ also supports more consistent topology\-grounded explanations by enabling auditable token\-level references\.
Impact of the Self\-Evolving Mechanism \(SEM\)\. Finally, removing the self\-evolving mechanism produces a smaller but still notable drop \($\geq 8$ points on AC@1\), reflecting the benefit of retrieval priors and adaptation under non\-stationarity\. For instance, w/o SEM decreases AC@1 from 66\.10% to 58\.10% on $A_{p}$ and from 71\.90% to 63\.90% on $B$\. This confirms that continual experience reuse and adaptation help maintain RCA robustness as system behavior shifts\.
### IV\-E RQ3: Does VQ Learn a Better Symptom Vocabulary?
Figure 5: Parameter sensitivity analysis\. K denotes the VQ partition \(codebook size\) parameter\.

To verify whether VQ contributes more than a performance gain, we evaluate the quality of the learned symptom vocabulary\. We compare Learned VQ with two baselines built on the same frozen topology\-aware encoder: Random Codebook and Post\-hoc KMeans\. All methods use the same codebook size $K=128$\.
We measure vocabulary quality using Token Purity, Normalized Mutual Information \(NMI\), and Intra\-token Variance\. The first two metrics evaluate semantic alignment between token assignments and ground\-truth fault categories, while the last characterizes cluster compactness\.
As shown in Fig\. [5](https://arxiv.org/html/2605.15611#S4.F5), Learned VQ consistently outperforms both baselines on all metrics: it achieves higher Token Purity and Normalized Mutual Information, while yielding lower Intra\-token Variance\. These results show that VQ learns more compact and fault\-consistent discrete units than heuristic or post\-hoc discretization\.
Overall, the learned codebook forms a compact and discriminative symptom vocabulary, which provides more stable evidence units for retrieval and HET\-based reasoning\.
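The three vocabulary metrics can be sketched as below\. The paper does not give their formulas here, so this assumes the common definitions: purity as the majority\-class fraction per token, NMI normalized by the arithmetic mean of the two entropies, and intra\-token variance as the mean squared distance to the token centroid\. All names and the toy data are illustrative\.

```python
import math
from collections import Counter, defaultdict


def token_purity(tokens, labels):
    """Fraction of samples that belong to the majority fault class of their token."""
    groups = defaultdict(list)
    for t, y in zip(tokens, labels):
        groups[t].append(y)
    return sum(max(Counter(g).values()) for g in groups.values()) / len(tokens)


def entropy(xs):
    n = len(xs)
    return -sum(c / n * math.log(c / n) for c in Counter(xs).values())


def nmi(tokens, labels):
    """Mutual information normalized by the mean of the two entropies (assumed variant)."""
    n = len(tokens)
    pt, py = Counter(tokens), Counter(labels)
    mi = sum(
        c / n * math.log(c * n / (pt[t] * py[y]))
        for (t, y), c in Counter(zip(tokens, labels)).items()
    )
    denom = (entropy(tokens) + entropy(labels)) / 2
    return mi / denom if denom > 0 else 1.0


def intra_token_variance(tokens, vectors):
    """Mean squared distance of each state vector to the centroid of its token."""
    groups = defaultdict(list)
    for t, v in zip(tokens, vectors):
        groups[t].append(v)
    total = 0.0
    for vs in groups.values():
        centroid = [sum(col) / len(vs) for col in zip(*vs)]
        total += sum(sum((a - c) ** 2 for a, c in zip(v, centroid)) for v in vs)
    return total / len(tokens)


tokens = [0, 0, 1, 1]
labels = ["cpu", "cpu", "net", "net"]
vectors = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [5.0, 7.0]]

purity = token_purity(tokens, labels)
score = nmi(tokens, labels)
variance = intra_token_variance(tokens, vectors)
```

In this toy case each token maps to exactly one fault class, so purity and NMI are both at their maximum, and the variance reflects how tightly the state vectors cluster around each token’s centroid\.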
### IV\-F RQ4: Parameter Sensitivity
We study the sensitivity of TopoEvo to the VQ codebook size $K$, which controls the granularity of discretizing topology\-aware states into symptom tokens\. Fig\. [6](https://arxiv.org/html/2605.15611#S4.F6) reports root\-cause localization performance under different $K\in\{32,64,128,256,512\}$, including overall results and representative fault types\.
#### IV\-F1 Overall trend
Across both Acc@1 and Acc@5, performance improves as $K$ increases from 32 to 128, and then degrades when $K$ becomes larger \(256/512\)\. The best overall performance is achieved at $K=128$, indicating that a *moderate* codebook size provides the most effective discretization for token\-based reasoning\.
#### IV\-F2 Why too small $K$ hurts
When $K$ is small \(e\.g\., 32/64\), the codebook becomes overly coarse and forces heterogeneous failure manifestations to share the same discrete code\. This causes *prototype collision*: distinct symptoms \(e\.g\., saturation\-like vs\. timeout\-like patterns\) are compressed into a single token, blurring causal cues and weakening downstream hypothesis verification\. As a consequence, the symptom lexicon becomes less discriminative and the planner/judge receive ambiguous evidence, especially for faults with overlapping surface symptoms\.
#### IV\-F3 Why too large $K$ hurts
When $K$ is large \(256/512\), the discretization becomes overly fine\-grained, causing subtle noise and incident\-specific variations to be encoded as separate codes while splitting recurring symptoms into many near\-duplicate tokens\. This over\-segmentation leads to three main issues:
- Sparse code usage and unstable symptom vocabulary\. Larger $K$ leaves fewer samples per code, making code\-level signatures noisy and reducing retrieval consistency\.
- Weaker generalization under drift\. Fine\-grained codes are more sensitive to topology and traffic shifts, which harms cross\-incident transfer under OOD conditions\.
- Fragmented evidence for verification\. Similar symptoms may be mapped to multiple tokens, making propagation patterns less coherent and weakening hypothesis confidence\.
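The sparse\-usage effect can be made concrete with a simple diagnostic\. The sketch below is a hypothetical check, not an analysis reported in the paper: it counts how many codes are actually used and computes the perplexity of the code distribution for a fixed set of token assignments\.

```python
import math
from collections import Counter


def code_usage(assignments, K):
    """Summarize how a codebook of size K is used by a list of code indices."""
    counts = Counter(assignments)
    n = len(assignments)
    # Perplexity = exp(entropy); it equals the number of codes when usage is uniform.
    perplexity = math.exp(-sum(c / n * math.log(c / n) for c in counts.values()))
    return {
        "used_fraction": len(counts) / K,
        "mean_samples_per_code": n / K,
        "perplexity": perplexity,
    }


assignments = [0, 0, 1, 1, 2, 2, 3, 3]  # toy data: eight node states, four distinct codes
small = code_usage(assignments, K=4)
large = code_usage(assignments, K=16)
```

With the same eight samples, enlarging the codebook from 4 to 16 entries cuts the average number of samples per code from 2 to 0\.5, which is the regime in which code\-level signatures become noisy\.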
Figure 6: Parameter sensitivity analysis\. K denotes the VQ partition \(codebook size\) parameter\.
### IV\-G RQ5: Case Study
##### Scenario and observation
Fig\. [7](https://arxiv.org/html/2605.15611#S4.F7) presents an incident in a microservice dependency graph where an upstream overload in payment\-service \(pod P0\) propagates to user\-service \(pod U2\) and further triggers timeouts at gateway\-service \(pod G1\)\. A GAT\-based root\-cause localizer \(RCL\) assigns the highest root\-cause score to user\-service \($S(\textbf{U2})=0.51$\), while the true root cause payment\-service/P0 is only ranked second \($S(\textbf{P0})=0.42$\)\. This mismatch is a typical *symptom\-amplification bias*: downstream services may exhibit stronger and more visible symptoms \(e\.g\., retry bursts and timeouts\), misleading symptom\-driven rankers\.
##### Why GAT misranks the root cause
In this incident, the most salient anomaly manifests at user\-service as a bursty retry/timeout pattern, which yields stronger multimodal evidence aggregated at node U2\. Meanwhile, the true cause P0 primarily shows early\-stage resource saturation \(CPU stress\) that may be less visually dominant at the service level\. As a result, a purely score\-based ranking tends to select the node with the largest observed symptom magnitude rather than the node that *causally initiates* the propagation\.
##### TopoEvo: topology evidence $\rightarrow$ hypothesis planning $\rightarrow$ tool\-grounded verification
TopoEvo corrects this failure mode by converting “score ranking” into “topology\-constrained causal adjudication”\. Starting from the GAT candidate list \(Top\-1: U2, Top\-2: P0\), TopoEvo constructs a compact *Hierarchical Evidence Trace* \(HET\) package that explicitly aligns symptoms with topology across levels: *service path* payment $\rightarrow$ user $\rightarrow$ gateway, *pod path* P0 $\rightarrow$ U2 $\rightarrow$ G1, and *token trace* Saturation $\rightarrow$ retry burst $\rightarrow$ Timeout\. These structured traces are summarized into discrete symptom tokens \(e\.g\., Token \#7: saturation/overload; Token \#12: timeout\), which serve as interpretable evidence units for the reasoning layer\.
Given the evidence package, the *Hypothesis Planner* generates explicit, topology\-consistent alternatives: H1: payment\-service/P0 overload $\rightarrow$ timeout propagation to gateway; H2: gateway\-service/G1 timeout is the primary root \(alternative\)\. Unlike direct ranking, hypotheses are required to respect causal direction and propagation reachability in the dependency graph\.
Next, TopoEvo launches tool\-grounded agents to verify each hypothesis using multimodal signals: \(1\) a metric agent checks onset time and threshold crossing \(CPU/latency bands\); \(2\) a log agent checks template IDs and burst scores \(retry bursts at U2\); \(3\) a trace agent verifies abnormal span chains and their directionality along P0 $\rightarrow$ U2 $\rightarrow$ G1\. A judge module then applies a checklist\-style adjudication: *temporal precedence* \(cause precedes effect\), *path consistency* \(evidence aligns with a valid propagation path\), and *template consistency* \(log/trace patterns match the hypothesized fault type, under strict/relaxed matching\)\. In this case, the evidence supports H1: saturation at P0 occurs earlier and lies on the unique propagation path to the downstream timeouts, while H2 fails temporal/path constraints because G1 symptoms are downstream effects without upstream initiating evidence\. Therefore, TopoEvo accepts H1 and rejects H2, yielding the final decision: Root Cause: payment\-service/P0; Fault Type: CPU stress, while eliminating alternatives such as gateway\-service/G1 and order\-service/G2\.
Figure 7: In the case study, we injected a CPU Stress fault into the Payment Service, and the fault propagated along the path P0–U2–G1\.
## V Conclusion
We presented TopoEvo, a topology\-aware self\-evolving multi\-agent framework for microservice RCA under noisy multimodal observability and non\-stationary topology drift\. TopoEvo tightly connects representation learning and structured reasoning: \(1\) *Metric\-Orthogonal Multimodal Alignment* stabilizes cross\-modal fusion by aligning logs/traces to complementary metric subspaces; \(2\) *VQ\-based symptom tokenization* converts dense topology\-aware states into compact, retrievable, and auditable symptom evidence; \(3\) a *Hypothesis–Evidence–Test* multi\-agent workflow performs tool\-grounded verification under explicit topology constraints to mitigate symptom\-amplification misattribution; and \(4\) a *Self\-Evolving Mechanism* maintains robustness via hierarchical memory refresh and conservative pseudo\-label adaptation\.
Experiments on both injected\-fault benchmarks and real production incidents demonstrate that TopoEvo achieves strong and consistent gains in root\-cause localization and fault\-type classification across granularities, and ablation results further verify that each module contributes materially, with HET\-based verification yielding the largest benefit\. In future work, we plan to \(1\) strengthen causal validation beyond topology\-consistent verification \(e\.g\., intervention\-aware checks\), \(2\) design safer and more cost\-aware continual adaptation policies under drift, and \(3\) extend evaluation to more heterogeneous microservice platforms and observability conditions\.
## VI Related Work
Microservice RCA has been widely studied under metrics/logs/traces, ranging from classical survey/benchmark efforts\[[1](https://arxiv.org/html/2605.15611#bib.bib1),[2](https://arxiv.org/html/2605.15611#bib.bib2),[3](https://arxiv.org/html/2605.15611#bib.bib3)\]to learning\-based multimodal localization frameworks that fuse heterogeneous telemetry for fine\-grained diagnosis\[[6](https://arxiv.org/html/2605.15611#bib.bib6),[38](https://arxiv.org/html/2605.15611#bib.bib38),[37](https://arxiv.org/html/2605.15611#bib.bib37),[39](https://arxiv.org/html/2605.15611#bib.bib39)\]\. Another line models propagation and causality via event/causal graphs or dynamic causal inference to improve robustness under cascading failures and limited observability\[[4](https://arxiv.org/html/2605.15611#bib.bib4),[22](https://arxiv.org/html/2605.15611#bib.bib22),[5](https://arxiv.org/html/2605.15611#bib.bib5),[28](https://arxiv.org/html/2605.15611#bib.bib28),[29](https://arxiv.org/html/2605.15611#bib.bib29)\]\. Recently, LLM/agent\-based RCA has emerged, leveraging tool use and iterative evidence gathering for explainable diagnosis\[[9](https://arxiv.org/html/2605.15611#bib.bib9),[10](https://arxiv.org/html/2605.15611#bib.bib10),[11](https://arxiv.org/html/2605.15611#bib.bib11),[34](https://arxiv.org/html/2605.15611#bib.bib34)\]\. Compared to purely learned localizers, LLM agents offer stronger semantic abstraction over logs and the ability to follow SOP\-like workflows, but they can be brittle under noisy tool outputs and may drift into ungrounded narratives without explicit structural constraints\[[9](https://arxiv.org/html/2605.15611#bib.bib9),[10](https://arxiv.org/html/2605.15611#bib.bib10)\]\. This motivates topology\- and evidence\-constrained reasoning that couples graph\-based localization with structured prompts and verification, mitigating symptom\-amplification and improving reliability in real deployments\.
## References
- \[1\]Soldani, J\., Brogi, A\.: Anomaly Detection and Failure Root Cause Analysis in \(Micro\)Service\-Based Cloud Applications: A Survey\. Journal of Systems and Software178, 110964 \(2021\)
- \[2\]Wang, T\., Qi, G\.: A Comprehensive Survey on Root Cause Analysis in \(Micro\) Services: Methodologies, Challenges, and Trends\. arXiv preprint arXiv:2408\.00803 \(2024\)
- \[3\]L\. Pham, H\. Zhang, H\. Ha, F\. Salim, and X\. Zhang, “RCAEval: A Benchmark for Root Cause Analysis of Microservice Systems with Telemetry Data,” in*Companion Proceedings of the ACM Web Conference 2025 \(WWW Companion\)*, pp\. 777–780, 2025\.
- \[4\]Z\. Yao, C\. Pei, X\. Nie, et al\., “Chain\-of\-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph,” in*Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering \(FSE\)*, 2024\.
- \[5\]Pan, X\., Yu, Y\., Zhang, H\.: DyCause: Dynamic Causal Inference for Root Cause Analysis\. In: Proceedings of the 45th International Conference on Software Engineering \(ICSE\), pp\. 435–446 \(2023\)
- \[6\]Wu, G\., Liu, D\., Zhang, H\.: MMRCA: Multimodal Root Cause Analysis via Fusion of Logs, Metrics and Traces\. In: IEEE International Conference on Services Computing \(SCC\), pp\. 110–120 \(2021\)
- \[7\]Li, X\., Zhang, H\., Pham, L\., et al\.: MRCA: Metric\-Level Root Cause Analysis for Microservices via Multi\-Modal Data\. In: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering \(ASE\), pp\. 1115–1127 \(2024\)
- \[8\]Lou, J\., Yu, H\., Zhang, M\., et al\.: Unimodal is Not Enough: A Benchmark and Unified Framework for Multimodal Anomaly Detection in Logs, Metrics, and Traces\. In: Proceedings of the ACM Web Conference \(WWW\), pp\. 2841–2852 \(2023\)
- \[9\]Z\. Wang, Z\. Liu, Y\. Zhao, et al\., “RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool\-Augmented Large Language Models,” in*Proceedings of the 33rd ACM International Conference on Information and Knowledge Management \(CIKM\)*, 2024\.
- \[10\]C\. Pei, Z\. Wang, F\. Liu, Z\. Li, Y\. Liu, X\. He, R\. Kang, T\. Zhang, J\. Chen, J\. Li, G\. Xie, and D\. Pei, “Flow\-of\-Action: SOP Enhanced LLM\-Based Multi\-Agent System for Root Cause Analysis,” in*Companion Proceedings of the ACM Web Conference 2025 \(WWW Companion\)*, pp\. 422–431, 2025\.
- \[11\]W\. Zhang, H\. Guo, J\. Yang, Z\. Tian, Y\. Zhang, C\. Yan, Z\. Li, T\. Li, X\. Shi, L\. Zheng, and B\. Zhang, “mABC: Multi\-Agent Blockchain\-inspired Collaboration for Root Cause Analysis in Micro\-Services Architecture,” in*Findings of the Association for Computational Linguistics: EMNLP 2024*, pp\. 4017–4033, 2024\.
- \[12\]P\. Tang, S\. Tang, H\. Pu, Z\. Miao, and Z\. Wang, “MicroRCA\-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents,”*arXiv preprint arXiv:2509\.15635*, 2025\.
- \[13\]L\. Zheng, Z\. Chen, and H\. Chen, “Online Multi\-modal Root Cause Identification in Microservice Systems,” in*2025 IEEE International Conference on Big Data \(BigData\)*, pp\. 1–8, 2025\.
- \[14\]Phan, X\., Li, M\., Zhang, Q\.: MicroCERCL: Root Cause Localization in Cloud\-Edge Microservice Environments\. arXiv preprint arXiv:2406\.13604 \(2024\)
- \[15\]Li, Z\., Chen, J\., Jiao, R\., et al\.: Practical Root Cause Localization for Microservice Systems via Trace Analysis\. In: IEEE International Conference on Cloud Computing \(CLOUD\), pp\. 25–34 \(2021\)
- \[16\]Jha, S\., et al\.: Localizing and Explaining Faults in Microservices Using Distributed Traces and Logs\. In: IEEE International Conference on Cloud Computing \(CLOUD\), pp\. 53–63 \(2022\)
- \[17\]Li, Z\., Zhang, H\., Wang, C\.: MicroIRC: Instance\-Level Root Cause Localization for Microservice Systems\. Journal of Systems and Software206, 111520 \(2023\)
- \[18\]Chillarege, R\., Bhandari, I\., Chaar, J\., et al\.: Orthogonal Defect Classification—A Concept for In\-Process Measurements\. IEEE Trans\. Softw\. Eng\.18\(11\), 943–956 \(1992\)
- \[19\]Nguyen, H\.A\., Tan, J\., Gu, X\., et al\.: PAL: Performance Anomaly Localization for Cloud Systems\. In: Proceedings of the 17th ACM SIGKDD, pp\. 1230–1238 \(2011\)
- \[20\]Jia, Z\., He, J\., Xu, X\., et al\.: Fault Localization for Microservice Systems via Graph Neural Network Learning\. In: IEEE International Conference on Services Computing \(SCC\), pp\. 218–225 \(2020\)
- \[21\]Yang, Y\., Zhou, M\., Yang, S\., et al\.: Unsupervised Root Cause Analysis for Microservice Systems via Spatio\-Temporal Graph Modeling\. In: IEEE/ACM International Conference on Software Engineering \(ICSE\), pp\. 156–166 \(2020\)
- \[22\]Guo, Y\., Wang, J\., Li, Y\., et al\.: CARR: A Causal\-Aware Neural Approach for Root Cause Localization in Cloud Systems\. In: Proceedings of the 28th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining \(KDD\), pp\. 3922–3930 \(2022\)
- \[23\]Lin, Y\., Fang, Y\., Zhang, H\., et al\.: Causal Tracing for Root Cause Localization in Microservices\. In: IEEE International Symposium on Software Reliability Engineering \(ISSRE\), pp\. 33–44 \(2022\)
- \[24\]Wang, H\., Wu, Z\., Jiang, H\., et al\.: Groot: An Event\-Graph\-Based Approach for Root Cause Analysis in Industrial Settings\. arXiv preprint arXiv:2108\.00344 \(2021\)
- \[25\]A\. Ikram, S\. Chakraborty, S\. Mitra, S\. K\. Saini, S\. Bagchi, and M\. Kocaoglu, “Root Cause Analysis of Failures in Microservices through Causal Discovery,” in*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2022\.
- \[26\]Chakraborty, S\., et al\.: Root Cause Analysis of Failures in Microservices through Causal Structure Discovery\. In: Advances in Neural Information Processing Systems \(NeurIPS\) \(2022\)
- \[27\]Zhou, Y\., Chen, T\., Liu, Z\., et al\.: LatentScope: Unsupervised Root Cause Analysis with Limited Observability\. In: ACM International Conference on Measurement and Modeling of Computer Systems \(SIGMETRICS\), pp\. 1–13 \(2024\)
- \[28\]Z\. Xie, S\. Zhang, Y\. Geng, X\. Nie, Z\. Yao, L\. Xu, Y\. Sun, W\. Li, and D\. Pei, “Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space,” in*Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining \(KDD\)*, 2024\.
- \[29\]L\. Pham, H\. Ha, and H\. Zhang, “BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection,”*Proceedings of the ACM on Software Engineering*, vol\. 1, no\. FSE, pp\. 2214–2237, 2024\.
- \[30\]Neumann, A\.: Causality Based Instant Root Cause Analysis for Microservices Failure\. In: NCTA Conference Proceedings, pp\. 140–152 \(2024\)
- \[31\]Pham, L\., Ha, H\., Zhang, H\.: Root Cause Analysis for Microservice Systems Based on Causal Inference: How Far Are We? arXiv preprint arXiv:2408\.13729 \(2024\)
- \[32\]DoWhy Contributors: Root Cause Analysis of Latencies in a Microservice Architecture\. DoWhy Example Notebook \(2024\),[https://www\.pywhy\.org](https://www.pywhy.org/)
- \[33\]Zhang, H\., Pham, L\., Liu, Y\.: Intelligent Root Cause Localization in MicroService Systems\. In: ACM Symposium on Cloud Computing \(SoCC\) \(2025\)
- \[34\]L\. Wang, C\. Zhang, R\. Ding, Y\. Xu, Q\. Chen, W\. Zou, Q\. Chen, M\. Zhang, X\.\-C\. Gao, H\. Fan, S\. Rajmohan, Q\. Lin, and D\. Zhang, “Root Cause Analysis for Microservice Systems via Hierarchical Reinforcement Learning from Human Feedback,” in*Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining \(KDD\)*, pp\. 5116–5125, 2023\.
- \[35\]L\. Zhang, T\. Jia, K\. Wang, W\. Hong, C\. Duan, M\. He, and Y\. Li, “Adaptive Root Cause Localization for Microservice Systems with Multi\-Agent Recursion\-of\-Thought,”*arXiv preprint arXiv:2508\.20370*, 2025\.
- \[36\]L\. Zhang, T\. Jia, Y\. Zhai, L\. Pan, C\. Duan, M\. He, M\. Jia, and Y\. Li, “Agentic Memory Enhanced Recursive Reasoning for Root Cause Localization in Microservices,”*arXiv preprint arXiv:2601\.02732*, 2026\.
- \[37\]G\. Yu, P\. Chen, Y\. Li, et al\., “Nezha: Interpretable Fine\-Grained Root Causes Analysis for Microservices on Multi\-modal Observability Data,” in*Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering \(ESEC/FSE\)*, 2023\.
- \[38\]C\. Lee, T\. Yang, Z\. Chen, Y\. Su, and M\. R\. Lyu, “Eadro: An End\-to\-End Troubleshooting Framework for Microservices on Multi\-source Data,” in*Proceedings of the 45th IEEE/ACM International Conference on Software Engineering \(ICSE\)*, pp\. 1750–1762, 2023\.
- \[39\]Y\. Han, Q\. Du, Y\. Huang, P\. Li, X\. Shi, J\. Wu, P\. Fang, F\. Tian, and C\. He, “Holistic Root Cause Analysis for Failures in Cloud\-Native Systems Through Observability Data,”*IEEE Transactions on Services Computing*, vol\. 17, no\. 6, pp\. 3789–3802, 2024\.
- \[40\]Zhang X, Wang Q, Li M, et al\. TAMO: Fine\-Grained Root Cause Analysis via Tool\-Assisted LLM Agent with Multi\-Modality Observation Data in Cloud\-Native Systems\[J\]\. IEEE Transactions on Services Computing, 2025, 18\(6\): 4221\-4233\.