STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices
Summary
STAR is a stage-attributed triage and repair framework that decomposes LLM-based RCA agent workflows into four structured stages, enabling stage-wise auditing, counterfactual evaluation, and patch-and-replay repair to improve root cause localization and fault type classification in microservice AIOps.
# A Stage-attributed Triage and Repair framework for RCA Agents in Microservices
Source: [https://arxiv.org/html/2605.15581](https://arxiv.org/html/2605.15581)
###### Abstract
LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair.
We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy, repairs most initially incorrect traces within one or two replay rounds, and benefits substantially from both Fast/Slow Routing and counterfactual stage evaluation. These results suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.
## I Introduction
Microservice architectures have become the dominant paradigm for large\-scale cloud applications due to their scalability, flexibility, and support for independent deployment\. However, the same decentralization that enables rapid iteration also makes reliability engineering substantially more challenging\. A single fault can propagate across services, pods, or nodes, while its observable symptoms often appear far from the true origin\. As a result, root cause analysis \(RCA\) in microservices is both operationally important and intrinsically difficult\[[1](https://arxiv.org/html/2605.15581#bib.bib1),[2](https://arxiv.org/html/2605.15581#bib.bib2),[3](https://arxiv.org/html/2605.15581#bib.bib3),[5](https://arxiv.org/html/2605.15581#bib.bib5),[7](https://arxiv.org/html/2605.15581#bib.bib7)\]\.
Recent advances in large language models (LLMs) have inspired a new class of LLM-based RCA agents that reason over multimodal observability signals (metrics, logs, and traces) to infer root causes and generate diagnostic explanations. Compared with conventional correlation-based or graph-based RCA pipelines, these agents are more adaptable to open-ended environments and more capable of synthesizing heterogeneous evidence [[26](https://arxiv.org/html/2605.15581#bib.bib26),[28](https://arxiv.org/html/2605.15581#bib.bib28),[10](https://arxiv.org/html/2605.15581#bib.bib10),[13](https://arxiv.org/html/2605.15581#bib.bib13),[14](https://arxiv.org/html/2605.15581#bib.bib14)]. However, their practical utility remains limited by the fragility of the reasoning process itself. In RCA, even a minor error in evidence scoping, hypothesis formulation, or causal interpretation can propagate through later reasoning steps and ultimately lead to an incorrect diagnosis.
This problem is particularly acute in microservice RCA because the task is inherently structured\. Correct diagnosis depends not only on textual plausibility, but also on telemetry consistency, causal reachability, temporal ordering, and deployment topology\. Consequently, debugging an RCA agent solely from raw free\-form reasoning traces is often unreliable and inefficient: such traces are noisy, decisive errors are difficult to isolate, and regenerating long trajectories is costly\. More importantly, correcting an isolated reasoning step often fails to address the actual failure source, which typically lies in a higher\-level workflow artifact, such as incomplete evidence, biased hypotheses, infeasible causal chains, or unstable final decisions\.
These observations motivate a process-centric question: instead of only asking which service is faulty, can we also determine which stage of the RCA workflow is faulty, and repair the diagnosis by replaying only the affected downstream stages? To answer this question, we propose STAR (Stage-attributed Triage And Repair), a debugging and repair layer for LLM-based RCA agents. STAR explicitly decomposes an RCA trace into four structured artifacts: Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR). Rather than treating agent failure as a black-box end-to-end error, STAR models it as a stage-localizable reasoning bug. It first audits the RCA trace, then identifies the decisive faulty stage, patches the corresponding artifact, and finally replays only the downstream reasoning to eliminate error propagation.
To make this repair process practical, STAR incorporates three key mechanisms. First, it performs stage-wise audit and diagnosis, transforming vague failure signals into explicit stage-level inconsistency evidence. Second, it adopts Fast/Slow Routing to balance correction cost and effectiveness: lightweight local repair is applied to near-miss traces, while replay-based localization is reserved for strongly contaminated cases. Third, STAR introduces decisive stage localization via counterfactual candidate evaluation, in which candidate stage patches are assessed by replaying downstream stages and examining whether the repaired trace improves. This allows STAR to identify not merely a suspicious stage, but the earliest stage whose correction can restore RCA consistency. Built on LangGraph, STAR further leverages node-level replay and structured state artifacts to enable controllable implementation and systematic repair analysis.
Experiments on both a public AIOps benchmark and a real\-world production dataset show that STAR consistently improves end\-to\-end RCA performance across two agent workflows and three foundation models\. In particular, STAR substantially improves root cause localization and fault type classification over the original workflows, identifies the decisive faulty stage with high precision, and repairs most initially incorrect traces within one or two replay rounds\. Additional ablation studies further show that both Fast/Slow Routing and counterfactual decisive\-stage localization contribute significantly to repair efficiency and final diagnostic accuracy\.
In summary, this paper makes the following contributions:
- We present STAR, a stage-attributed debugging and replay framework for LLM-based RCA agents, which decomposes RCA into four structured stages and supports stage-wise audit, decisive stage localization, patching, and downstream replay.
- We design two key mechanisms for effective repair: Fast/Slow Routing for budget-aware correction and counterfactual candidate evaluation for decisive stage localization.
- Extensive experiments on public and real-world datasets show that STAR consistently improves diagnosis quality, stage attribution accuracy, and repair efficiency across datasets, workflows, and backbone models.
## II Preliminaries and Motivation
### II-A Microservice Root Cause Analysis
Modern cloud-native applications are increasingly built on microservice architectures, where functionality is decomposed into independently deployable services connected through complex runtime dependencies. While this design improves scalability and agility, it also makes reliability management substantially more difficult: faults can propagate across services, and observed symptoms often appear far from their true origin. As a result, root cause analysis (RCA) in microservices aims not only to identify the most likely faulty entity (e.g., host, pod, or service), but also to explain how the failure propagates through the system.
In practice, effective RCA relies on multi\-modal observability signals—metrics, logs, and traces—together with dependency information that captures service interactions and deployment structure\. Prior studies have shown that robust fault localization in microservices critically depends on jointly reasoning over telemetry and topology, especially under noisy, incomplete, and dynamically changing environments\[[8](https://arxiv.org/html/2605.15581#bib.bib8),[19](https://arxiv.org/html/2605.15581#bib.bib19),[20](https://arxiv.org/html/2605.15581#bib.bib20),[21](https://arxiv.org/html/2605.15581#bib.bib21),[9](https://arxiv.org/html/2605.15581#bib.bib9),[6](https://arxiv.org/html/2605.15581#bib.bib6),[22](https://arxiv.org/html/2605.15581#bib.bib22),[23](https://arxiv.org/html/2605.15581#bib.bib23)\]\. This also makes RCA fundamentally different from generic reasoning tasks: correctness is constrained by evidence, time, and system structure rather than by textual plausibility alone\.
### II-B Motivation
#### II-B1 Motivation 1
Reasoning Failures in RCA Agents Directly Degrade RCA Accuracy\. These characteristics make LLM\-based RCA agents particularly vulnerable to reasoning failures\. Errors in evidence scoping \(e\.g\., missing anomaly onset or focusing only on downstream victims\), premature hypothesis anchoring, infeasible causal paths, or overconfident final decisions can directly alter the ranked root\-cause candidates and therefore degrade RCA accuracy\. Unlike general text\-generation tasks, where an imperfect intermediate thought may still lead to the correct answer, RCA requires consistent intermediate commitments to telemetry, temporal order, and topology; even a small reasoning defect can therefore propagate into a large end\-to\-end error\.
#### II-B2 Motivation 2
Fine-Grained Step-by-Step Failure Localization Is Unreliable and Inefficient for RCA Agents. A natural response is to debug the agent at the level of individual reasoning steps. However, step-by-step localization is often both unreliable and inefficient in RCA settings. Fine-grained traces are typically noisy, with many redundant or weakly causal steps, making it difficult to determine which step is truly decisive. More importantly, correcting an isolated step rarely fixes the underlying RCA failure, which usually lies in higher-level workflow artifacts such as evidence scope, hypothesis coverage, causal feasibility, or decision calibration. Repeatedly inspecting and regenerating long reasoning traces also introduces substantial LLM and tool-call overhead. These limitations motivate a more practical alternative: stage-level localization and replay-based repair, which targets the decisive faulty stage in the RCA workflow and corrects downstream reasoning through structured patching and replay.
## III Problem Statement
Following Sec. [II](https://arxiv.org/html/2605.15581#S2), we formulate microservice RCA as a multi-stage reasoning process over multi-modal observability and system topology. Instead of treating the RCA agent as a black-box predictor, we represent its execution as a structured program trace:

$$\mathcal{A} = (\mathrm{EP}, \mathrm{HS}, \mathrm{AS}, \mathrm{DR}), \tag{1}$$

where each stage corresponds to a distinct intermediate artifact in the RCA workflow.
##### Evidence Package \(EP\)
$\mathrm{EP}$ defines the incident time window, the entity scope under analysis (including host/service/pod mappings), and an indexed set of evidence items extracted from observability signals. Each evidence item is associated with an identifier, modality, target entity or edge, and a compact summary.
##### Hypothesis Set \(HS\)
$\mathrm{HS}$ consists of a set of candidate explanations $\{h_i\}$ for the incident. Each hypothesis is explicitly grounded in the relevant entities and supporting evidence identifiers from EP, preventing unsupported or arbitrarily chosen evidence from implicitly driving the diagnosis.
##### Analysis Structure \(AS\)
$\mathrm{AS}$ captures the agent's causal reasoning as a set of propagation paths $\{p_j\}$. Each path is represented as a topology-consistent walk or subgraph over the system graph, together with textual justification and supporting evidence identifiers. This formulation makes causal reasoning verifiable in terms of reachability, temporal consistency, and evidential support.
##### Decision Report \(DR\)
$\mathrm{DR}$ outputs a ranked list of root-cause candidates with confidence scores, along with minimal verification tests or recommended actions. When uncertainty remains high, DR may favor a verification-first conclusion over an overconfident localization.
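To make the four artifacts concrete, the following is a minimal sketch of how they might be represented as typed records, together with one grounding check on HS. All class names, field names, and the helper `ungrounded_hypotheses` are illustrative assumptions; the paper does not specify a schema.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceItem:
    evidence_id: str
    modality: str   # assumed values: "metric", "log", or "trace"
    target: str     # entity or edge the evidence refers to
    summary: str


@dataclass
class EvidencePackage:
    window: tuple[float, float]          # incident start/end, e.g. epoch seconds
    entities: list[str]                  # entity scope under analysis
    items: list[EvidenceItem] = field(default_factory=list)


@dataclass
class Hypothesis:
    hypothesis_id: str
    entities: list[str]
    evidence_ids: list[str]              # must refer to items in the EP


@dataclass
class PropagationPath:
    nodes: list[str]                     # walk over the system graph
    justification: str
    evidence_ids: list[str]


@dataclass
class Candidate:
    entity: str
    confidence: float


@dataclass
class RCATrace:
    ep: EvidencePackage
    hs: list[Hypothesis]
    analysis: list[PropagationPath]      # the AS artifact ("as" is a Python keyword)
    dr: list[Candidate]                  # ranked root-cause candidates


def ungrounded_hypotheses(trace: RCATrace) -> list[str]:
    """Return ids of hypotheses that cite no evidence or cite ids missing from EP."""
    known = {item.evidence_id for item in trace.ep.items}
    return [
        h.hypothesis_id
        for h in trace.hs
        if not h.evidence_ids or not set(h.evidence_ids) <= known
    ]
```

Keeping evidence identifiers explicit in HS and AS is what allows grounding to be checked mechanically rather than inferred from free-form text.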
We formulate agent self-repair as stage-attributed correction with replay. Let

$$s \in \{S_1, S_2, S_3, S_4\} \tag{2}$$

denote the stage index corresponding to EP, HS, AS, and DR, respectively. A stage patch operator generates a corrected artifact at stage $s$:

$$\mathcal{O}'(s) = \mathcal{P}_s(\mathcal{A}, O, G), \tag{3}$$

where $O$ denotes the incident observability and $G$ denotes the system topology. Given a patched artifact, we define a deterministic replay operator that re-executes all downstream stages:

$$\mathrm{Replay}(\mathcal{A}, s), \tag{4}$$

which replaces the stage-$s$ artifact with its patched version and reruns all subsequent stages through $S_4$.
The central challenge is to identify the decisive faulty stage. We define $s^{*}$ as the earliest stage such that patching its artifact and replaying all downstream stages yields a substantial improvement in trajectory reliability and/or final RCA correctness, while patching only later stages cannot consistently achieve the same effect. This definition captures the stage-wise contamination property of RCA agents: once an upstream artifact is flawed, downstream hypotheses, analyses, and decisions may remain systematically biased unless replay is initiated from the corrected upstream stage.
Accordingly, given an initial RCA trace $\mathcal{A}$, our objective is to:
1. determine whether the trace is unreliable,
2. identify the decisive faulty stage $s^{*}$,
3. patch only the artifact at stage $s^{*}$, and
4. replay the downstream stages to obtain a repaired decision report.
## IV Methodology
Figure 1: Overview of our proposed framework STAR. STAR consists of five tightly coupled components: stage-wise audit, fast/slow routing, decisive stage localization, patch-and-replay repair, and self-evolving repair memory.

As shown in Fig. [1](https://arxiv.org/html/2605.15581#S4.F1), we propose STAR as a process-centric reliability layer for microservice RCA agents. Building on the stage-structured formulation in Sec. [III](https://arxiv.org/html/2605.15581#S3), STAR does not re-solve RCA from scratch; instead, it audits the RCA trace, identifies the decisive faulty stage, repairs the corresponding stage artifact, and replays only the downstream stages to remove error contamination. This design is guided by three RCA-specific requirements: intermediate reasoning must remain grounded in observability signals, causal explanations must be consistent with service topology and temporal order, and the final diagnosis must remain operationally actionable.
### IV-A Stage-wise Audit and Diagnosis
Given an RCA trace $\mathcal{A} = (\mathrm{EP}, \mathrm{HS}, \mathrm{AS}, \mathrm{DR})$, STAR first performs an RCA-oriented audit to determine both whether the trace is unreliable and where the inconsistency first emerges. The audit outputs a global reliability score $S$ together with a set of stage diagnostics

$$\mathbf{d} = \{d_{S1}, d_{S2}, d_{S3}, d_{S4}\}, \tag{5}$$

where each $d_{Si}$ records violated constraints, severity, and blame evidence for stage $S_i$.
Rather than assigning confidence in a generic way, the audit is tied to the structural commitments of RCA. For the evidence package, STAR checks whether the selected time window covers anomaly onset, whether modality coverage matches the incident type, and whether the entity scope includes plausible upstream/downstream neighborhoods rather than only downstream victims. For the hypothesis set, STAR verifies that each hypothesis is grounded in EP evidence, that the search space is not prematurely collapsed into a single anchored explanation, and that host/service/pod interactions are considered when implied by the evidence. For the analysis structure, STAR checks whether each causal path is reachable in $G$, whether anomaly onsets respect cause-before-effect, and whether intermediate links are supported by telemetry. For the decision report, STAR examines whether confidence is calibrated to evidence sufficiency, whether the ranking is consistent with the preceding analysis, and whether recommended tests are discriminative and mechanism-consistent.
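Two of the analysis-structure checks above (reachability in $G$ and cause-before-effect ordering) can be expressed as simple predicates. The sketch below is illustrative only: it assumes that $G$ is given as a set of directed edges oriented in the direction of fault propagation and that anomaly onsets are available as a per-entity timestamp map. Neither representation, nor the function names, comes from the paper.

```python
def path_is_reachable(path: list[str], edges: set[tuple[str, str]]) -> bool:
    """True if every consecutive pair of nodes on the path is a directed edge in G."""
    return all((src, dst) in edges for src, dst in zip(path, path[1:]))


def respects_temporal_order(path: list[str], onset: dict[str, float]) -> bool:
    """True if anomaly onsets are non-decreasing along the path.

    A node without a recorded onset fails the check, since cause-before-effect
    cannot be verified for it (an assumption of this sketch).
    """
    if any(node not in onset for node in path):
        return False
    times = [onset[node] for node in path]
    return all(earlier <= later for earlier, later in zip(times, times[1:]))


# Hypothetical topology: a node fault propagates to a pod and then to a service.
edges = {("node-1", "cart-pod-0"), ("cart-pod-0", "checkout")}
onset = {"node-1": 100.0, "cart-pod-0": 105.0, "checkout": 112.0}
path = ["node-1", "cart-pod-0", "checkout"]
print(path_is_reachable(path, edges), respects_temporal_order(path, onset))
```

A path that fails either predicate would contribute a violated constraint to the stage diagnostic for AS.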
These local checks are aggregated into a global score

$$S = \sum_{k} w_k \, s_k, \qquad \sum_{k} w_k = 1. \tag{6}$$

Here, $s_k \in [0,1]$ denotes the normalized score of the $k$-th audit criterion, and $w_k$ is its importance weight. A higher $S$ indicates that the RCA trace is more self-consistent and better satisfies the RCA-specific requirements. In this way, STAR transforms a vague signal that "the agent is wrong" into a concrete diagnosis of which stage violates RCA requirements and why.
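The aggregation in Eq. (6) is a convex combination of per-criterion scores. A minimal sketch follows; the criterion names and weights are hypothetical, and normalizing raw weights so that they sum to one is a convenience of this sketch rather than something the paper prescribes.

```python
def audit_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted audit score S = sum_k w_k * s_k with weights normalized to sum to 1."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    if any(not 0.0 <= s <= 1.0 for s in scores.values()):
        raise ValueError("each criterion score must lie in [0, 1]")
    total = sum(weights.values())
    if total <= 0 or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative with a positive sum")
    return sum(weights[k] / total * scores[k] for k in scores)


# Hypothetical criteria: onset coverage (EP) and path reachability (AS).
S = audit_score(
    {"ep_onset_covered": 1.0, "as_path_reachable": 0.5},
    {"ep_onset_covered": 1.0, "as_path_reachable": 1.0},
)
print(S)
```

Because every $s_k$ lies in $[0,1]$ and the normalized weights sum to one, $S$ also lies in $[0,1]$, which is what makes a fixed routing threshold meaningful.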
### IV-B Fast/Slow Routing for Budget-Aware Repair
Not all faulty traces require the same level of intervention. In practice, some are near-miss cases, where the overall trajectory remains largely valid and only a local inconsistency appears in a downstream artifact, while others reflect systemic breakdowns, where upstream errors in evidence scoping or hypothesis construction have already contaminated the entire downstream reasoning chain. Applying the same repair procedure to both cases would be inefficient: full replay-based debugging is unnecessarily expensive for near-miss traces, while local patching is ineffective once upstream contamination has occurred.
To balance repair effectiveness and cost, STAR introduces a Fast/Slow routing mechanism based on the audit score:

$$\mathrm{Routing} = \begin{cases} \text{Pass}, & \text{if } S \geq \tau, \\ \text{Fast Path}, & \text{if } \tau - \epsilon \leq S < \tau, \\ \text{Slow Path}, & \text{if } S < \tau - \epsilon. \end{cases} \tag{7}$$

If $S \geq \tau$, the trace is considered sufficiently reliable and the original decision is accepted. If $S$ falls into the intermediate region $[\tau - \epsilon, \tau)$, STAR activates the Fast Path. This regime is designed for traces that are mostly correct but contain a local defect. STAR directly selects the stage with the highest diagnostic severity, applies a single constrained patch, and replays only the necessary downstream stages. Fast Path therefore serves as a lightweight repair mode for small but operationally important inconsistencies, such as confidence miscalibration in DR or minor infeasibility in AS.
When $S < \tau - \epsilon$, STAR enters the Slow Path, which is reserved for traces with strong evidence of upstream contamination. Such traces typically exhibit hard RCA violations, including missing anomaly onset in EP, insufficient modality coverage, severe hypothesis anchoring, or widespread topology/temporal conflicts in AS. In this setting, local patching is unlikely to be sufficient, because the downstream artifacts have already been generated from a corrupted upstream state. STAR therefore switches to replay-validated stage localization.
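The three-way routing rule in Eq. (7) reduces to two threshold comparisons. The sketch below is a direct transcription; the function name, the string labels, and the example values of $\tau$ and $\epsilon$ are illustrative assumptions, since this section does not report the thresholds used.

```python
def route(score: float, tau: float, eps: float) -> str:
    """Map an audit score S to a repair regime following Eq. (7)."""
    if eps < 0:
        raise ValueError("eps must be non-negative")
    if score >= tau:
        return "pass"   # trace accepted as-is
    if score >= tau - eps:
        return "fast"   # near-miss: single constrained patch
    return "slow"       # strong contamination: replay-validated localization


# Hypothetical thresholds chosen only for illustration.
TAU, EPS = 0.75, 0.25
for s in (0.9, 0.6, 0.3):
    print(s, route(s, TAU, EPS))
```

A wider band ($\epsilon$ larger) sends more traces to the cheap Fast Path, at the risk of applying local patches to traces whose upstream artifacts are already contaminated.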
### IV-C Decisive Stage Localization via Counterfactual Candidate Evaluation
The central challenge in STAR is to identify the decisive faulty stage, i.e., the earliest stage whose correction can repair the trace after downstream replay. This design directly follows the stage-dependency contamination discussed in Sec. [III](https://arxiv.org/html/2605.15581#S3): if an upstream artifact is already flawed, later stages may remain coherently wrong and cannot be fixed reliably in isolation.
Figure 2: Main prompt of stage critic $\mathcal{C}_s$.

For each stage $s$, STAR invokes a stage critic $\mathcal{C}_s$ (main prompt is shown in Fig. [2](https://arxiv.org/html/2605.15581#S4.F2)) to propose $K$ structured patch candidates:

$$\{\mathcal{P}_s^{(k)}\}_{k=1}^{K} = \mathcal{C}_s(\mathcal{A}, d_s; O, G). \tag{8}$$

Here, patching is only used as a counterfactual probe for stage attribution; the selected patch is not committed until the decisive stage is identified. Each candidate is then evaluated by applying patch and replay, followed by re-audit:

$$\Delta S(s,k) = S\!\left(\mathrm{Replay}(\mathcal{A}, s, \mathcal{P}_s^{(k)})\right) - S(\mathcal{A}). \tag{9}$$

STAR searches stages in causal order and selects the earliest stage that yields a significant improvement:

$$s^{*} = \min\left\{ s \;\middle|\; \max_{k} \Delta S(s,k) \geq \delta \right\}, \tag{10}$$

where $\delta$ is a minimum improvement margin. This replay-validated criterion makes stage attribution both testable and actionable: STAR does not merely identify a suspicious stage, but the earliest stage whose correction can actually restore trace consistency.
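The search defined by Eqs. (9) and (10) can be sketched as a loop over stages in causal order that stops at the first stage whose best candidate clears the margin $\delta$. The function below is an assumption-laden illustration: `replay_and_audit` stands in for "apply patch, replay downstream, re-audit" and is supplied by the caller, and the toy scores are invented for the example.

```python
from typing import Callable, Hashable, Optional


def localize_decisive_stage(
    base_score: float,
    candidates: dict[int, list[Hashable]],
    replay_and_audit: Callable[[int, Hashable], float],
    delta: float,
) -> Optional[tuple[int, Hashable, float]]:
    """Return (stage, best_patch, gain) for the earliest stage with max_k gain >= delta.

    `candidates` maps a stage index (1..4) to its patch candidates;
    `replay_and_audit(stage, patch)` returns the audit score of the replayed trace.
    Returns None if no stage reaches the improvement margin.
    """
    for stage in sorted(candidates):                 # causal order: S1, S2, S3, S4
        best_patch, best_gain = None, float("-inf")
        for patch in candidates[stage]:
            gain = replay_and_audit(stage, patch) - base_score   # Eq. (9)
            if gain > best_gain:
                best_patch, best_gain = patch, gain
        if best_patch is not None and best_gain >= delta:        # Eq. (10)
            return stage, best_patch, best_gain
    return None


# Toy post-replay audit scores keyed by (stage, patch id); purely illustrative.
toy_scores = {(1, "widen-window"): 0.45, (2, "add-counter-hyp"): 0.9,
              (2, "drop-unsupported"): 0.7, (3, "prune-edge"): 0.95}
result = localize_decisive_stage(
    base_score=0.4,
    candidates={1: ["widen-window"], 2: ["add-counter-hyp", "drop-unsupported"], 3: ["prune-edge"]},
    replay_and_audit=lambda stage, patch: toy_scores[(stage, patch)],
    delta=0.2,
)
print(result)
```

In the toy data, stage 3 has the largest raw gain, yet stage 2 is returned because it is the earliest stage that clears $\delta$; this mirrors the "earliest, not best" semantics of Eq. (10). Stopping early also avoids spending replay budget on later stages.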
To avoid wasting repair budget on downstream reasoning when the evidence itself is insufficient, STAR further introduces an exception-driven rollback rule. If the analysis stage cannot produce any topology-feasible and telemetry-supported causal chain under hard constraints, STAR triggers an InsufficientEvidenceException and forces rollback to S1, thereby re-entering evidence recollection rather than overfitting a flawed downstream explanation.
### IV-D RCA-Specific Patch-and-Replay Repair
TABLE I: Summary of STAR patch operators across different RCA stages. Each failure pattern is paired with its corresponding repair action, followed by replay of downstream stages to remove contamination.

| Stage | Artifact | Typical failure pattern | Patch operation | Replay scope |
| --- | --- | --- | --- | --- |
| S1 | Evidence Package (EP) | Missed anomaly onset | Shift or expand the incident window | Replay S2–S4 |
| S1 | Evidence Package (EP) | Missing modality evidence | Re-query the missing modality | Replay S2–S4 |
| S1 | Evidence Package (EP) | Victim-only scope | Expand to upstream/downstream neighbors | Replay S2–S4 |
| S1 | Evidence Package (EP) | Misaligned time windows | Realign telemetry timestamps | Replay S2–S4 |
| S2 | Hypothesis Set (HS) | Unsupported hypotheses | Remove unsupported hypotheses | Replay S3–S4 |
| S2 | Hypothesis Set (HS) | Anchoring bias | Introduce alternative candidates | Replay S3–S4 |
| S2 | Hypothesis Set (HS) | Missing counter-hypotheses | Add counter-hypotheses | Replay S3–S4 |
| S2 | Hypothesis Set (HS) | Lack of cross-layer candidates | Add host–pod–service candidates | Replay S3–S4 |
| S3 | Analysis Structure (AS) | Unreachable causal paths | Rebuild a reachable causal chain | Replay S4 |
| S3 | Analysis Structure (AS) | Hallucinated edges | Prune unsupported edges | Replay S4 |
| S3 | Analysis Structure (AS) | Temporal order violations | Restore cause-before-effect ordering | Replay S4 |
| S3 | Analysis Structure (AS) | Unsupported causal links | Add telemetry support for each link | Replay S4 |
| S4 | Decision Report (DR) | Overconfident ranking | Lower or recalibrate confidence | No replay |
| S4 | Decision Report (DR) | Inconsistent top candidate | Align ranking with repaired analysis | No replay |
| S4 | Decision Report (DR) | Weak verification tests | Replace with discriminative tests | No replay |
| S4 | Decision Report (DR) | Mechanism-irrelevant actions | Match actions to the failure mechanism | No replay |

Once the decisive stage $s^{*}$ is identified, STAR applies a stage-local patch and replays downstream stages to obtain a repaired trace $\mathcal{A}^{*}$ and decision $\mathrm{DR}^{*}$. Patch operators (shown in Table [I](https://arxiv.org/html/2605.15581#S4.T1)) are constrained to preserve artifact schemas and are designed around RCA-specific primitives: telemetry queries, entity scopes, topology neighborhoods, and verification actions.
For EP, STAR performs evidence recollection under RCA heuristics, including time-window adjustment to recover anomaly onset, expansion to upstream/downstream topology neighbors, rebalancing modality coverage, and re-alignment across telemetry sources. For HS, STAR repairs hypothesis coverage by introducing counter-hypotheses under ambiguous evidence, enforcing host–pod–service consistency, and requiring explicit evidence binding to prevent unsupported speculation. For AS, STAR reconstructs topology-feasible, telemetry-closed causal chains by pruning unreachable edges, restoring cause-before-effect order, and requiring each causal link to be supported by at least one evidence item. For DR, STAR calibrates confidence and improves actionability, preferring verification-first outputs when ambiguity remains high and ensuring that recommended tests match the hypothesized failure mechanism.
Replay is the key mechanism for removing RCA contamination. After patching stage $s^{*}$, STAR re-executes all downstream stages conditioned on the corrected upstream artifact:

$$\mathrm{Replay}(\mathcal{A}, s, \mathcal{P}_s) = \mathrm{Run}_{(s+1) \rightarrow 4}\Bigl(\mathrm{Replace}(\mathcal{A}, s, \mathcal{P}_s)\Bigr), \tag{11}$$

where $\mathrm{Run}_{(s+1) \rightarrow 4}$ denotes executing the base agent from stage $s+1$ to $S_4$ with fixed upstream artifacts. Concretely, patching S1 triggers replay of S2–S4, patching S2 triggers replay of S3–S4, patching S3 triggers replay of S4, and patching S4 requires no replay. This ensures that downstream artifacts are regenerated from the corrected upstream state rather than being locally edited on top of a contaminated trace.
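The replay operator in Eq. (11) can be illustrated without any agent framework. The sketch below assumes a trace stored as a dictionary keyed by stage index and one hypothetical runner callable per downstream stage; it does not use or imitate the LangGraph API on which STAR is built.

```python
from typing import Any, Callable

Trace = dict[int, Any]   # stage index (1..4) -> artifact


def replay(trace: Trace, stage: int, patch: Any,
           runners: dict[int, Callable[[Trace], Any]]) -> Trace:
    """Replace the stage artifact with `patch`, then rerun stages stage+1..4.

    Each runner receives the partially repaired trace, so every downstream
    artifact is regenerated from the corrected upstream state. The input
    trace is left unmodified.
    """
    if stage not in (1, 2, 3, 4):
        raise ValueError("stage must be in 1..4")
    repaired = dict(trace)                 # Replace(A, s, P_s) on a copy
    repaired[stage] = patch
    for nxt in range(stage + 1, 5):        # Run_{(s+1) -> 4}; empty when stage == 4
        repaired[nxt] = runners[nxt](repaired)
    return repaired


# Toy runners that only record their upstream input, to make the replay scope visible.
toy_runners = {i: (lambda state, i=i: f"S{i}({state[i - 1]})") for i in (2, 3, 4)}
original = {1: "EP", 2: "HS", 3: "AS", 4: "DR"}
repaired = replay(original, 2, "HS*", toy_runners)
print(repaired)
```

Copying the trace before patching keeps counterfactual probes side-effect free, which matters when several candidate patches are evaluated against the same original trace.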
### IV-E Self-Evolving Repair Memory
To reduce repeated trial-and-error, STAR maintains a repair memory

$$\mathcal{M} = \left\{ \langle \mathbf{q}_i, s_i, \mathcal{P}_i, \Delta S_i \rangle \right\}, \tag{12}$$

where $\mathbf{q}_i$ summarizes the incident pattern in RCA terms (e.g., dominant anomaly type, modality availability, topology neighborhood statistics, and evidence sparsity), $s_i$ is the blamed stage, $\mathcal{P}_i$ is the patch template, and $\Delta S_i$ is the achieved improvement. This memory serves two purposes: it provides an attribution prior for recurring RCA failure patterns, and it seeds the stage critics with historically successful patch templates to improve repair efficiency.
Overall, STAR iterates for at most $I$ rounds. It first runs the base agent to obtain $\mathcal{A}$, audits the trace to compute $(S, \mathbf{d})$, and either accepts the decision, performs a lightweight Fast-Path patch, or enters Slow-Path replay-validated stage localization. After patching the blamed stage and replaying downstream stages, STAR updates memory if the repair yields sufficient improvement. If the repair budget is exhausted, STAR returns a conservative verification-first output, such as top-$K$ candidates with discriminative tests, to avoid overconfident misdiagnosis.
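The overall control flow described above can be sketched as a bounded loop. Every callable in the signature is a hypothetical placeholder for a STAR component (base agent, audit, Fast-Path repair, Slow-Path repair, verification-first fallback), and the memory update is omitted for brevity, so this is a structural illustration rather than the authors' implementation.

```python
from typing import Any, Callable


def repair_loop(
    run_agent: Callable[[], Any],
    audit: Callable[[Any], tuple[float, dict]],
    fast_repair: Callable[[Any, dict], Any],
    slow_repair: Callable[[Any, dict], Any],
    fallback: Callable[[Any], Any],
    tau: float,
    eps: float,
    max_rounds: int,
) -> Any:
    """Audit, route, and repair for at most `max_rounds` rounds."""
    trace = run_agent()
    for _ in range(max_rounds):
        score, diagnostics = audit(trace)
        if score >= tau:
            return trace                                  # Pass: accept the decision
        if score >= tau - eps:
            trace = fast_repair(trace, diagnostics)       # Fast Path
        else:
            trace = slow_repair(trace, diagnostics)       # Slow Path
    score, _ = audit(trace)                               # check the last repair
    return trace if score >= tau else fallback(trace)     # budget exhausted


# Toy setup: the "trace" is just its own audit score, and each repair adds 0.25.
taken: list[str] = []


def toy_fast(trace: float, _diag: dict) -> float:
    taken.append("fast")
    return trace + 0.25


def toy_slow(trace: float, _diag: dict) -> float:
    taken.append("slow")
    return trace + 0.25


final = repair_loop(lambda: 0.25, lambda t: (t, {}), toy_fast, toy_slow,
                    lambda t: "verification-first", tau=0.75, eps=0.25, max_rounds=3)
print(final, taken)
```

In the toy run the first round takes the Slow Path and the second the Fast Path, after which the trace passes; with a budget of one round the same setup would fall back to the verification-first output.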
## V Experiments
### V-A Research Questions
- RQ1: Does STAR improve end-to-end RCA performance compared with the base RCA agent baselines?
- RQ2: Can STAR identify the decisive faulty stage (S1–S4) precisely?
- RQ3: Given an initially incorrect trace, how many repair iterations/replays does STAR typically require to correct the diagnosis?
- RQ4: How much does STAR benefit from its key components, particularly Fast/Slow Routing and Decisive Stage Localization via Counterfactual Candidate Evaluation?
### V-B Experiment Setup
#### V-B1 Dataset and Preprocessing
Dataset A is a large\-scale public benchmark released by the AIOps Challenge\[[25](https://arxiv.org/html/2605.15581#bib.bib25)\]\. It was constructed through controlled fault injection on a realistically deployed microservice system, HipsterShop2\. The benchmark provides multimodal observability data, including metrics, logs, and traces\. HipsterShop2 runs on a dynamic Kubernetes \(K8s\) cluster consisting of 10 services with 4 replicas per service, yielding 40 pods in total, which are dynamically scheduled across 6 nodes\. In total, Dataset A contains 15 fault types, including 9 service\-/pod\-level faults in the K8s container environment and 6 node\-level faults, such as abrupt memory pressure, disk space exhaustion, disk I/O anomalies, CPU contention, and progressive CPU slowdown\.
Dataset B is a real\-world production dataset collected from a Project Management Platform operated by an electric power information enterprise\. In contrast to Dataset A, the incidents in Dataset B were recorded under natural production conditions rather than generated through fault injection\. The platform consists of 12 microservices and 48 pods, and the dataset captures multimodal observability signals, including metrics, logs, and traces, during real incident scenarios\. The observed faults fall into five categories: CPU hog, memory leak, network delay, packet loss, and disk payload overload\.
#### V-B2 Baselines
To evaluate the generalization ability of STAR, we adapt it to two representative RCA agent systems, mABC and RCAgent.
mABC is a multi-agent RCA framework for microservice systems that organizes diagnosis through a predefined workflow and a set of specialized agents, including alert receiving, process scheduling, data collection, dependency exploration, probability estimation, fault mapping, and solution generation. To improve the reliability of LLM-based diagnosis, mABC further introduces a blockchain-inspired voting mechanism, where multiple agents collaboratively verify intermediate answers and revise low-quality outputs through weighted consensus [[26](https://arxiv.org/html/2605.15581#bib.bib26)]. This design makes mABC a representative multi-agent baseline for structured RCA with explicit coordination and reflection.
RCAgent represents a lighter agentic RCA workflow that performs root cause reasoning by directly integrating observability evidence with LLM-based stepwise diagnosis, without the stronger multi-agent coordination and voting process used in mABC. Compared with mABC, RCAgent provides a weaker but more streamlined reasoning baseline, which makes it suitable for testing whether STAR can also improve simpler agent pipelines. In our implementation, both workflows are re-expressed under the same LangGraph execution interface for fair comparison.
#### V-B3 Evaluation Metrics
We evaluate the model from two aspects: root cause localization and fault type classification.
##### Root Cause Localization\.
We report Acc@1, Acc@3, and Acc@5, which measure whether the ground-truth root cause appears in the top-1, top-3, and top-5 predicted candidates, respectively:

$$\mathrm{Acc@}k = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left( r_i \in \hat{\mathcal{R}}_i^{(k)} \right), \quad k \in \{1, 3, 5\}, \tag{13}$$

where $N$ is the number of test incidents, $r_i$ is the ground-truth root cause, and $\hat{\mathcal{R}}_i^{(k)}$ denotes the top-$k$ predicted candidates.
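Eq. (13) translates directly into code. The sketch below assumes each prediction is a list of candidate entities ordered from most to least likely; the function name and the toy incidents are illustrative.

```python
def acc_at_k(ground_truth: list[str], ranked: list[list[str]], k: int) -> float:
    """Fraction of incidents whose true root cause appears in the top-k candidates."""
    if not ground_truth or len(ground_truth) != len(ranked):
        raise ValueError("need one ranked candidate list per incident")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(1 for truth, cands in zip(ground_truth, ranked) if truth in cands[:k])
    return hits / len(ground_truth)


# Four toy incidents with their ranked root-cause candidates.
truths = ["cart", "node-2", "redis", "checkout"]
preds = [
    ["cart", "frontend"],                      # hit at rank 1
    ["frontend", "node-2"],                    # hit at rank 2
    ["frontend", "cart", "node-1", "redis"],   # hit at rank 4
    ["frontend"],                              # miss
]
for k in (1, 3, 5):
    print(k, acc_at_k(truths, preds, k))
```

By construction Acc@1 ≤ Acc@3 ≤ Acc@5, since enlarging $k$ can only add hits.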
##### Fault Type Classification\.
For fault type prediction, we report micro\- and macro\-averaged precision, recall, and F1\-score, including MiPr, MaPr, MiRe, MaRe, MiF1, and MaF1\. For each class $c$, let $TP_{c}$, $FP_{c}$, and $FN_{c}$ denote the numbers of true positives, false positives, and false negatives, respectively\. Then

$$P_{c}=\frac{TP_{c}}{TP_{c}+FP_{c}},\qquad R_{c}=\frac{TP_{c}}{TP_{c}+FN_{c}},\qquad F1_{c}=\frac{2P_{c}R_{c}}{P_{c}+R_{c}}. \tag{14}$$

The macro\-averaged metrics are defined as

$$\mathrm{MaPr}=\frac{1}{C}\sum_{c=1}^{C}P_{c},\qquad \mathrm{MaRe}=\frac{1}{C}\sum_{c=1}^{C}R_{c},\qquad \mathrm{MaF1}=\frac{1}{C}\sum_{c=1}^{C}F1_{c}, \tag{15}$$

where $C$ is the number of fault categories\. The micro\-averaged metrics are computed by aggregating statistics over all classes:

$$\mathrm{MiPr}=\frac{\sum_{c}TP_{c}}{\sum_{c}(TP_{c}+FP_{c})},\qquad \mathrm{MiRe}=\frac{\sum_{c}TP_{c}}{\sum_{c}(TP_{c}+FN_{c})},\qquad \mathrm{MiF1}=\frac{2\cdot\mathrm{MiPr}\cdot\mathrm{MiRe}}{\mathrm{MiPr}+\mathrm{MiRe}}. \tag{16}$$
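The following sketch shows one way Eqs. (14) to (16) could be computed. It assumes exactly one true and one predicted label per incident (in which case micro-averaged precision and recall coincide), takes the class set as the union of observed labels, and returns 0.0 when a denominator is zero. These conventions and the names `prf` and `fault_type_metrics` are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts (0.0 when a denominator is zero)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def fault_type_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[t] += 1  # the true class gains a false negative
    per_class = [prf(tp[c], fp[c], fn[c]) for c in classes]
    n = len(classes)
    # Micro-averaging pools the counts over all classes before computing the ratios.
    mi_pr, mi_re, mi_f1 = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return {
        "MaPr": sum(p for p, _, _ in per_class) / n,
        "MaRe": sum(r for _, r, _ in per_class) / n,
        "MaF1": sum(f for _, _, f in per_class) / n,
        "MiPr": mi_pr,
        "MiRe": mi_re,
        "MiF1": mi_f1,
    }


# Toy labels (hypothetical fault types).
y_true = ["cpu", "cpu", "mem", "net", "net", "net"]
y_pred = ["cpu", "mem", "mem", "net", "net", "cpu"]
metrics = fault_type_metrics(y_true, y_pred)
```

Macro-averaging weights every fault category equally, so rare categories influence MaPr, MaRe, and MaF1 as much as frequent ones, whereas the micro-averaged metrics are dominated by frequent categories.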
By analyzing the agent reasoning traces throughout the RCA process, we identify 13 distinct reasoning failure types spanning four stages \(shown in Table [III](https://arxiv.org/html/2605.15581#S5.T3)\)\. To evaluate the accuracy of STAR’s decisive stage localization, we adopt an LLM\-as\-a\-Judge protocol \[[15](https://arxiv.org/html/2605.15581#bib.bib15),[16](https://arxiv.org/html/2605.15581#bib.bib16),[24](https://arxiv.org/html/2605.15581#bib.bib24)\]\. Specifically, GPT\-5\.2 serves as an independent evaluator for verifying whether STAR correctly identifies the faulty stage in an RCA trace\. Given the RCA reasoning trajectory and STAR’s predicted stage\-level audit result, the judge determines whether the predicted faulty stage is consistent with the trace evidence and with the semantics of the four\-stage RCA taxonomy \(EP/HS/AS/DR\)\. To reduce self\-enhancement bias, the judge model is kept different from the backbone models used in the evaluated RCA agents\. The evaluation prompt is carefully structured to include: \(1\) the definitions, failure criteria, and representative examples of the four RCA stages; \(2\) a decomposed reasoning procedure that asks the judge to first inspect the trace and then localize the inconsistency to one of the four stages; and \(3\) a fixed output format requiring a stage label, a correctness judgment on STAR’s audit, and a brief evidence\-grounded rationale\.
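The fixed output format in item (3) could be enforced with a small validator such as the sketch below. The JSON encoding and the field names `stage`, `audit_correct`, and `rationale` are assumptions made for illustration; the paper does not specify the concrete serialization.

```python
import json
from dataclasses import dataclass

VALID_STAGES = ("EP", "HS", "AS", "DR")


@dataclass(frozen=True)
class JudgeVerdict:
    stage: str           # stage label chosen by the judge
    audit_correct: bool  # whether the judge agrees with STAR's audit
    rationale: str       # brief evidence-grounded justification


def parse_verdict(raw: str) -> JudgeVerdict:
    """Parse and validate a judge response (hypothetical JSON schema)."""
    data = json.loads(raw)
    stage = data["stage"]
    if stage not in VALID_STAGES:
        raise ValueError(f"unknown stage label: {stage!r}")
    correct = data["audit_correct"]
    if not isinstance(correct, bool):
        raise ValueError("audit_correct must be a boolean")
    rationale = str(data["rationale"]).strip()
    if not rationale:
        raise ValueError("rationale must not be empty")
    return JudgeVerdict(stage, correct, rationale)


verdict = parse_verdict(
    '{"stage": "AS", "audit_correct": true, '
    '"rationale": "The causal chain contradicts the alert order."}'
)
```

Rejecting malformed responses up front keeps the stage localization accuracy from being computed over verdicts that fall outside the four-stage taxonomy.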
#### V\-B4 Implementation Details\.
We re\-implement all baseline RCA agents under the LangGraph framework to support unified state management and controllable replay\. The main motivation is that LangGraph provides a native node\-level execution and replay mechanism, which is essential for our stage\-attributed correction setting\. By representing the RCA workflow as a structured execution graph, we can explicitly track intermediate stage artifacts, replay selected nodes, and analyze how local stage repairs affect downstream reasoning outcomes\.
To ensure a fair comparison and eliminate the influence of a specific backbone model, each workflow is instantiated with three foundation models: GPT\-5, Qwen3\-Max, and Gemini\-2\.5\-Pro\. In other words, all workflows are evaluated under the same set of LLM backbones, and the reported results reflect the combined effects of workflow design and stage\-level correction rather than the advantage of any single model\. Unless otherwise specified, all other implementation settings are kept identical across workflows\. All experiments are conducted on a server equipped with an NVIDIA A100 80GB GPU and 256GB RAM\.
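To make the idea of tracking stage artifacts and replaying only downstream nodes concrete, here is a library-free sketch in plain Python. It deliberately does not use the LangGraph API; the stage functions, the `patch_and_replay` helper, and the toy alert data are all hypothetical stand-ins for the four stages (EP, HS, AS, DR).

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]


def collect_evidence(s: State) -> object:      # EP: gather raw evidence
    return list(s["alerts"])


def form_hypotheses(s: State) -> object:       # HS: candidate root causes
    return sorted({a["service"] for a in s["EP"]})


def analyze(s: State) -> object:               # AS: pick the best-supported candidate
    counts = {h: sum(a["service"] == h for a in s["EP"]) for h in s["HS"]}
    return max(counts, key=counts.get)


def report(s: State) -> object:                # DR: final decision
    return {"root_cause": s["AS"]}


STAGES: List[Tuple[str, Callable[[State], object]]] = [
    ("EP", collect_evidence), ("HS", form_hypotheses), ("AS", analyze), ("DR", report),
]


def run_all(inputs: State) -> State:
    state = dict(inputs)
    for name, fn in STAGES:
        state[name] = fn(state)  # each stage stores its artifact under its own key
    return state


def patch_and_replay(state: State, stage: str, patched: object) -> State:
    """Overwrite one stage artifact, then re-run only the downstream stages."""
    idx = [name for name, _ in STAGES].index(stage)
    new_state = dict(state)      # upstream artifacts are reused, not recomputed
    new_state[stage] = patched
    for name, fn in STAGES[idx + 1:]:
        new_state[name] = fn(new_state)
    return new_state


first = run_all({"alerts": [{"service": "frontend"}, {"service": "frontend"}, {"service": "db"}]})
repaired = patch_and_replay(
    first, "EP", [{"service": "db"}, {"service": "db"}, {"service": "frontend"}]
)
```

In this toy run, patching the Evidence Package changes the final report without re-executing anything upstream of the patched stage, which is the property the replay mechanism is relied on for.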
TABLE II: Main results on root cause localization and fault type classification across different datasets, workflows, and foundation models\. Acc@1, Acc@3, and Acc@5 measure root cause localization; MiPr through MaF1 measure fault type classification\.

| Dataset | Workflow | Model | Acc@1 | Acc@3 | Acc@5 | MiPr | MaPr | MiRe | MaRe | MiF1 | MaF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | mABC | GPT-5 | 37.20% | 45.10% | 47.80% | 0.4870 | 0.4580 | 0.5360 | 0.4970 | 0.5103 | 0.4767 |
| A | mABC | Gemini-2.5-Pro | 35.40% | 43.30% | 46.00% | 0.4720 | 0.4460 | 0.5180 | 0.4870 | 0.4939 | 0.4656 |
| A | mABC | Qwen3-Max | 36.20% | 44.10% | 46.80% | 0.4840 | 0.4540 | 0.5240 | 0.4870 | 0.5032 | 0.4699 |
| A | mABC+STAR | GPT-5 | 56.20% | 66.10% | 70.80% | 0.6940 | 0.6640 | 0.7490 | 0.7110 | 0.7204 | 0.6867 |
| A | mABC+STAR | Gemini-2.5-Pro | 53.70% | 63.80% | 68.40% | 0.6760 | 0.6510 | 0.7310 | 0.6990 | 0.7024 | 0.6741 |
| A | mABC+STAR | Qwen3-Max | 54.90% | 65.00% | 69.60% | 0.6890 | 0.6630 | 0.7420 | 0.7210 | 0.7145 | 0.6908 |
| A | RCAgent | GPT-5 | 23.80% | 30.10% | 32.60% | 0.3470 | 0.3200 | 0.4090 | 0.3640 | 0.3755 | 0.3406 |
| A | RCAgent | Gemini-2.5-Pro | 22.00% | 28.30% | 30.80% | 0.3320 | 0.3080 | 0.4010 | 0.3540 | 0.3633 | 0.3294 |
| A | RCAgent | Qwen3-Max | 22.80% | 29.10% | 31.60% | 0.3440 | 0.3160 | 0.3970 | 0.3540 | 0.3686 | 0.3339 |
| A | RCAgent+STAR | GPT-5 | 48.70% | 58.40% | 62.50% | 0.6260 | 0.5930 | 0.6880 | 0.6420 | 0.6555 | 0.6165 |
| A | RCAgent+STAR | Gemini-2.5-Pro | 46.10% | 55.80% | 59.90% | 0.6080 | 0.5770 | 0.6670 | 0.6260 | 0.6361 | 0.6008 |
| A | RCAgent+STAR | Qwen3-Max | 47.40% | 57.10% | 61.20% | 0.6210 | 0.5890 | 0.6810 | 0.6450 | 0.6496 | 0.6157 |
| B | mABC | GPT-5 | 38.70% | 47.00% | 51.10% | 0.5420 | 0.5150 | 0.5890 | 0.5540 | 0.5645 | 0.5338 |
| B | mABC | Gemini-2.5-Pro | 36.90% | 45.20% | 49.30% | 0.5270 | 0.5030 | 0.5710 | 0.5440 | 0.5481 | 0.5227 |
| B | mABC | Qwen3-Max | 37.70% | 46.00% | 50.10% | 0.5390 | 0.5110 | 0.5770 | 0.5440 | 0.5574 | 0.5270 |
| B | mABC+STAR | GPT-5 | 58.20% | 68.60% | 73.90% | 0.7420 | 0.7120 | 0.7910 | 0.7560 | 0.7657 | 0.7333 |
| B | mABC+STAR | Gemini-2.5-Pro | 55.60% | 65.90% | 71.20% | 0.7240 | 0.6950 | 0.7720 | 0.7390 | 0.7472 | 0.7163 |
| B | mABC+STAR | Qwen3-Max | 57.00% | 67.30% | 72.60% | 0.7370 | 0.7060 | 0.7840 | 0.7640 | 0.7598 | 0.7339 |
| B | RCAgent | GPT-5 | 26.60% | 33.40% | 36.90% | 0.4060 | 0.3780 | 0.4590 | 0.4260 | 0.4309 | 0.4006 |
| B | RCAgent | Gemini-2.5-Pro | 24.80% | 31.60% | 35.10% | 0.3910 | 0.3660 | 0.4410 | 0.4160 | 0.4145 | 0.3894 |
| B | RCAgent | Qwen3-Max | 25.60% | 32.40% | 35.90% | 0.4030 | 0.3740 | 0.4470 | 0.4160 | 0.4239 | 0.3939 |
| B | RCAgent+STAR | GPT-5 | 50.10% | 60.70% | 65.20% | 0.6740 | 0.6450 | 0.7230 | 0.6890 | 0.6976 | 0.6663 |
| B | RCAgent+STAR | Gemini-2.5-Pro | 47.50% | 57.90% | 62.30% | 0.6550 | 0.6280 | 0.7010 | 0.6720 | 0.6772 | 0.6493 |
| B | RCAgent+STAR | Qwen3-Max | 48.90% | 59.40% | 63.90% | 0.6680 | 0.6410 | 0.7150 | 0.6910 | 0.6907 | 0.6652 |
TABLE III: Stage\-level reasoning fault taxonomy used in STAR auditing and LLM\-as\-a\-Judge evaluation\.

| Name | Description | Stage |
|---|---|---|
| Fabricated evidence | The agent cites non-existent alerts, metrics, logs, traces, or tool outputs. | Evidence Package |
| Evidence misreading | The agent misinterprets evidence semantics, such as metric trends, log meaning. | Evidence Package |
| Source confusion | The agent confuses the symptom-observing component with the true fault source. | Evidence Package |
| Biased evidence selection | The agent overlooks more diagnostic clues and selects evidence unsystematically. | Evidence Package |
| Premature anchoring | The agent fixates too early on one candidate and ignores alternatives. | Hypothesis Set |
| Over-specific hypothesis | The agent proposes an overly specific fault hypothesis without enough evidence. | Hypothesis Set |
| Missing hypotheses | The agent fails to consider plausible alternative causes or fault types. | Hypothesis Set |
| Temporal–causal mismatch | The inferred causal chain violates event order or expected propagation. | Analysis Structure |
| Unsupported causal leap | The agent asserts causal links not supported by evidence or topology. | Analysis Structure |
| Insufficient verification | The conclusion is maintained with weak, indirect, or insufficient evidence. | Analysis Structure |
| Belief update failure | The agent fails to revise its analysis after contradictory evidence appears. | Analysis Structure |
| Unstable conclusion | The final diagnosis contradicts itself or conflicts with prior reasoning. | Decision Report |
| Non-convergent reporting | The agent fails to reach a decisive RCA result and instead repeats or loops. | Decision Report |
### V\-C RQ1: Does STAR improve end\-to\-end RCA performance compared with the base RCA agent baselines?
Table [II](https://arxiv.org/html/2605.15581#S5.T2) reports the overall performance of different workflows and foundation models on two datasets, covering both root cause localization and fault type classification\.
##### Overall comparison\.
Across both Dataset A and Dataset B, the proposed STAR\-enhanced workflows consistently outperform their corresponding baselines\. In particular, mABC\+STAR achieves the best overall performance on nearly all localization metrics and most classification metrics, while RCAgent\+STAR consistently ranks second, followed by the original mABC and RCAgent\. This ordering is stable across different foundation models, indicating that the gains mainly come from the stage\-attributed correction and replay mechanism rather than from a particular backbone model\.
##### Root cause localization performance\.
STAR brings substantial improvements to root cause localization on both datasets\. On Dataset A, with GPT\-5 as the backbone, mABC improves from 37\.2%/45\.1%/47\.8% to 56\.2%/66\.1%/70\.8% in Acc@1/3/5 after incorporating STAR, corresponding to absolute gains of \+19\.0, \+21\.0, and \+23\.0 points, respectively\. Similarly, RCAgent improves from 23\.8%/30\.1%/32\.6% to 48\.7%/58\.4%/62\.5%, yielding even larger gains of \+24\.9, \+28\.3, and \+29\.9 points\. A similar trend is observed on Dataset B\. For example, mABC\+STAR with GPT\-5 reaches 58\.2%, 68\.6%, and 73\.9% on Acc@1/3/5, compared with 38\.7%, 47\.0%, and 51\.1% for the original mABC\. These results indicate that STAR substantially improves the agent’s ability to recover the true root cause within both top\-1 and top\-$k$ predictions\.
##### Fault type classification performance\.
The performance gains are equally evident in fault type classification\. On Dataset A, mABC\+STAR \(GPT\-5\) raises MiF1/MaF1 from 0\.5103/0\.4767 to 0\.7204/0\.6867, while RCAgent\+STAR improves from 0\.3755/0\.3406 to 0\.6555/0\.6165\. On Dataset B, the corresponding gains are from 0\.5645/0\.5338 to 0\.7657/0\.7333 for mABC, and from 0\.4309/0\.4006 to 0\.6976/0\.6663 for RCAgent\. These improvements suggest that STAR not only helps identify the faulty service more accurately, but also produces more discriminative intermediate reasoning patterns that benefit downstream fault category recognition\.
##### Effect of STAR on different base workflows\.
An interesting observation is that STAR yields larger relative gains on the weaker baseline, RCAgent, than on mABC\. This trend is consistent on both datasets and across all three models\. We attribute this to the fact that weaker baselines are more prone to stage\-level reasoning errors, making them benefit more from stage attribution, targeted correction, and replay\. In contrast, mABC already provides a stronger initial reasoning trajectory, so STAR mainly serves as a refinement mechanism, further improving robustness and top\-$k$ localization\.
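As a worked check of the relative-gain claim, the Acc@1 values reported for GPT\-5 in Table II give the following relative improvements (rounded to one decimal place):

$$\text{Dataset A:}\quad \frac{56.2-37.2}{37.2}\approx 51.1\%\ \text{(mABC)},\qquad \frac{48.7-23.8}{23.8}\approx 104.6\%\ \text{(RCAgent)},$$

$$\text{Dataset B:}\quad \frac{58.2-38.7}{38.7}\approx 50.4\%\ \text{(mABC)},\qquad \frac{50.1-26.6}{26.6}\approx 88.3\%\ \text{(RCAgent)}.$$

On both datasets the relative Acc@1 gain for RCAgent is roughly twice that for mABC, even though the absolute gains differ by only a few points.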
##### Comparison across foundation models\.
Regarding the foundation models, GPT\-5 achieves the strongest overall performance in most settings, especially on Acc@1/3/5 and micro\-averaged classification metrics\. However, Qwen3\-Max is highly competitive and surpasses GPT\-5 on several macro\-level metrics: on MaRe in every STAR\-enhanced setting, and on MaF1 for mABC\+STAR on both datasets\. Gemini\-2\.5\-Pro generally performs slightly below the other two models, but still benefits significantly from STAR\. This pattern indicates that while stronger reasoning backbones improve the absolute ceiling, the effectiveness of STAR is robust across diverse LLM families\.
##### Cross\-dataset observation\.
The same ranking trend holds for both datasets, but absolute performance on Dataset B is generally higher than on Dataset A\. Since Dataset B contains real production incidents from a fixed enterprise platform, its fault patterns may exhibit relatively stronger operational regularity than the large\-scale public benchmark in Dataset A\. Nevertheless, the consistent improvements across both datasets demonstrate the generality of STAR under both controlled fault injection and real\-world incident environments\.
### V\-D RQ2: Can STAR precisely identify the decisive faulty stage?
Figure 3: Comparison across microservice fault types and reasoning fault stages\. \(a\) Comparison Across Microservice Fault Types\. \(b\) Comparison Across Reasoning Fault Stages\.

To answer RQ2, we compare STAR with four adapted failure\-attribution baselines inspired by prior work on automated failure attribution for LLM multi\-agent systems \[[29](https://arxiv.org/html/2605.15581#bib.bib29)\]: All\-at\-once, Step\-by\-step, Binary Search, and a Hybrid variant\. That work formulates failure attribution as identifying the failure\-responsible agent and the decisive error step, and shows that all\-at\-once is stronger for agent\-level attribution, while step\-by\-step is generally better for precise step\-level localization, with binary search lying in between; a hybrid strategy further improves step\-level prediction at higher computational cost\. Following this idea, we adapt these methods to our RCA setting by asking each method to directly predict the decisive faulty stage among $S_{1}$–$S_{4}$, rather than the exact turn index in a generic multi\-agent log\.
We evaluate decisive\-stage localization from two perspectives: microservice fault type and reasoning fault stage\. Specifically, we consider four representative service faults, namely Disk Space Exhaustion, Sudden Memory Pressure, Network Delay, and CPU Stress, and also inject reasoning faults into each stage $S_{1}$–$S_{4}$\. The metric is stage localization accuracy, i\.e\., the proportion of cases where the predicted decisive stage matches the ground\-truth stage\.
As shown in Fig\. [3\(a\)](https://arxiv.org/html/2605.15581#S5.F3.sf1), STAR consistently outperforms all adapted baselines across all four service fault types, achieving 88\.4%, 84\.7%, 79\.6%, and 82\.9% accuracy, respectively\. Among the baselines, Hybrid performs best overall, followed by Step\-by\-step, Binary Search, and All\-at\-once\. This indicates that STAR benefits from explicitly modeling RCA traces as stage\-structured artifacts rather than treating them as generic long\-form interaction logs\.
Fig\. [3\(b\)](https://arxiv.org/html/2605.15581#S5.F3.sf2) further reports the results across different reasoning fault stages\. STAR achieves 89\.1%, 82\.3%, 78\.7%, and 86\.4% on $S_{1}$–$S_{4}$, outperforming the strongest baseline by 10\+ points on all stages\. The same difficulty pattern is observed across methods: $S_{1}$ and $S_{4}$ are easier to localize, while $S_{2}$ and especially $S_{3}$ are harder, since hypothesis drift and analysis errors are more entangled with intermediate reasoning\. Overall, these results show that STAR can reliably identify the decisive faulty stage and that stage\-structured attribution is more effective than directly adapting generic failure\-attribution methods to RCA\.
### V\-E RQ3: How many repair iterations are typically needed before STAR reaches a corrected diagnosis?
Figure 4: Counterfactual repair iteration distribution when STAR is applied to initially incorrect traces generated by the baseline agents\.

To answer RQ3, we analyze the repair efficiency of STAR on initially incorrect RCA traces under two workflows, mABC\+STAR and RCAgent\+STAR, on both Dataset A and Dataset B\. The statistics in RQ3 are computed on incorrect traces originally generated by the baseline agents without repair; the reported replay counts indicate how many STAR repair rounds would be needed if STAR were applied afterward\. For each erroneous trace, STAR iteratively performs decisive\-stage localization, stage patching, and downstream replay until the diagnosis is corrected or the replay budget is exhausted\. We set the maximum number of repair iterations to 3 and report the proportions of cases corrected after the first, second, and third replay, together with the unresolved cases\.
Figure [4](https://arxiv.org/html/2605.15581#S5.F4) shows that most incorrect traces are repaired within the first two replay rounds across all settings\. On Dataset A, mABC\+STAR corrects 61\.8% of the erroneous traces after a single replay, compared with 54\.2% for RCAgent\+STAR\. After two replays, the cumulative correction ratios further increase substantially, while only 7\.2% and 8\.4% of the cases remain unresolved, respectively\. A similar trend is observed on Dataset B, where mABC\+STAR fixes 65\.1% of the cases in the first replay and RCAgent\+STAR fixes 57\.4%, with unresolved cases remaining below 10% for both workflows\.
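The iterative procedure described above (localize, patch, replay, stop when corrected or after three rounds) can be sketched as a bounded loop. The callables `localize`, `patch_and_replay`, and `is_correct` are hypothetical placeholders, and the toy trace that simply lists its faulty stages is an assumption used only to make the sketch executable.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def repair(
    trace: T,
    is_correct: Callable[[T], bool],
    localize: Callable[[T], str],
    patch_and_replay: Callable[[T, str], T],
    max_iters: int = 3,
) -> Tuple[T, Optional[int]]:
    """Return the final trace and the round in which it was corrected (None if unresolved)."""
    for round_no in range(1, max_iters + 1):
        stage = localize(trace)                  # decisive-stage localization
        trace = patch_and_replay(trace, stage)   # stage patch + downstream replay
        if is_correct(trace):
            return trace, round_no
    return trace, None                           # replay budget exhausted


# Toy stand-ins: a trace is just the set of stages that still contain an error.
ORDER = ("EP", "HS", "AS", "DR")


def toy_localize(trace):
    return next(s for s in ORDER if s in trace["faulty"])


def toy_patch_and_replay(trace, stage):
    return {"faulty": trace["faulty"] - {stage}}


def toy_is_correct(trace):
    return not trace["faulty"]


_, rounds = repair({"faulty": frozenset({"EP", "AS"})},
                   toy_is_correct, toy_localize, toy_patch_and_replay)
_, unresolved = repair({"faulty": frozenset(ORDER)},
                       toy_is_correct, toy_localize, toy_patch_and_replay)
```

Recording the returned round number per trace yields exactly the kind of distribution reported in this subsection: shares corrected after one, two, or three replays, plus the unresolved remainder.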
Overall, these results indicate that STAR usually requires only a small number of targeted replays to recover a correct diagnosis\. Moreover, the stronger base workflow \(mABC\) benefits more from one\-shot repair, whereas RCAgent\+STAR more often relies on a second replay, suggesting that better initial reasoning trajectories make stage\-level patching more immediately effective\.
### V\-F RQ4: How much does STAR benefit from its key components, particularly Fast/Slow Routing and Decisive Stage Localization via Counterfactual Candidate Evaluation?
To answer RQ4, we conduct ablation studies on the two key components of STAR: Fast/Slow Routing and Decisive Stage Localization via Counterfactual Candidate Evaluation\. To isolate component\-level effects from backbone variation, we instantiate both mABC\+STAR and RCAgent\+STAR with GPT\-5 in this subsection\. For Fast/Slow Routing, we evaluate both downstream RCA quality and repair efficiency using Acc@1, Acc@3, and the average number of repair iterations on initially incorrect baseline traces\. For Decisive Stage Localization via Counterfactual Candidate Evaluation, we measure both the average decisive\-stage localization accuracy and the downstream RCA performance \(Acc@1 and Acc@3\)\.
Table [IV](https://arxiv.org/html/2605.15581#S5.T4) shows that Fast/Slow Routing consistently improves both repair efficiency and diagnosis quality\. Removing this module increases the average repair iterations and reduces Acc@1/Acc@3 across all datasets and workflows\. For example, on Dataset A with mABC\+STAR, enabling Fast/Slow Routing reduces the average repair iterations from 1\.78 to 1\.41 and raises Acc@1/Acc@3 from 52\.4%/62\.8% to 56\.2%/66\.1%\. This indicates that Fast/Slow Routing is not merely a computational optimization, but also improves repair quality by matching correction depth to trace difficulty\.
Table [V](https://arxiv.org/html/2605.15581#S5.T5) shows that removing Decisive Stage Localization via Counterfactual Candidate Evaluation substantially degrades both stage localization accuracy and downstream RCA performance\. For instance, on Dataset B with mABC\+STAR, stage localization accuracy drops from 88\.6% to 80\.3%, while Acc@1/Acc@3 decrease from 58\.2%/68\.6% to 54\.8%/64\.7%\. Overall, the ablation results confirm that Fast/Slow Routing mainly improves repair efficiency, whereas Counterfactual Candidate Evaluation is crucial for accurate stage attribution and stronger root cause localization\.
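This section names counterfactual candidate evaluation without restating its mechanics. One plausible reading, assumed here purely for illustration and not confirmed by the text, is that each candidate stage is tentatively patched and replayed, and the stage whose counterfactual outcome improves a quality score the most is taken as decisive. The `score` and `patch_and_replay` callables below are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple, TypeVar

T = TypeVar("T")


def localize_decisive_stage(
    trace: T,
    candidates: Sequence[str],
    patch_and_replay: Callable[[T, str], T],
    score: Callable[[T], float],
) -> Tuple[str, Dict[str, float]]:
    """Pick the candidate stage whose counterfactual patch yields the largest score gain."""
    base = score(trace)
    gains = {s: score(patch_and_replay(trace, s)) - base for s in candidates}
    # max() keeps the first maximal element, so ties resolve to the earliest stage.
    best = max(candidates, key=lambda s: gains[s])
    return best, gains


# Toy stand-ins: only patching the truly faulty stage raises the (assumed) audit score.
def toy_patch_and_replay(trace, stage):
    return {"faulty": None} if stage == trace["faulty"] else trace


def toy_score(trace):
    return 1.0 if trace["faulty"] is None else 0.2


best, gains = localize_decisive_stage(
    {"faulty": "AS"}, ("EP", "HS", "AS", "DR"), toy_patch_and_replay, toy_score
)
```

Under this reading, each localization costs one replay per candidate stage, which is one reason a routing step that limits how often the full evaluation runs could affect both cost and quality.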
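TABLE IV: Ablation on Fast/Slow Routing \(F/S\)\. Lower Avg\. Iters is better\.

| Dataset | Workflow | Variant | Avg. Iters ↓ | Acc@1 / Acc@3 (%) |
|---|---|---|---|---|
| A | mABC+STAR | w/o F/S | 1.78 | 52.4 / 62.8 |
| A | mABC+STAR | Full | 1.41 | 56.2 / 66.1 |
| A | RCAgent+STAR | w/o F/S | 1.86 | 44.6 / 54.3 |
| A | RCAgent+STAR | Full | 1.51 | 48.7 / 58.4 |
| B | mABC+STAR | w/o F/S | 1.69 | 54.3 / 64.5 |
| B | mABC+STAR | Full | 1.37 | 58.2 / 68.6 |
| B | RCAgent+STAR | w/o F/S | 1.80 | 46.2 / 56.8 |
| B | RCAgent+STAR | Full | 1.47 | 50.1 / 60.7 |

TABLE V: Ablation on Counterfactual Candidate Evaluation \(CCE\)\.

| Dataset | Workflow | Variant | Stage Acc. (%) | Acc@1 / Acc@3 (%) |
|---|---|---|---|---|
| A | mABC+STAR | w/o CCE | 78.4 | 50.7 / 61.9 |
| A | mABC+STAR | Full | 86.9 | 56.2 / 66.1 |
| A | RCAgent+STAR | w/o CCE | 75.1 | 42.3 / 49.6 |
| A | RCAgent+STAR | Full | 83.7 | 48.7 / 58.4 |
| B | mABC+STAR | w/o CCE | 80.3 | 54.8 / 60.2 |
| B | mABC+STAR | Full | 88.6 | 58.2 / 68.6 |
| B | RCAgent+STAR | w/o CCE | 77.6 | 44.1 / 49.9 |
| B | RCAgent+STAR | Full | 85.9 | 50.1 / 60.7 |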
## VI Related Works
### VI\-A RCA in Microservices
Root cause analysis \(RCA\) in microservices has evolved from correlation\- and graph\-based methods to multimodal learning over metrics, logs, traces, and topology\. Early efforts such as CauseInfer, FacGraph, and MS\-Rank explored dependency\-aware and correlation\-based diagnosis\[[1](https://arxiv.org/html/2605.15581#bib.bib1),[2](https://arxiv.org/html/2605.15581#bib.bib2),[3](https://arxiv.org/html/2605.15581#bib.bib3)\], while more recent systems such as MicroHECL, Eadro, Nezha, MRCA, and Trace\-based RCA improve localization robustness by jointly modeling observability signals and service dependencies\[[5](https://arxiv.org/html/2605.15581#bib.bib5),[7](https://arxiv.org/html/2605.15581#bib.bib7),[8](https://arxiv.org/html/2605.15581#bib.bib8),[19](https://arxiv.org/html/2605.15581#bib.bib19),[20](https://arxiv.org/html/2605.15581#bib.bib20),[6](https://arxiv.org/html/2605.15581#bib.bib6)\]\. Other recent studies further investigate operation\-aware, event\-causal, and sparse\-observability RCA settings\[[21](https://arxiv.org/html/2605.15581#bib.bib21),[9](https://arxiv.org/html/2605.15581#bib.bib9),[22](https://arxiv.org/html/2605.15581#bib.bib22),[23](https://arxiv.org/html/2605.15581#bib.bib23)\]\.
### VI\-B LLM\-based RCA Agents
Recent work has begun to cast RCA as an agentic reasoning task using LLMs, where agents collect evidence, generate hypotheses, and produce diagnostic decisions \[[26](https://arxiv.org/html/2605.15581#bib.bib26),[28](https://arxiv.org/html/2605.15581#bib.bib28)\]\. More generally, LLM agent frameworks such as ReAct and AutoGen provide a foundation for structured tool use, staged reasoning, and agent collaboration in complex operational tasks \[[10](https://arxiv.org/html/2605.15581#bib.bib10),[13](https://arxiv.org/html/2605.15581#bib.bib13)\]\. Yet such systems remain vulnerable to reasoning failures such as evidence omission, hypothesis drift, and decision inconsistency, and they lack an explicit mechanism to localize and repair the faulty reasoning stage\.
## VII Conclusion
This paper presented STAR, a stage\-attributed correction framework for repairing erroneous reasoning traces in LLM\-based RCA agents\. Instead of treating RCA failure as a monolithic end\-to\-end error, STAR decomposes the diagnostic process into four structured stages and performs decisive stage localization, stage\-specific patching, and targeted replay\. Built on top of LangGraph, STAR leverages node\-level replay and structured stage artifacts to support controllable and efficient repair\.
Experiments on both a public large\-scale benchmark and a real\-world production dataset demonstrate that STAR consistently improves root cause localization and fault type classification across different workflows and backbone models\. Our results further show that STAR can identify the decisive faulty stage with high accuracy, repair most initially incorrect traces within one or two replay rounds, and benefit significantly from both Fast/Slow Routing and counterfactual candidate evaluation\. These findings suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable and repairable agentic diagnosis\.
Future work includes extending stage attribution from four coarse stages to finer\-grained sub\-stage or tool\-level repair, making replay policies more adaptive by jointly optimizing repair quality, latency, and token cost, and generalizing STAR beyond microservice RCA to other structured agent workflows involving evidence gathering, hypothesis formation, analysis, and decision making\.
## References
- \[1\]Chen P, Qi Y, Zheng P, et al\. Causeinfer: Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems\[C\]\. IEEE INFOCOM 2014\-IEEE Conference on Computer Communications\. IEEE, 2014: 1887\-1895\.
- \[2\]Lin W, Ma M, Pan D, et al\. Facgraph: Frequent anomaly correlation graph mining for root cause diagnose in micro\-service architecture\[C\]\. 2018 IEEE 37th International Performance Computing and Communications Conference \(IPCCC\)\. IEEE, 2018: 1\-8\.
- \[3\]Ma M, Lin W, Pan D, et al\. Ms\-rank: Multi\-metric and self\-adaptive root cause diagnosis for microservice applications\[C\]\. 2019 IEEE International Conference on Web Services \(ICWS\)\. IEEE, 2019: 60\-67\.
- \[4\]Ma M, Yin Z, Zhang S, et al\. Diagnosing root causes of intermittent slow queries in cloud databases\[J\]\. Proceedings of the VLDB Endowment, 2020, 13\(8\): 1176\-1189\.
- \[5\]Liu D, He C, Peng X, et al\. Microhecl: High\-efficient root cause localization in large\-scale microservice systems\[C\]\. 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice \(ICSE\-SEIP\)\. IEEE, 2021: 338\-347\.
- \[6\]Li Z, Chen J, Jiao R, et al\. Practical root cause localization for microservice systems via trace analysis\[C\]\. 2021 IEEE/ACM 29th International Symposium on Quality of Service \(IWQOS\)\. IEEE, 2021: 1\-10\.
- \[7\]Lee C, Yang T, Chen Z, et al\. Eadro: An end\-to\-end troubleshooting framework for microservices on multi\-source data\[C\]\. 2023 IEEE/ACM 45th International Conference on Software Engineering \(ICSE\)\. IEEE, 2023: 1750\-1762\.
- \[8\]Yu G, Chen P, Li Y, et al\. Nezha: Interpretable fine\-grained root causes analysis for microservices on multi\-modal observability data\[C\]\. Proceedings of the 31st ACM joint European software engineering conference and symposium on the foundations of software engineering\. 2023: 553\-565\.
- \[9\]Yang J, Guo Y, Chen Y, et al\. TraceNet: Operation aware root cause localization of microservice system anomalies\[C\]\. 2023 IEEE International Conference on Communications Workshops \(ICC Workshops\)\. IEEE, 2023: 758\-763\.
- \[10\]Yao S, Zhao J, Yu D, et al\. React: Synergizing reasoning and acting in language models\[C\]\. The eleventh international conference on learning representations\. 2022\.
- \[11\]Shinn N, Cassano F, Gopinath A, et al\. Reflexion: Language agents with verbal reinforcement learning\[J\]\. Advances in neural information processing systems, 2023, 36: 8634\-8652\.
- \[12\]Madaan A, Tandon N, Gupta P, et al\. Self\-refine: Iterative refinement with self\-feedback\[J\]\. Advances in neural information processing systems, 2023, 36: 46534\-46594\.
- \[13\]Wu Q, Bansal G, Zhang J, et al\. Autogen: Enabling next\-gen LLM applications via multi\-agent conversations\[C\]\. First conference on language modeling\. 2024\.
- \[14\]Yao S, Yu D, Zhao J, et al\. Tree of thoughts: Deliberate problem solving with large language models\[J\]\. Advances in neural information processing systems, 2023, 36: 11809\-11822\.
- \[15\]Liu Y, Iter D, Xu Y, et al\. G\-eval: NLG evaluation using gpt\-4 with better human alignment\[C\]\. Proceedings of the 2023 conference on empirical methods in natural language processing\. 2023: 2511\-2522\.
- \[16\]Zheng L, Chiang W L, Sheng Y, et al\. Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\[J\]\. Advances in neural information processing systems, 2023, 36: 46595\-46623\.
- \[17\]Achiam J, Adler S, Agarwal S, et al\. Gpt\-4 technical report\[J\]\. arXiv preprint arXiv:2303\.08774, 2023\.
- \[18\]Hong S, Zhuge M, Chen J, et al\. MetaGPT: Meta programming for a multi\-agent collaborative framework\[C\]\. The twelfth international conference on learning representations\. 2023\.
- \[19\]Wang Y, Zhu Z, Fu Q, et al\. Mrca: Metric\-level root cause analysis for microservices via multi\-modal data\[C\]\. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering\. 2024: 1057\-1068\.
- \[20\]Zhang C, Dong Z, Peng X, et al\. Trace\-based multi\-dimensional root cause localization of performance issues in microservice systems\[C\]\. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering\. 2024: 1\-12\.
- \[21\]Yang J, Guo Y, Chen Y, et al\. Micronet: Operation aware root cause identification of microservice system anomalies\[J\]\. IEEE Transactions on Network and Service Management, 2024, 21\(4\): 4255\-4267\.
- \[22\]Yao Z, Pei C, Chen W, et al\. Chain\-of\-event: Interpretable root cause analysis for microservices through automatically learning weighted event causal graph\[C\]\. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering\. 2024: 50\-61\.
- \[23\]Yao Z, Ye H, Pei C, et al\. Sparserca: Unsupervised root cause analysis in sparse microservice testing traces\[C\]\. 2024 IEEE 35th International Symposium on Software Reliability Engineering \(ISSRE\)\. IEEE, 2024: 391\-402\.
- \[24\]Wang P, Li L, Chen L, et al\. Large language models are not fair evaluators\[C\]\. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)\. 2024: 9440\-9450\.
- \[25\]Sun Y, Wang J, Li Z, et al\. A Scenario\-Oriented Benchmark for Assessing AIOps Algorithms in Microservice Management\[J\]\. arXiv preprint arXiv:2407\.14532, 2024\.
- \[26\]Zhang W, Guo H, Yang J, et al\. mABC: multi\-Agent Blockchain\-Inspired Collaboration for root cause analysis in micro\-services architecture\[C\]\. Findings of the Association for Computational Linguistics: EMNLP 2024\. 2024: 4017\-4033\.
- \[27\]Sigelman B H, Barroso L A, Burrows M, et al\. Dapper, a large\-scale distributed systems tracing infrastructure\[J\]\. 2010\.
- \[28\]Pei C, Wang Z, Liu F, et al\. Flow\-of\-action: Sop enhanced llm\-based multi\-agent system for root cause analysis\[C\]\. Companion Proceedings of the ACM on Web Conference 2025\. 2025: 422\-431\.
- \[29\]Zhang S, Yin M, Zhang J, et al\. Which agent causes task failures and when? on automated failure attribution of llm multi\-agent systems\[J\]\. arXiv preprint arXiv:2505\.00212, 2025\.