Why Retrieval-Augmented Generation Fails: A Graph Perspective

arXiv cs.CL Papers

Summary

This paper investigates why Retrieval-Augmented Generation (RAG) systems fail despite having access to correct evidence. Using circuit tracing and attribution graphs, the authors find that correct predictions exhibit deeper reasoning paths and more distributed evidence flow, while failures show shallow and fragmented patterns. They propose a graph-based error detection framework and targeted interventions to improve RAG reliability.

arXiv:2605.14192v1

# Why Retrieval-Augmented Generation Fails: A Graph Perspective
Source: [https://arxiv.org/html/2605.14192](https://arxiv.org/html/2605.14192)

###### Abstract\.

Retrieval\-Augmented Generation \(RAG\) has become a powerful and widely used approach for improving large language models by grounding generation in retrieved evidence\. However, RAG systems still produce incorrect answers in many cases\. Why RAG fails despite having access to external information remains poorly understood\. We present a model\-internal study of retrieval\-augmented generation that examines how retrieved evidence influences answer generation\. Using circuit tracing, we construct attribution graphs that model the flow of information through transformer layers during decoding\. These graphs represent interactions among retrieved context, intermediate model activations, and generated tokens, providing a graph\-based, circuit\-level view of how external evidence is integrated into the model’s reasoning process\. Across multiple question answering benchmarks, we observe consistent structural differences: correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and a more structured pattern of local connectivity, while failed predictions show shallower, fragmented, and overly concentrated evidence flow\. Building on these findings, we develop a graph\-based error detection framework that uses attribution\-graph topology features\. Furthermore, we show that attribution graphs enable targeted interventions\. By reinforcing question\-constrained evidence grounding, we reshape internal routing so that answer generation remains guided by the question, leading to more effective integration of retrieved information and fewer errors\.

Retrieval\-Augmented Generation, Attribution Graph, Large Language Model

## 1\.Introduction

Retrieval\-Augmented Generation \(RAG\) has become a central paradigm for improving large language models by grounding generation in external evidence \(Lewis et al\., [2020](https://arxiv.org/html/2605.14192#bib.bib12); Gao et al\., [2023](https://arxiv.org/html/2605.14192#bib.bib13); Han et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib14); Chen et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib47); Zheng et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib48); Su et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib49)\)\. By retrieving relevant documents at inference time and conditioning the model on this information, RAG systems aim to reduce incorrect predictions and improve factual reliability \(Ayala and Bechard, [2024](https://arxiv.org/html/2605.14192#bib.bib15); Hu et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib16); Niu et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib17); Peng et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib50); Asai et al\., [2023](https://arxiv.org/html/2605.14192#bib.bib53)\)\. Despite these advantages, incorrect outputs remain common even when the retrieved passages contain the necessary evidence\. This suggests that the presence of evidence alone does not guarantee that it is faithfully integrated into the model’s reasoning process \(Guo et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib18); Gupta et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib20); Zhou et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib21); Wang et al\., [2023](https://arxiv.org/html/2605.14192#bib.bib51); Shao et al\., [2023](https://arxiv.org/html/2605.14192#bib.bib52)\)\.

Existing work on RAG failures focuses primarily on retrieval quality or output\-level consistency \(Trivedi et al\., [2023](https://arxiv.org/html/2605.14192#bib.bib22); Edge et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib23)\)\. Some methods improve retrievers or re\-rank retrieved documents, while others detect errors using answer–document overlap or model confidence \(Yu et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib24); Lee et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib25); Wu et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib26)\)\. Although these approaches provide useful diagnostic indicators, they offer limited insight into the model\-internal reasoning dynamics that lead to unfaithful generation\. Recent studies have explored hidden\-state representations as diagnostic signals for knowledge checking \(Zeng et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib19)\)\. However, such approaches typically rely on representations from a single layer and provide only a largely static view of the model’s internal state \(Liu et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib54)\)\. As a result, they do not characterize how the retrieved evidence is propagated, transformed, and combined across layers during decoding\. This highlights the need for a methodological framework that explicitly captures internal evidence flow, enabling a granular understanding of knowledge aggregation\.

In this work, we take a graph perspective on RAG reasoning\. Instead of examining only inputs and outputs, we analyze how retrieved evidence propagates through the model during decoding\. We use the circuit tracing technique \(Ameisen et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib27)\) to build topological features that quantify how context tokens influence intermediate activations and final answer tokens\. We then translate these attribution signals into attribution graphs, which represent information flow among retrieved tokens, intermediate components, and generated outputs\. This graph\-based representation enables direct structural analysis of reasoning processes across examples\. Using it, we conduct a systematic study of both correct and incorrect RAG predictions and observe consistent structural differences across datasets\. Correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and more structured local connectivity\. In contrast, incorrect predictions show shallower, fragmented, and overly concentrated evidence flow\.

To explain more clearly why failures occur, we focus on a mixed\-context setting in which retrieved passages contain both supporting and distracting information\. This scenario is particularly diagnostic, as successful reasoning requires selectively integrating the truly relevant evidence rather than relying on superficial question–context overlap\. Tracing internal information flow under this condition reveals a recurring failure mode that we term surface\-aligned evidence grounding \(SAEG\): the evidence used only superficially matches the question, the model lacks deep understanding of the question and sustained influence from it, and generation becomes increasingly dominated by retrieved context\. In contrast, correct predictions often exhibit question\-constrained evidence grounding \(QCEG\), where the model places stronger emphasis on understanding the question and retrieved evidence remains consistently regulated by the question’s semantic constraints, forming deeper and more integrated reasoning structures\.

Overall, our study establishes attribution\-graph structure as a practical and interpretable lens for understanding evidence\-grounding failures in RAG systems\. Building on the above insights, we develop model\-internal error detection methods and targeted inference\-time interventions that directly regulate internal routing dynamics\. These approaches not only detect incorrect predictions but can also steer some failures toward correct outcomes, demonstrating the practical utility of our mechanistic understanding\. Our main contributions are summarized as follows\.

- We use circuit tracing to derive attribution graphs for RAG models, enabling a graph\-based analysis of evidence propagation and influence\.
- We identify consistent structural differences between correct and incorrect predictions, showing that many RAG errors stem from insufficient question understanding and over\-reliance on retrieved context\.
- We develop a graph\-based error detection framework that operates purely on internal model dynamics\.
- We demonstrate that attribution\-graph analysis enables targeted inference\-time interventions that promote question\-constrained evidence grounding \(QCEG\), thereby reducing incorrect predictions during generation\.

## 2\.Related Work

Due to space limitations, we give a brief overview of the most relevant prior work here and defer a more comprehensive discussion to Appendix [A\.1](https://arxiv.org/html/2605.14192#A1.SS1)\.

Retrieval\-Augmented Generation\. Retrieval\-Augmented Generation \(RAG\) improves the factuality and reasoning of large language models by grounding generation in external knowledge \(Zhao et al\., [2026](https://arxiv.org/html/2605.14192#bib.bib28); Fan et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib29)\)\. Prior work has explored dense and hybrid retrieval, multi\-hop evidence gathering, iterative retrieval–generation loops, and query reformulation \(Nian et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib30); Tang and Yang, [2024](https://arxiv.org/html/2605.14192#bib.bib31); Trivedi et al\., [2023](https://arxiv.org/html/2605.14192#bib.bib22); Chan et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib32)\)\. Other efforts enhance robustness through context selection, reranking, compression, and prompt engineering \(Dong et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib33); Ampazis, [2024](https://arxiv.org/html/2605.14192#bib.bib34)\)\.

Despite these advances, most approaches treat the language model as a black box and focus on system\-level improvements, offering limited insight into how retrieved evidence is internally processed\. Consequently, they cannot fully explain why errors persist even when relevant evidence is successfully retrieved\. Some recent work evaluates faithfulness and evidence usage \(Zeng et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib19); Liu et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib54)\), but typically relies on representations from a single layer, providing only a static and partial view of the model’s internal computation\.

Interpretability and Circuit Analysis of LLMs\. A parallel line of research investigates transformer internals using attention analysis, sparse autoencoders, transcoders, and circuit tracing \(Clark et al\., [2019](https://arxiv.org/html/2605.14192#bib.bib37); Vig and Belinkov, [2019](https://arxiv.org/html/2605.14192#bib.bib38); Cunningham et al\., [2023](https://arxiv.org/html/2605.14192#bib.bib39); Dunefsky et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib36); Elhage et al\., [2021](https://arxiv.org/html/2605.14192#bib.bib40)\)\. These methods decompose neural representations into interpretable components and reveal that specific behaviors can often be attributed to distributed circuits spanning layers and heads \(Paulo et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib41); Ferrando et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib42)\)\. Attribution graphs have emerged as a useful abstraction for modeling information flow within networks \(Marks et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib43)\)\.

Although circuit\-level analyses have shed light on reasoning in standalone language models \(Dai et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib57); Zhao et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib58)\), they rarely consider retrieval\-augmented settings\. As a result, how externally retrieved evidence interacts with internal computational circuits in RAG remains largely unexplored\.

## 3\.Background and Preliminaries

In this section, we formally define the attribution graph and describe how it is constructed from the internal computation of a transformer model\.

### 3\.1\.Definition of Attribution Graphs

We represent token\-level causal interactions inside the model in a graph view\. In particular, we model the interactions among the activations as a directed attribution graph $G = (V, E)$ that captures how information flows between token representations across layers during inference\.

Each node $v_{t,\ell} \in V$ corresponds to the representation of token position $t$ at transformer layer $\ell$\. A directed edge $(v_{s,k} \rightarrow v_{t,\ell}) \in E$ indicates that the token state at position $s$ in layer $k$ contributes to the token state at position $t$ in layer $\ell$\. The edge weight $w$ measures the strength of this causal contribution\. This graph\-level view allows us to analyze model reasoning as a structured computational process, revealing how evidence is integrated, propagated, and transformed as representations evolve across layers\.
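As a minimal illustration of this definition, the sketch below stores an attribution graph as a weighted adjacency map over \(token position, layer\) nodes\. The class name `AttributionGraph`, its methods, and the tuple encoding of nodes are hypothetical choices made for this example; they are not taken from the paper or from any circuit\-tracing library\.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Node = Tuple[int, int]  # (token position t, layer l); hypothetical encoding


class AttributionGraph:
    """Directed, weighted graph G = (V, E) over (position, layer) nodes."""

    def __init__(self) -> None:
        self.nodes: Set[Node] = set()
        # edges[src][dst] = attribution weight w of the edge src -> dst
        self.edges: Dict[Node, Dict[Node, float]] = defaultdict(dict)

    def add_edge(self, src: Node, dst: Node, weight: float) -> None:
        """Record that the state at `src` contributes to the state at `dst`."""
        self.nodes.update((src, dst))
        self.edges[src][dst] = weight

    def successors(self, node: Node) -> List[Node]:
        """Nodes that receive a direct contribution from `node`."""
        return list(self.edges.get(node, {}))

    def num_edges(self) -> int:
        return sum(len(targets) for targets in self.edges.values())


# Toy example: position 0 at layer 1 feeds positions 0 and 1 at layer 2.
g = AttributionGraph()
g.add_edge((0, 1), (0, 2), 0.5)
g.add_edge((0, 1), (1, 2), 0.25)
```

Keeping the weights in a nested dictionary makes it cheap to look up the contribution between any two nodes, which is all the structural metrics in later sections need\.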

### 3\.2\.Constructing Attribution Graphs

We now describe how token\-level attribution graphs are constructed from a transformer model\.

##### Feature Decomposition as the Node Basis

Following prior work on circuit tracing and attribution \(Dunefsky et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib36)\), we adopt transcoders to decompose residual stream activations when building the attribution graph for a fixed target logit\. At each layer $\ell$ and token position $t$, the residual stream vector is represented as a sparse set of learned activation units, which serve as intermediate carriers of attribution signals\.

Attribution is computed at the level of activation units, reflecting how each unit contributes—directly or indirectly—to the target logit through the network\. These activation\-unit\-level attributions are then aggregated by token position, so that tokens in the prompt, retrieved context, and generated output correspond to nodes in the attribution graph\.

##### Edge Construction via a Linearized Replacement Model

Edge weights in the attribution graph are obtained using a locally linearized replacement model, following existing circuit\-tracing methods \(Ameisen et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib27)\)\. Specifically, we replace MLP blocks with their corresponding transcoders while keeping attention modules unchanged, and fix attention patterns and layer\-normalization terms at their forward\-pass values\. Under this setting, the network computation is linear with respect to activation\-unit activations\.

This linearization allows attribution signals with respect to the target logit to be decomposed into additive contributions between activation units\. These unit\-level attributions are aggregated across units associated with each token pair, yielding directed token\-to\-token attribution scores, which are used as edge weights in the attribution graph\.
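The aggregation step described above can be sketched as follows\. This is an illustration only: it assumes the unit\-level attributions are already available as \(source token, target token, value\) triples, that aggregation is a plain sum, and that weak edges are dropped by a magnitude cutoff\. None of these details are specified in the text, and the function name `aggregate_to_token_edges` is hypothetical\.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple

# (source token, target token, unit-level attribution value); hypothetical format
UnitAttribution = Tuple[Hashable, Hashable, float]


def aggregate_to_token_edges(
    unit_attributions: Iterable[UnitAttribution],
    min_abs_weight: float = 0.0,
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Sum unit-level attributions for each (source token, target token) pair.

    Summation and the magnitude cutoff are assumptions made for illustration.
    """
    totals: Dict[Tuple[Hashable, Hashable], float] = defaultdict(float)
    for src_token, dst_token, value in unit_attributions:
        totals[(src_token, dst_token)] += value
    # Keep only pairs whose aggregated magnitude exceeds the cutoff.
    return {pair: w for pair, w in totals.items() if abs(w) > min_abs_weight}


# Toy input: two units link "ctx_3" to "ans_0"; one weak unit links "q_1" to "ans_0".
units = [("ctx_3", "ans_0", 0.5), ("ctx_3", "ans_0", 0.25), ("q_1", "ans_0", 0.125)]
edges = aggregate_to_token_edges(units, min_abs_weight=0.2)
```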

## 4\.Circuit Analysis for RAG

In this section, we analyze the internal circuit structure of RAG models to understand how retrieved evidence is integrated during answer generation\. We begin with a general retrieval setting, where the retrieved context is treated as unconstrained\. Under this setting, we compare the attribution\-graph structures of correct and incorrect predictions to identify systematic differences in how information flows through the model’s internal computation\.

To probe the failure mechanism more directly, we introduce a more challenging mixed\-context setting in which retrieved passages intentionally include both supporting and non\-supporting information\. This scenario better reflects realistic retrieval conditions and places stronger demands on the model’s ability to distinguish relevant evidence from noise\. Analyzing circuit behavior under this mixed setting allows us to study how incorrect reasoning emerges when the model fails to selectively ground its predictions in truly supportive context\.

### 4\.1\.Circuit Analysis of Correct and Incorrect Predictions

This section uses attribution graphs to analyze how correct and incorrect predictions differ in the way models internally organize and integrate information\. We examine the structural properties of the model’s internal computation during answer generation\. These structural patterns provide insight into the internal mechanisms that distinguish successful from unsuccessful use of retrieved evidence\.

#### 4\.1\.1\.Graph Metrics

To understand why some predictions successfully integrate retrieved evidence while others do not, we examine the structural organization of their attribution graphs\. Given an attribution graph $G = (V, E)$ defined in Section [3\.1](https://arxiv.org/html/2605.14192#S3.SS1), each example is summarized through a set of graph\-level statistics\. Rather than characterizing individual graphs separately, our goal is to identify systematic structural differences between correct and incorrect reasoning circuits\.

We hypothesize that correct and incorrect predictions differ along three fundamental dimensions of internal evidence integration: \(1\) how far information propagates through the model, \(2\) how strongly token representations interact with one another, and \(3\) how information is organized across local and global structures\. We therefore design a set of graph metrics that quantify each of these aspects\.

##### Propagation depth\.

The first dimension concerns the depth of information propagation\. Correct reasoning may require evidence to travel through multiple intermediate representations, whereas shallow propagation may reflect shortcut or surface\-level processing\. We measure this using the longest directed path length, $\bm{\mathrm{DAG\text{-}L}(G)} = \max_{\pi \in \mathcal{P}(G)} |\pi|$, where $\mathcal{P}(G)$ is the set of directed paths in $G$\. Larger values indicate longer multi\-step propagation chains, suggesting more compositional reasoning\.
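A minimal sketch of how DAG\-L could be computed, assuming the attribution graph is acyclic and that the path length $|\pi|$ counts edges; neither detail is stated explicitly above\. The function name `dag_longest_path` is hypothetical\. It processes nodes in topological order \(Kahn's algorithm\) and tracks the longest path ending at each node\.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List, Tuple


def dag_longest_path(
    nodes: Iterable[Hashable], edges: Iterable[Tuple[Hashable, Hashable]]
) -> int:
    """Number of edges on the longest directed path; raises ValueError on a cycle."""
    succ: Dict[Hashable, List[Hashable]] = {v: [] for v in nodes}
    indegree: Dict[Hashable, int] = {v: 0 for v in succ}
    for u, v in edges:
        # Register endpoints that were not listed in `nodes`.
        for node in (u, v):
            succ.setdefault(node, [])
            indegree.setdefault(node, 0)
        succ[u].append(v)
        indegree[v] += 1

    depth = {v: 0 for v in succ}  # longest path (in edges) ending at each node
    queue = deque(v for v in succ if indegree[v] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in succ[u]:
            depth[v] = max(depth[v], depth[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if visited != len(succ):
        raise ValueError("graph contains a cycle")
    return max(depth.values(), default=0)


# Toy graph: a -> b -> c -> d plus a shortcut a -> c.
toy_depth = dag_longest_path("abcd", [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
```

The shortcut edge does not shorten the result, because the metric takes the maximum over all directed paths\.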

##### Interaction strength\.

The second dimension captures how strongly token representations interact during computation\. If evidence is effectively integrated, we expect richer connectivity among tokens rather than isolated or weakly connected fragments\.

We measure this using two complementary metrics\. The average degree $\bm{\mathrm{AvgDeg}(G)} = \frac{1}{|V|} \sum_{v \in V} \left( \deg^{\text{in}}(v) + \deg^{\text{out}}(v) \right)$ captures the typical number of interactions per token\. Here, $\deg^{\text{in}}(v)$ and $\deg^{\text{out}}(v)$ denote the in\-degree and out\-degree of node $v$\. The directed edge density $\bm{\mathrm{Dens}(G)} = \frac{|E|}{|V|(|V|-1)}$ measures how densely the reasoning circuit is connected overall\. Higher values indicate stronger and more widespread evidence interaction\.
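Both metrics follow directly from node and edge counts, as the sketch below shows\. Since every edge adds one out\-degree and one in\-degree, the average degree equals $2|E| / |V|$\. The function names are hypothetical, and returning 0 for degenerate graphs is a convention chosen for this example\.

```python
from typing import Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def avg_degree(nodes: Iterable[Hashable], edges: Iterable[Edge]) -> float:
    """AvgDeg(G): mean of in-degree plus out-degree, which equals 2|E| / |V|."""
    n = len(set(nodes))
    m = len(set(edges))
    return 2.0 * m / n if n else 0.0


def edge_density(nodes: Iterable[Hashable], edges: Iterable[Edge]) -> float:
    """Dens(G) = |E| / (|V| (|V| - 1)) for a directed graph without self-loops."""
    n = len(set(nodes))
    m = len(set(edges))
    return m / (n * (n - 1)) if n > 1 else 0.0


# Toy graph with 4 nodes and 3 directed edges.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "d")]
toy_avg_deg = avg_degree(toy_nodes, toy_edges)
toy_density = edge_density(toy_nodes, toy_edges)
```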

##### Structural organization across scales\.

We further characterize how information is organized at both local and global scales\.

Local fragmentation is captured by the fraction of disconnected triads, $\bm{T_{\text{disc}}(G)} = \#\text{disc}(G) \big/ \sum_{\tau} \#\tau(G)$, where $\#\text{disc}(G)$ is the number of disconnected triads \(three nodes with no edges among them\) and the denominator sums the counts of all triad types $\tau$\. Larger values indicate that nearby nodes fail to interact, suggesting fragmented local structure\.

Branching\-style local aggregation is measured by $\bm{T_{\text{branch}}(G)} = \#\text{branch}(G) \big/ \sum_{\tau} \#\tau(G)$, where a branch triad is a three\-node pattern in which two nodes both point to the same third node\. Higher values indicate that information from multiple sources tends to merge into a single intermediate node, reflecting localized aggregation rather than linear propagation\.
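The two triad fractions can be sketched by brute\-force enumeration of node triples\. This example assumes that the denominator equals the total number of three\-node subsets, and that a branch triad contains exactly the two converging edges and no others\. Both are interpretations of the definitions above, not details confirmed by the paper, and the function name `triad_fractions` is hypothetical\.

```python
from itertools import combinations
from typing import Hashable, Iterable, Tuple


def triad_fractions(
    nodes: Iterable[Hashable], edges: Iterable[Tuple[Hashable, Hashable]]
) -> Tuple[float, float]:
    """Return (T_disc, T_branch) by enumerating all three-node subsets.

    Cubic in the number of nodes, so only suitable for small graphs.
    """
    edge_set = set(edges)
    total = disc = branch = 0
    for triple in combinations(sorted(set(nodes), key=repr), 3):
        total += 1
        present = [
            (u, v) for u in triple for v in triple
            if u != v and (u, v) in edge_set
        ]
        if not present:
            disc += 1  # no edges among the three nodes
        elif len(present) == 2 and present[0][1] == present[1][1]:
            branch += 1  # two distinct sources point to the same target
    if total == 0:
        return 0.0, 0.0
    return disc / total, branch / total


# Toy graph: a -> c and b -> c, with d isolated.
t_disc, t_branch = triad_fractions("abcd", [("a", "c"), ("b", "c")])
```

In the toy graph, the triple \(a, b, c\) is the only branch triad and \(a, b, d\) is the only disconnected one, out of four triples in total\.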

Finally, global concentration of information flow is captured by $\bm{\mathrm{MaxPR}(G)} = \max_{v \in V} \mathrm{PR}(v)$, where $\mathrm{PR}(v)$ is the PageRank score of node $v$\. Larger values indicate that information flow is dominated by a single hub, whereas lower values suggest more distributed integration\.
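A minimal power\-iteration sketch of MaxPR follows\. The damping factor of 0\.85, the use of unweighted edges, and the uniform redistribution of rank from nodes without outgoing edges are all assumptions; the text does not state which PageRank variant is used\. The function names are hypothetical, and a non\-empty node set is assumed\.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def pagerank(
    nodes: Iterable[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
    damping: float = 0.85,  # assumed value, not taken from the paper
    iters: int = 100,
) -> Dict[Hashable, float]:
    """Unweighted PageRank by power iteration; scores sum to 1."""
    node_list = list(dict.fromkeys(nodes))
    n = len(node_list)
    succ: Dict[Hashable, List[Hashable]] = {v: [] for v in node_list}
    for u, v in edges:
        succ[u].append(v)
    rank = {v: 1.0 / n for v in node_list}
    for _ in range(iters):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in node_list if not succ[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in node_list}
        for u in node_list:
            if succ[u]:
                share = damping * rank[u] / len(succ[u])
                for v in succ[u]:
                    new[v] += share
        rank = new
    return rank


def max_pagerank(nodes, edges, **kwargs) -> float:
    """MaxPR(G): the largest PageRank score in the graph."""
    return max(pagerank(nodes, edges, **kwargs).values())


# A hub that receives all edges versus a ring where rank is spread evenly.
star = max_pagerank("abch", [("a", "h"), ("b", "h"), ("c", "h")])
ring = max_pagerank("abcd", [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
```

The hub graph yields a higher MaxPR than the ring, matching the reading that larger values signal concentration on a single node\.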

Together, these six metrics provide a structural signature of each reasoning circuit\. By comparing these signatures between correct and incorrect predictions, we can identify how successful evidence integration differs from failure at the level of internal computation\.

#### 4\.1\.2\.A Study of Correct and Incorrect Question Answering

##### Setup

We study QA benchmarks \(HotpotQA \(Yang et al\., [2018](https://arxiv.org/html/2605.14192#bib.bib44)\), 2WikiMultihopQA \(Ho et al\., [2020](https://arxiv.org/html/2605.14192#bib.bib45)\), and MuSiQue \(Trivedi et al\., [2022](https://arxiv.org/html/2605.14192#bib.bib46)\)\), where answering requires composing evidence across multiple passages\. For each query, we retrieve a fixed\-size context and generate chain\-of\-thought answers using LLaMA\-3 8B Instruct \(Dubey et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib55)\) with greedy decoding\. We then assign a binary label $y \in \{0, 1\}$ \(incorrect or correct\) using an external LLM\-based judge, Gemini\-2\.5\-Flash\-Lite \(Comanici et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib56)\)\. Finally, for each dataset, we construct a balanced set of attribution graphs comprising 500 incorrect and 500 correct predictions, so that structural comparisons are directly comparable across classes\.

##### Finding 1: Correct Answers Arise from Deeper, More Structured, and More Evenly Distributed Circuits

A consistent structural contrast emerges across datasets: correct reasoning circuits are deep, densely interconnected, and broadly distributed, whereas wrong circuits are shallow, sparse, and overly centralized\.

Figure [1](https://arxiv.org/html/2605.14192#S4.F1) shows this separation across all structural metrics\. Correct answers are supported by attribution graphs that are deeper, more structurally organized, and more evenly distributed in how evidence is utilized\. Incorrect answers, in contrast, arise from circuits that are shallow, fragmented, and overly concentrated around a few dominant nodes\. In addition to LLaMA\-3 8B Instruct, we also analyze Qwen\-3 8B in Figure [8](https://arxiv.org/html/2605.14192#A1.F8) in the Appendix\. The results lead to consistent conclusions: correct and incorrect predictions exhibit clearly different structural patterns\.

Deeper vs\. shallower propagation\. Correct graphs exhibit longer directed propagation depth \(higher DAG\-L\), indicating that evidence signals travel through multi\-step internal routes before reaching the final answer tokens\. Incorrect predictions show shorter directed paths, suggesting that reasoning is truncated and relies on fewer intermediate transformations\.

Structured vs\. fragmented connectivity\. Correct circuits are more structurally organized at both global and local levels\. Globally, they display higher interaction richness, reflected in larger average degree \(AvgDeg\) and edge density \(Dens\), indicating stronger cross\-token coupling and more integrated evidence flow\. Locally, they contain fewer disconnected triads \(lower $T_{\text{disc}}$\), meaning that neighboring nodes are more likely to participate in coordinated interactions\. Incorrect circuits, by contrast, are sparser and more fragmented, with many local node groups remaining structurally isolated\. In addition, correct circuits contain more branching motifs \(higher $T_{\text{branch}}$\), indicating that intermediate states more often distribute information to multiple downstream components, reflecting a more structured and distributed reasoning process\.

Distributed vs\. concentrated evidence flow\. Finally, correct circuits make use of evidence in a more evenly distributed manner\. They exhibit lower maximum PageRank values \(MaxPR\), indicating that importance is spread across multiple nodes rather than dominated by a single hub\.

Taken together, these patterns show that correct reasoning emerges from circuits that are deep, structurally coherent, and broadly distributed in their use of evidence\. Incorrect answers, by contrast, arise from shallow, fragmented, and overly centralized circuits in which information either fails to propagate sufficiently or becomes concentrated on a small set of dominant nodes\.

![Refer to caption](https://arxiv.org/html/2605.14192v1/figure/radar.png)

Figure 1\. Radar comparison of attribution\-graph structural metrics between correct and wrong predictions across three QA datasets \(2Wiki, HotpotQA, and MuSiQue\)\.

![Refer to caption](https://arxiv.org/html/2605.14192v1/figure/musique_layer.png)

Figure 2\. Layer\-wise attribution mass for correct and wrong predictions \(left\) and their difference on MuSiQue \(right\)\.

##### Finding 2: Correct Predictions Use More Mid\-Layer Processing

In addition to graph structure, we analyze how attribution mass is distributed across transformer layers\. Figure [2](https://arxiv.org/html/2605.14192#S4.F2) shows a clear depth shift between correct and wrong predictions on MuSiQue\. Due to space constraints, the corresponding results for HotpotQA and 2Wiki are provided in Figure [9](https://arxiv.org/html/2605.14192#A1.F9) and Figure [10](https://arxiv.org/html/2605.14192#A1.F10) in the Appendix\.

Correct predictions allocate a larger fraction of the total activated neurons to the middle layers \(approximately layers 8–18\), indicating greater reliance on mid\-layer reasoning computations\. These layers are where the model typically combines information from different tokens and builds integrated representations\. The higher activity here suggests that correct answers depend on sustained internal processing that brings together evidence from the question and retrieved context\.

Incorrect predictions follow a different pattern\. They rely more on early layers\. Higher early\-layer activity indicates that the model may be matching surface\-level patterns from the retrieved text without deeper integration\.

Overall, correct answers are associated with deeper and more sustained information processing, while wrong answers tend to reflect shallower decision dynamics\.
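The layer\-wise comparison can be illustrated with a small sketch that normalizes per\-layer counts into fractions and sums the share falling in a band of layers\. The input format \(one count of activated units per layer\) and the function names are hypothetical; the toy example uses four layers rather than the layer 8–18 band mentioned above\.

```python
from typing import List, Sequence


def layer_fractions(layer_counts: Sequence[float]) -> List[float]:
    """Normalize per-layer counts so that the fractions sum to 1."""
    total = sum(layer_counts)
    if total == 0:
        return [0.0] * len(layer_counts)
    return [count / total for count in layer_counts]


def band_share(layer_counts: Sequence[float], start: int, end: int) -> float:
    """Share of the total that falls in layers start..end (inclusive, 0-indexed)."""
    return sum(layer_fractions(layer_counts)[start : end + 1])


# Toy 4-layer example: most activity sits in the last two layers.
toy_counts = [1, 1, 2, 4]
toy_share = band_share(toy_counts, 2, 3)
```

Comparing this share between the correct and incorrect groups gives a single number per group that summarizes the depth shift\.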

![Refer to caption](https://arxiv.org/html/2605.14192v1/figure/output.png)

Figure 3\. Region\-level attribution comparison between correct and wrong predictions in the mixed\-context setting\. Bars show relative weights for $Q \rightarrow Q$ and $Q \rightarrow \mathrm{Ans\_EXT}$\.

### 4\.2\.Circuit Analysis under Mixed Context

The structural circuit analysis above establishes *what* differentiates correct and incorrect predictions: correct reasoning is supported by deeper, more connected, and more integrative attribution graphs, whereas incorrect predictions rely on fragmented and weakly coordinated structures\. We now turn to a complementary question: *how do these structural differences emerge over the course of computation?*

To address this, we examine the routing dynamics of information across layers during decoding\. Our analysis focuses on a mixed\-context setting\. In this setting, each question is paired with retrieved context that intentionally contains both supporting and non\-supporting passages\.

This scenario is particularly informative because the model must not only leverage external evidence, but also distinguish relevant signals from distractors while remaining aligned with the question\. As a result, it provides a controlled testbed for analyzing how internal routing dynamics differ when evidence selection and integration become genuinely challenging\.

We analyze these dynamics by grouping tokens into functional regions and tracking how attribution mass flows between them across layers\. This provides a stage\-wise view of how the model balances question understanding, reliance on externally grounded answer content, and internally composed answer representations during reasoning\.

![Refer to caption](https://arxiv.org/html/2605.14192v1/figure/Q-Q.png)

Figure 4\. Layer\-wise attribution comparison for $Q \rightarrow \mathrm{Ans\_EXT}$ and $Q \rightarrow Q$\. Left: mean routing strength per layer for correct and wrong predictions\. Right: relative differences \(green = correct higher, red = wrong higher\)\.

![Refer to caption](https://arxiv.org/html/2605.14192v1/figure/q-q3.png)

Figure 5\. Layer\-wise attribution weight for multiple region\-level edge types under mixed\-context decoding, comparing correct and wrong predictions\.

#### 4\.2\.1\.Region\-Level Routing Decomposition

We partition tokens into three functional regions based on their roles in the reasoning process:QQis the input question tokens;Ans​\_​EXT\\mathrm\{Ans\\\_EXT\}indicates answer tokens that are attributable to retrieved external context; andAns​\_​INT\\mathrm\{Ans\\\_INT\}denotes answer tokens that are generated internally by the model and have no direct alignment with retrieved context\.

For each transformer layerℓ\\ell, we measure how attribution flows between these regions\. Leta​\(i→j\)a\(i\\rightarrow j\)denote the weight from source tokeniito target tokenjjat layerℓ\\ellWe then aggregate attribution mass at the region level as

$$A^{(\ell)}_{X \rightarrow Y} = \sum_{i \in X} \sum_{j \in Y} a(i \rightarrow j), \qquad X, Y \in \{Q, \mathrm{Ans\_EXT}, \mathrm{Ans\_INT}\}, \tag{1}$$

where $A^{(\ell)}_{X \rightarrow Y}$ measures the total attribution routed from region $X$ to region $Y$ at layer $\ell$ by summing over all source–target token pairs $(i, j)$ with $i \in X$ and $j \in Y$\. This aggregation yields a layer\-wise routing profile that reveals how the model distributes computation between understanding the question, leveraging externally aligned answer content, and developing internally composed answer representations\.
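As an illustration of Eq\. \(1\), the following minimal Python sketch aggregates token\-level attribution edges into region\-level mass for a single layer\. The edge\-list input format, the `region_of` mapping, and the function name `region_routing` are assumptions made for this example and are not taken from the original pipeline\.

```python
REGIONS = ("Q", "Ans_EXT", "Ans_INT")

def region_routing(edges, region_of):
    """Sum attribution a(i -> j) over all pairs with i in X and j in Y (Eq. 1).

    edges: iterable of (i, j, weight) for one layer; i is the source token.
    region_of: hypothetical dict mapping a token index to its region name.
    """
    mass = {(x, y): 0.0 for x in REGIONS for y in REGIONS}
    for i, j, w in edges:
        x, y = region_of.get(i), region_of.get(j)
        if x in REGIONS and y in REGIONS:  # ignore tokens outside the three regions
            mass[(x, y)] += w
    return mass

# Toy layer: tokens 0 and 1 are question tokens, 2 is Ans_EXT, 3 is Ans_INT.
region_of = {0: "Q", 1: "Q", 2: "Ans_EXT", 3: "Ans_INT"}
edges = [(0, 1, 0.5), (1, 0, 0.25), (0, 2, 0.75), (1, 3, 0.5), (2, 3, 1.0)]
profile = region_routing(edges, region_of)
```

Calling `region_routing` once per layer and stacking the results would give the layer\-wise routing profile described above\.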

#### 4\.2\.2\.A Study of Reasoning Patterns in the Mixed\-Context Setting

##### Setup

Our analysis centers on a mixed\-context setting derived from MuSiQue, which we term Mix\-MuSiQue\. In this variant, each question is paired with a retrieved context deliberately constructed to include both supporting and non\-supporting passages\. This design creates a controlled mixed\-evidence scenario that tests the model’s ability to selectively use relevant information while ignoring distractors\. The dataset comprises 667 questions under this setting\. Answers are generated using LLaMA\-3 8B Instruct with greedy decoding, and the resulting responses are evaluated by an external LLM judge, Gemini\-2\.5\-Flash\-Lite\.

##### Overall Pattern: Question\-Guided Reasoning vs\. External Over\-Reliance

Across layers, a clear global contrast emerges: correct predictions emphasize question understanding, whereas incorrect predictions over\-rely on externally aligned answer content\.

As shown in Figure [3](https://arxiv.org/html/2605.14192#S4.F3), correct samples consistently allocate more routing mass to $Q \rightarrow Q$\. This indicates that the model places greater emphasis on understanding the question, first building a stable internal representation of the reasoning objective and then continuing to use it as a constraint while forming the answer\.

Incorrect samples, in contrast, show relatively weaker question consolidation and stronger routing from $Q \rightarrow \mathrm{Ans\_EXT}$\. This suggests a shortcut strategy: instead of constructing the answer through question\-guided reasoning, the model leans heavily on answer fragments that are directly supported by retrieved content\. As a result, the model tends to use context that is superficially aligned with the question, yielding locally plausible reasoning steps, while remaining misaligned with the deeper semantic constraints required to correctly solve the problem\.

Thus, the key difference is not simply how much external information is used, but *whether answer formation is anchored in a well\-formed question representation*\.

##### Layer\-wise Distribution of Attribution Weight

While the previous section examined the overall routing distribution, we now provide a more detailed layer\-wise analysis to better understand how routing patterns evolve across the network, as shown in Figure [4](https://arxiv.org/html/2605.14192#S4.F4) and Figure [5](https://arxiv.org/html/2605.14192#S4.F5)\.
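A layer\-wise comparison of this kind can be sketched as follows\. The code averages a per\-layer routing value \(for example $A^{(\ell)}_{Q \rightarrow Q}$\) over correct and wrong examples and then computes a signed relative difference\. The normalization by the larger of the two means is an assumption chosen for illustration; the text does not state how the relative differences in Figure 4 are computed, and `layerwise_comparison` is a hypothetical helper\.

```python
def layerwise_comparison(correct, wrong):
    """Compare mean routing strength per layer between two groups.

    correct, wrong: lists of per-example profiles, one float per layer.
    Returns (mean_correct, mean_wrong, relative_difference) per layer.
    """
    n_layers = len(correct[0])
    mean_c = [sum(p[l] for p in correct) / len(correct) for l in range(n_layers)]
    mean_w = [sum(p[l] for p in wrong) / len(wrong) for l in range(n_layers)]
    rel = []
    for c, w in zip(mean_c, mean_w):
        denom = max(abs(c), abs(w))  # assumed normalization, not from the paper
        rel.append(0.0 if denom == 0 else (c - w) / denom)
    return mean_c, mean_w, rel

# Toy data: two layers, two examples per group.
correct = [[0.5, 0.25], [0.5, 0.75]]
wrong = [[0.25, 0.5], [0.25, 0.5]]
mean_c, mean_w, rel = layerwise_comparison(correct, wrong)
```

A positive entry in `rel` means the correct group routes more mass through that edge type at that layer; a negative entry means the wrong group does\.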

##### Low Layers \(0–7\): Establishing the Question Anchor

The divergence begins in the lowest layers\. Correct predictions devote substantially more routing to $Q \rightarrow Q$, indicating stronger internal consolidation of the question before heavy involvement of answer\-related representations\. This early investment builds a stable semantic anchor that guides downstream reasoning\.

Wrong predictions, however, show weaker $Q \rightarrow Q$ routing and relatively stronger early routing from $Q$ toward $\mathrm{Ans\_EXT}$\. The model starts linking the question directly to externally aligned answer content before fully stabilizing the question representation itself\. As a result, early processing is driven more by surface alignment with retrieved information than by a structured internal reasoning objective\.

This early imbalance sets the stage for later errors: when question understanding is shallow, answer formation becomes vulnerable to external bias\.

##### Higher Layers \(8–31\): Answer\-Focused Refinement

In the higher layers, routing patterns become broadly answer\-centric for *both* correct and wrong predictions\. The dominant activity shifts toward refining internal answer representations, consistent with a late stage where the model mainly stabilizes the emerging answer and expresses it in fluent natural language\. Correct predictions, however, still retain slightly stronger $Q \rightarrow Q$ routing than wrong ones, indicating a small but persistent influence of the question even at late stages\.

##### Summary: A Depth\-Wise Shift from Question Guidance to External Drift

Correct and incorrect reasoning exhibit markedly different depth\-wise trajectories\. Correct predictions begin with strong early consolidation of the question representation, followed by answer formation that remains consistently constrained by it, a pattern we term question\-constrained evidence grounding \(QCEG\)\. In contrast, incorrect predictions show weak early question grounding, transition prematurely toward externally aligned answer content, and ultimately refine answers under the dominance of externally driven content\. We refer to this failure mode as surface\-aligned evidence grounding \(SAEG\)\.

Thus, mixed\-context errors arise from a progressive routing shift: computation moves away from question\-guided reasoning and toward externally driven answer construction\. Once this shift occurs in early layers, later processing tends to amplify the misalignment\.

## 5\.Graph\-Structural Detection of Unfaithful Predictions

Our structural analysis reveals that correct and unfaithful predictions differ in the internal organization of their reasoning circuits\. Correct answers tend to be supported by deeper, more integrated, and more coherent attribution graphs, whereas unfaithful answers arise from fragmented and weakly coordinated structures\.

We leverage this insight by framing faithfulness detection as a *graph classification* problem\. If structural organization systematically differs between correct and unfaithful reasoning, then a model should be able to predict answer faithfulness directly from the structure of its attribution graph\.

Concretely, given an attribution graph $G=(V,E)$ constructed from a model prediction, our goal is to estimate $p(y=1 \mid G)$, the probability that the prediction is faithful \($y=1$\) rather than unfaithful \($y=0$\), using only internal structural information\. To achieve this, we employ a graph neural architecture that captures both local evidence propagation patterns and global circuit organization\.

### 5\.1\.Graph Features

Each attribution graph $G$ contains nodes $v \in V$ representing question tokens, retrieved context tokens, intermediate activations, and generated tokens, and directed edges $(u,v) \in E$ representing causal attribution links\. Each edge carries a scalar weight $w_{uv}$ indicating attribution strength\.

##### Node features\.

Each node $v$ is associated with a feature vector $\mathbf{x}_v \in \mathbb{R}^{d_x}$, which concatenates a one\-hot encoding of node type \(e\.g\., question, context, answer, intermediate\) with normalized structural signals such as in\-degree, out\-degree, total degree, and PageRank score\. These features describe both the functional role and the local structural importance of each node\.
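The node feature construction can be illustrated with a small pure\-Python sketch\. It concatenates a one\-hot node type with degree counts and a PageRank score\. Several details are assumptions made for this example: the order of `NODE_TYPES`, normalizing each structural signal by its maximum over the graph, weighting PageRank transitions by absolute attribution weight, and the damping factor 0\.85\. The paper states only that the signals are normalized\.

```python
NODE_TYPES = ("question", "context", "intermediate", "answer")

def pagerank(nodes, edges, damping=0.85, iters=100):
    """Power-iteration PageRank on a weighted directed graph."""
    n = len(nodes)
    out_w = {v: 0.0 for v in nodes}
    for u, _, w in edges:
        out_w[u] += abs(w)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Mass held by nodes without outgoing weight is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_w[v] == 0.0)
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for u, v, w in edges:
            if out_w[u] > 0.0:
                new[v] += damping * rank[u] * abs(w) / out_w[u]
        rank = new
    return rank

def node_features(nodes, node_type, edges):
    """Return {node: one-hot type + [in, out, total degree, PageRank]}."""
    in_deg = {v: 0 for v in nodes}
    out_deg = {v: 0 for v in nodes}
    for u, v, _ in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    pr = pagerank(nodes, edges)
    max_deg = max(1, max(in_deg[v] + out_deg[v] for v in nodes))
    max_pr = max(pr.values())
    feats = {}
    for v in nodes:
        one_hot = [1.0 if node_type[v] == t else 0.0 for t in NODE_TYPES]
        total = in_deg[v] + out_deg[v]
        feats[v] = one_hot + [in_deg[v] / max_deg, out_deg[v] / max_deg,
                              total / max_deg, pr[v] / max_pr]
    return feats

# Toy attribution graph with hypothetical node names.
nodes = ["q0", "c0", "h0", "a0"]
node_type = {"q0": "question", "c0": "context", "h0": "intermediate", "a0": "answer"}
edges = [("q0", "h0", 0.8), ("c0", "h0", 0.6), ("h0", "a0", 1.2)]
feats = node_features(nodes, node_type, edges)
```

Here each node receives an 8\-dimensional vector: four type indicators followed by four structural signals\.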

##### Edge features\.

Each directed edge $(u,v)$ is assigned a one\-dimensional feature $\mathbf{e}_{uv} = \tanh(w_{uv})$, a bounded transformation of the attribution weight\. This scalar encodes how strongly information flows from $u$ to $v$ within the reasoning circuit\.

##### Graph\-level topology signatures\.

In addition to node\- and edge\-level information, we compute a vector of global structural statistics $\mathbf{g}(G) \in \mathbb{R}^{d_g}$, including measures such as longest\-path depth, average degree, triad ratios, graph density, and maximum PageRank, as analyzed in Section [4\.1\.1](https://arxiv.org/html/2605.14192#S4.SS1.SSS1)\. These metrics summarize the overall organizational patterns of the circuit\.
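A few of these graph\-level statistics can be computed as in the sketch below\. It covers longest\-path depth, average degree, and density only; triad ratios and maximum PageRank are omitted because their exact definitions are given in a section not reproduced here\. The sketch assumes the attribution graph is acyclic, counts depth in edges, and uses the common definitions $2|E|/|V|$ for average degree and $|E|/(|V|(|V|-1))$ for directed density; all of these are assumptions, and `topology_signature` is a hypothetical name\.

```python
from collections import deque

def topology_signature(nodes, edges):
    """Compute a small dictionary of global statistics for a directed graph.

    nodes: list of node ids; edges: list of (u, v) pairs.
    """
    n, m = len(nodes), len(edges)
    succ = {v: [] for v in nodes}
    indeg = {v: 0 for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # Longest path via Kahn's topological order (valid only for acyclic graphs).
    depth = {v: 0 for v in nodes}
    queue = deque(v for v in nodes if indeg[v] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in succ[u]:
            depth[v] = max(depth[v], depth[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if visited != n:
        raise ValueError("graph contains a cycle")
    return {
        "longest_path_depth": max(depth.values()) if n else 0,
        "average_degree": 2.0 * m / n if n else 0.0,
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
    }

sig = topology_signature(
    ["q0", "c0", "h0", "a0"],
    [("q0", "h0"), ("c0", "h0"), ("h0", "a0"), ("q0", "a0")],
)
```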

### 5\.2\.Graph Transformer Encoder

To capture both local evidence propagation and long\-range structural interactions, we use a graph transformer encoder that alternates between message passing and attention\. Let $\mathbf{h}_v^{(0)} = \mathrm{MLP}_{\text{in}}(\mathbf{x}_v)$ be the initial node embedding\. Each layer $\ell = 1, \dots, L$ then performs two stages\.

##### Local structural propagation\.

We first update node states using a message passing operator that aggregates information from immediate neighbors:

$$\tilde{\mathbf{h}}_v^{(\ell)} = \mathbf{h}_v^{(\ell-1)} + \mathrm{MPNN}\!\left(v, \{\mathbf{h}_u^{(\ell-1)}, \mathbf{e}_{uv}\}_{u \in \mathcal{N}(v)}\right),$$

where $\mathcal{N}(v)$ denotes the neighbors of $v$\. This step models how evidence locally accumulates along attribution edges\.

##### Global attention interaction\.

We then apply a graph attention mechanism to allow non\-local structural interactions:

$$\mathbf{h}_v^{(\ell)} = \tilde{\mathbf{h}}_v^{(\ell)} + \sum_{u \in V} \alpha_{vu}^{(\ell)} \mathbf{W}^{(\ell)} \tilde{\mathbf{h}}_u^{(\ell)},$$

where attention weights $\alpha_{vu}^{(\ell)}$ are computed from node features and edge features\. This stage enables the model to capture long\-range coordination across different parts of the reasoning circuit\. After $L$ layers, each node has a final representation $\mathbf{h}_v^{(L)}$ encoding both local and global structural context\.
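To make the two\-stage update concrete, here is a heavily simplified sketch of one encoder layer in plain Python\. It is not the authors’ architecture: the MPNN operator is replaced by an edge\-weighted sum over incoming neighbors, the attention weights are a softmax over scaled dot products of node states only \(the paper also uses edge features\), and $\mathbf{W}^{(\ell)}$ is taken to be the identity\. The names `mp_step` and `attention_step` are hypothetical\.

```python
import math

def mp_step(h, in_edges):
    """Local stage: residual update with an edge-weighted neighbor sum.

    h: dict node -> list[float]; in_edges: dict node -> list of (u, e_uv).
    """
    out = {}
    for v, hv in h.items():
        msg = [0.0] * len(hv)
        for u, e_uv in in_edges.get(v, []):
            for k, x in enumerate(h[u]):
                msg[k] += e_uv * x  # stand-in for the unspecified MPNN operator
        out[v] = [a + b for a, b in zip(hv, msg)]
    return out

def attention_step(h_tilde):
    """Global stage: residual update with softmax attention over all nodes."""
    nodes = list(h_tilde)
    d = len(h_tilde[nodes[0]])
    out = {}
    for v in nodes:
        scores = [sum(a * b for a, b in zip(h_tilde[v], h_tilde[u])) / math.sqrt(d)
                  for u in nodes]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        z = sum(exps)
        alpha = [e / z for e in exps]
        mix = [sum(alpha[i] * h_tilde[u][k] for i, u in enumerate(nodes))
               for k in range(d)]
        out[v] = [a + b for a, b in zip(h_tilde[v], mix)]  # W assumed identity
    return out

h0 = {"q": [1.0, 0.0], "c": [0.0, 1.0], "a": [0.0, 0.0]}
in_edges = {"a": [("q", 0.5), ("c", 0.5)]}
h_tilde = mp_step(h0, in_edges)
h1 = attention_step(h_tilde)
```

In the toy graph, the answer node `a` first accumulates evidence from `q` and `c` along its incoming edges, and the attention stage then lets every node read from every other node regardless of graph distance\.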

### 5\.3\.Graph\-Level Readout

We convert node embeddings into a graph representation using simple multi\-statistic pooling:

$$\mathbf{h}_{\text{pool}}(G) = \big[\, \mathrm{mean}_{v \in V}\, \mathbf{h}_v^{(L)} \;\|\; \mathrm{sum}_{v \in V}\, \mathbf{h}_v^{(L)} \;\|\; \mathrm{max}_{v \in V}\, \mathbf{h}_v^{(L)} \,\big].$$

Here $\|$ denotes concatenation\. Mean pooling captures the overall structural tendency, sum pooling captures the total evidence mass, and max pooling captures the strongest pathway signal\.

We additionally embed a small set of global topology statistics $\mathbf{g}(G) \in \mathbb{R}^{d_g}$:

$$\mathbf{h}_g(G) = \phi_g(\mathbf{g}(G)),$$

where $\phi_g$ is a small MLP\. The final graph representation is simply

$$\mathbf{z}(G) = \big[\mathbf{h}_{\text{pool}}(G) \;\|\; \mathbf{h}_g(G)\big].$$

A final classifier produces $p(y=1 \mid G)$ from $\mathbf{z}(G)$\.
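The readout can be sketched in a few lines of Python\. The pooling follows the mean, sum, max concatenation described above; the topology embedding is passed in as a plain list that stands in for the output of the small MLP $\phi_g$, which is not implemented here\. Function names are hypothetical\.

```python
def multi_stat_pool(node_states):
    """Concatenate mean, sum, and max pooling over final node states."""
    n = len(node_states)
    d = len(node_states[0])
    total = [sum(h[k] for h in node_states) for k in range(d)]
    mean = [t / n for t in total]
    peak = [max(h[k] for h in node_states) for k in range(d)]
    return mean + total + peak  # [mean || sum || max]

def graph_representation(node_states, topo_embedding):
    """Build z(G) by appending the embedded topology statistics."""
    return multi_stat_pool(node_states) + list(topo_embedding)

states = [[1.0, 2.0], [3.0, 0.0]]
pooled = multi_stat_pool(states)
z = graph_representation(states, [0.5])
```

With hidden size $d$, the pooled vector has $3d$ entries, so the classifier input has $3d$ plus the topology embedding size\.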

### 5\.4\.Unfaithful Prediction

A final classifier produces the probability that the prediction is correct:

$$p(y=1 \mid G) = \mathrm{softmax}\big(\mathrm{MLP}_{\text{out}}(\mathbf{z}(G))\big)_1.$$

We predict an output as unfaithful when $p(y=1 \mid G) < 0.5$\.
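The decision rule amounts to a two\-class softmax followed by a threshold\. The sketch below assumes the classifier emits two logits ordered as \(unfaithful, faithful\), which matches taking index 1 of the softmax; the function names are hypothetical\.

```python
import math

def faithful_probability(logits):
    """Return softmax(logits)[1] for logits ordered as [y=0, y=1]."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    return exps[1] / sum(exps)

def is_unfaithful(logits, threshold=0.5):
    """Flag the prediction as unfaithful when p(y=1 | G) < threshold."""
    return faithful_probability(logits) < threshold

flag_low = is_unfaithful([2.0, -1.0])   # low faithful probability
flag_high = is_unfaithful([-1.0, 2.0])  # high faithful probability
```

Because the comparison is strict, a probability of exactly 0\.5 is not flagged as unfaithful\.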

This detector checks whether the model’s internal reasoning forms a coherent structure or a fragmented one\. Local message passing tracks how evidence flows along attribution paths, global attention connects distant but related parts of the circuit, and graph\-level statistics summarize the overall depth and organization\. Together, these signals reveal reasoning failures that are not visible from the answer text or retrieved documents alone\.

### 5\.5\.Results

We now evaluate whether internal reasoning structure can be used to reliably predict when a RAG model’s answer is correct\.

#### 5\.5\.1\.Experimental Setup\.

We evaluate detection on attribution graphs derived from HotpotQA, 2WikiMultihopQA, and MuSiQue\. For each dataset, we follow the fixed graph\-based split defined during graph construction: up to 500 wrong and 500 correct examples are collected in deterministic filename order, 250 per class are sampled for training/validation, and the remaining 500 examples form a balanced test set\. All methods are evaluated on the same fixed indices to ensure direct comparability\.
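The split procedure described above could be reproduced along these lines\. The sketch caps each class at 500 files in sorted filename order, samples 250 per class for training/validation, and sends the rest to the test set\. The random seed, the file naming, and the function name `fixed_split` are assumptions; the paper does not state how the sampling is seeded\.

```python
import random

def fixed_split(filenames_by_class, per_class_cap=500, train_per_class=250, seed=0):
    """Return (train, test) lists of (filename, label) pairs."""
    rng = random.Random(seed)  # assumed seed, for reproducibility only
    train, test = [], []
    for label in sorted(filenames_by_class):
        # Deterministic filename order, truncated to the per-class cap.
        pool = sorted(filenames_by_class[label])[:per_class_cap]
        chosen = set(rng.sample(pool, min(train_per_class, len(pool))))
        train += [(f, label) for f in pool if f in chosen]
        test += [(f, label) for f in pool if f not in chosen]
    return train, test

# Hypothetical graph files: 600 correct and 500 wrong examples.
files = {
    "correct": [f"correct_{i:04d}.json" for i in range(600)],
    "wrong": [f"wrong_{i:04d}.json" for i in range(500)],
}
train, test = fixed_split(files)
```

Fixing both the ordering and the seed means every detector is evaluated on the same indices\.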

As a non\-structural baseline, we use a logit\-based self\-judging signal computed from the model’s own output distribution, without access to gold answers\. After generating an answer, the same model is prompted to judge whether its prediction is correct given the question, retrieved context, and its own reasoning trace, and is restricted to a binary response \(“Yes” or “No”\)\. We compute $\log p(\text{Yes})$ and $\log p(\text{No})$ for the two continuations and predict correctness when $\log p(\text{Yes}) > \log p(\text{No})$\.
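The baseline’s decision rule is simple enough to state in code\. The sketch below takes the two log\-probabilities as given \(obtaining them from a model is outside its scope\) and scores the resulting predictions against correctness labels\. The toy numbers are invented for illustration only and are not results from the paper\.

```python
def self_judge(logp_yes, logp_no):
    """Predict 'correct' when log p(Yes) > log p(No)."""
    return logp_yes > logp_no

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Illustrative (log p(Yes), log p(No)) pairs and judged correctness labels.
scores = [(-0.1, -2.3), (-1.6, -0.2), (-0.7, -0.7)]
labels = [True, False, True]
preds = [self_judge(y, n) for y, n in scores]
acc = accuracy(preds, labels)
```

Note that a tie between the two log\-probabilities is treated as “not correct” because the inequality is strict\.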

#### 5\.5\.2\.Graph Detector Training\.

The graph detector uses a graph transformer encoder with $L=2$ layers and hidden size 128, trained with AdamW \(learning rate $10^{-4}$, batch size 32\) and dropout 0\.1\. For each dataset, we construct a balanced set of 1,000 graphs \(500 incorrect and 500 correct\) and use a fixed split with 250 graphs per class for training/validation and 250 per class for testing\.

#### 5\.5\.3\.Detection Performance\.

Figure [6](https://arxiv.org/html/2605.14192#S5.F6) shows that the graph\-structural detector consistently outperforms the logit\-based self\-judging baseline on all three QA benchmarks\. Averaged across datasets, our method improves accuracy by 11\.53%\.

These results demonstrate that modeling internal reasoning structure provides a substantially more reliable signal of answer correctness than relying on the model’s own output confidence\. While logit\-based self\-evaluation reflects surface\-level uncertainty, it cannot determine whether evidence has been integrated through a coherent multi\-step reasoning circuit\. In contrast, the graph\-structural detector directly measures the organization of evidence flow, enabling more accurate detection of unfaithful predictions under challenging retrieval conditions\.

![Refer to caption](https://arxiv.org/html/2605.14192v1/figure/three.png)

Figure 6\. Performance comparison across QA benchmarks\.

## 6\.Attention Intervention for RAG Improvement

Section [4\.2\.2](https://arxiv.org/html/2605.14192#S4.SS2.SSS2.Px1) shows that incorrect predictions under mixed retrieval are not random errors but arise from a routing pattern: the model under\-invests in early question consolidation, over\-commits to surface\-matched context, and later drifts into self\-reinforcing decoding that is weakly constrained by the question\. This section turns that diagnosis into an actionable control mechanism\. Instead of changing parameters or re\-retrieving, we intervene directly on the attention computation during decoding to encourage the routing behavior characteristic of correct predictions\.

### 6\.1\.Intervention as Layer\-Wise Routing Control

The analysis above reveals that RAG errors arise from a systematic shift in how information is routed across layers\. Correct predictions maintain strong question grounding throughout the computation, whereas wrong predictions exhibit two coupled failures: \(i\) insufficient early consolidation of the question representation, and \(ii\) progressively increasing reliance on external information\.

We therefore design an intervention that directly reshapes routing preferences inside the attention mechanism\. The goal is not to retrain the model, but to gently bias information flow toward the routing regime associated with correct reasoning\. We partition token positions into three semantically meaningful regions:

- $Q$: question tokens,
- $Ex$: external retrieved context tokens,
- $In$: internally generated tokens\.

#### 6\.1\.1\.Control 1: Strengthening Early Question Understanding

Section [4\.2\.2](https://arxiv.org/html/2605.14192#S4.SS2.SSS2.Px1) shows that wrong predictions underutilize $Q \rightarrow Q$ routing in early layers\. As a result, the question representation is not sufficiently consolidated before interacting with retrieved context\.

To counteract this, we amplify attention among question tokens in lower layers:

$$\alpha^{(\ell)}_{Q \rightarrow Q} = \alpha_{\text{QQ}} \quad \text{for } \ell \in \mathcal{L}_{\text{low}}, \quad \alpha_{\text{QQ}} > 1.$$

This encourages deeper internal integration of question semantics before heavy evidence mixing occurs\.

#### 6\.1\.2\.Control 2: Suppressing Premature Context Reliance

We further observe that incorrect predictions tend to route attention toward external context too early, leading to brittle surface alignment rather than deep reasoning\.

We therefore down\-weight attention whose *target* lies in the external region during the same early stage:

$$\alpha^{(\ell)}_{\ast \rightarrow Ex} = \alpha_{\text{ctx}} \quad \text{for } \ell \in \mathcal{L}_{\text{low}}, \quad \alpha_{\text{ctx}} < 1.$$

Here $\ast$ denotes any source region\. This control reduces the influence of retrieved tokens before the question representation has stabilized\.

#### 6\.1\.3\.Control 3: Maintaining Question\-Guided Decoding

In later layers, answer tokens are iteratively refined\. For correct predictions, routing from the question to internal answer states remains active, ensuring that decoding stays constrained by the original task\. Incorrect predictions, in contrast, show weakening $Q \rightarrow In$ routing and increasing self\-reinforcement among answer tokens\.

To maintain question guidance during decoding, we strengthen attention from question tokens to internally generated tokens in higher layers:

$$\alpha^{(\ell)}_{Q \rightarrow In} = \alpha_{\text{QIn}} \quad \text{for } \ell \in \mathcal{L}_{\text{high}}, \quad \alpha_{\text{QIn}} > 1.$$

#### 6\.1\.4\.How is the Control Applied in the Model?

The proposed intervention does not alter any model parameters and requires no retraining\. Instead, it operates directly on the model’s internal attention computation during inference\. We implement this control using forward hooks inserted into selected transformer layers\.

A hook is a lightweight function inserted into the model’s forward pass that intercepts and modifies intermediate activations without changing model parameters\. In our case, the hook intercepts the attention pattern immediately before it is used to aggregate value vectors\. At chosen layers, we rescale specific groups of attention weights based on the semantic regions of the source and target tokens \(question, retrieved context, or generated answer tokens\) as well as the layer index\.

Because the modification occurs at the level of attention weights, it changes the relative influence that different token groups exert on one another, thereby steering information routing inside the network\. Importantly, all original model weights remain frozen: the intervention introduces only a small amount of element\-wise scaling applied on\-the\-fly during the forward pass\. As a result, the computational overhead is negligible, and the model’s base behavior can be fully restored by simply removing the hooks\.
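The three controls can be sketched as a single element\-wise rescaling of an attention matrix\. This is an illustrative sketch under stated assumptions, not the authors’ hook implementation: it operates on a plain nested list rather than a framework tensor, it assumes rows index the target \(query\) position and columns the source \(key\) position, it takes $\mathcal{L}_{\text{low}}$ and $\mathcal{L}_{\text{high}}$ to be layers 0–7 and 8–31 as in the analysis of Section 4\.2\.2, and it does not renormalize rows after scaling, since the text does not say whether renormalization is applied\. The default scaling factors follow the values reported in the experimental setup\.

```python
LOW_LAYERS = range(0, 8)    # assumed to match the "low layers" of Section 4.2.2
HIGH_LAYERS = range(8, 32)  # assumed to match the "higher layers"

def rescale_attention(attn, regions, layer, a_qq=1.5, a_ctx=0.5, a_qin=1.5):
    """Return a rescaled copy of one head's attention matrix.

    attn[t][s]: weight with which target position t attends to source s.
    regions[p]: region of position p, one of "Q", "Ex", "In".
    """
    out = [row[:] for row in attn]
    for t, row in enumerate(out):
        for s in range(len(row)):
            src, tgt = regions[s], regions[t]
            if layer in LOW_LAYERS:
                if src == "Q" and tgt == "Q":
                    row[s] *= a_qq   # Control 1: strengthen Q -> Q
                if tgt == "Ex":
                    row[s] *= a_ctx  # Control 2: damp * -> Ex
            elif layer in HIGH_LAYERS:
                if src == "Q" and tgt == "In":
                    row[s] *= a_qin  # Control 3: strengthen Q -> In
    return out

regions = ["Q", "Q", "Ex", "In"]
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.25, 0.5, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]
low = rescale_attention(attn, regions, layer=3)
high = rescale_attention(attn, regions, layer=20)
```

In a real model, a function like this would be called from a forward hook on the attention pattern at the selected layers; removing the hook restores the unmodified behavior\.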

### 6\.2\.Results

![Refer to caption](https://arxiv.org/html/2605.14192v1/figure/control.png)

Figure 7\. Performance comparison on Mix\-MuSiQue\.

We now evaluate whether the proposed layer\-wise routing control improves answer faithfulness in the mixed\-context setting\. We focus on overall answer accuracy, which reflects whether the model ultimately produces the correct answer in the presence of both supporting and distracting evidence\.

#### 6\.2\.1\.Setup

Our experiments focus on a mixed\-context setting that we construct based on MuSiQue, which we refer to as Mix\-MuSiQue\. In this dataset, each question is paired with a retrieved context that intentionally contains both supporting and non\-supporting passages\. The final evaluation set consists of 667 questions under this mixed\-evidence condition\. Answers are generated using LLaMA\-3 8B Instruct\. We then evaluate the responses using an external LLM judge, Gemini\-2\.5\-Flash\-Lite\.

We compare two conditions: Before Control, the standard model without intervention, and After Control, where region\-aware attention reweighting is enabled in lower and higher layers\. All other decoding settings are kept identical to ensure a fair comparison\. For our method, we set $\alpha_{\text{QQ}} = 1.5$, which promotes stronger understanding of question semantics before substantial interaction with retrieved evidence\. We set $\alpha_{\text{ctx}} = 0.5$, mitigating premature over\-reliance on external information in early layers\. Finally, we set $\alpha_{\text{QIn}} = 1.5$, which biases later computation toward refining answer representations under continued guidance from the question\.

#### 6\.2\.2\.Intervention Results in the Mixed\-Context Setting

Figure [7](https://arxiv.org/html/2605.14192#S6.F7) reports performance on the Mix\-MuSiQue setting, where the retrieved context contains both supporting and non\-supporting information\. This mixed\-evidence scenario is particularly challenging because the model must distinguish useful evidence from distractors while maintaining alignment with the question\.

Without intervention, the baseline model attains 56\.5% accuracy\. Applying our attention control raises accuracy to 61\.6%, a gain of 5\.1 percentage points \(about 9% relative\)\. This gain indicates that the proposed intervention effectively reshapes the model’s internal information routing under mixed\-context conditions\.

The improvement suggests that errors in this setting are closely tied to how the model allocates attention across question and context tokens\. By strengthening question\-grounded routing and reducing early over\-reliance on retrieved context, the intervention helps the model better integrate relevant evidence while suppressing distractors, leading to more faithful answer generation\. In addition, we present a case study in Appendix [A\.2](https://arxiv.org/html/2605.14192#A1.SS2) due to space limitations\.

## 7\.Conclusion

We present a graph\-based perspective on why retrieval\-augmented generation can fail even when relevant evidence is available\. Attribution graphs reveal clear structural differences between success and failure: correct predictions show deeper, more distributed, and question\-constrained evidence integration, while errors exhibit shallow, context\-dominated routing\. These insights enable graph\-based error detection and targeted inference\-time interventions that reshape internal information flow\.

## References

- E\. Ameisen, J\. Lindsey, A\. Pearce, W\. Gurnee, N\. L\. Turner, B\. Chen, C\. Citro, D\. Abrahams, S\. Carter, B\. Hosmer, J\. Marcus, M\. Sklar, A\. Templeton, T\. Bricken, C\. McDougall, H\. Cunningham, T\. Henighan, A\. Jermyn, A\. Jones, A\. Persic, Z\. Qi, T\. Ben Thompson, S\. Zimmerman, K\. Rivoire, T\. Conerly, C\. Olah, and J\. Batson \(2025\)Circuit tracing: revealing computational graphs in language models\.Transformer Circuits Thread\.External Links:[Link](https://transformer-circuits.pub/2025/attribution-graphs/methods.html)Cited by:[§1](https://arxiv.org/html/2605.14192#S1.p3.1),[§3\.2](https://arxiv.org/html/2605.14192#S3.SS2.SSS0.Px2.p1.1)\.
- N\. Ampazis \(2024\)Improving rag quality for large language models with topic\-enhanced reranking\.InIFIP international conference on artificial intelligence applications and innovations,pp\. 74–87\.Cited by:[§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p2.1),[§2](https://arxiv.org/html/2605.14192#S2.p2.1)\.
- A\. Asai, Z\. Wu, Y\. Wang, A\. Sil, and H\. Hajishirzi \(2023\)Self\-rag: self\-reflective retrieval augmented generation\.InNeurIPS 2023 workshop on instruction tuning and instruction following,Cited by:[§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- O\. Ayala and P\. Bechard \(2024\)Reducing hallucination in structured outputs via retrieval\-augmented generation\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 6: Industry Track\),pp\. 228–238\.Cited by:[§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- C\. Chan, C\. Xu, R\. Yuan, H\. Luo, W\. Xue, Y\. Guo, and J\. Fu \(2024\)Rq\-rag: learning to refine queries for retrieval augmented generation\.arXiv preprint arXiv:2404\.00610\.Cited by:[§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p1.1),[§2](https://arxiv.org/html/2605.14192#S2.p2.1)\.
- J\. Chen, H\. Lin, X\. Han, and L\. Sun \(2024\)Benchmarking large language models in retrieval\-augmented generation\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 17754–17762\.Cited by:[§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- K\. Clark, U\. Khandelwal, O\. Levy, and C\. D\. Manning \(2019\)What does bert look at? an analysis of bert’s attention\.arXiv preprint arXiv:1906\.04341\.Cited by:[§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p4.1),[§2](https://arxiv.org/html/2605.14192#S2.p4.1)\.
- G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat, N\. Sachdeva, I\. Dhillon, M\. Blistein, O\. Ram, D\. Zhang, E\. Rosen,et al\.\(2025\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.arXiv preprint arXiv:2507\.06261\.Cited by:[§4\.1\.2](https://arxiv.org/html/2605.14192#S4.SS1.SSS2.Px1.p1.1)\.
- H\. Cunningham, A\. Ewart, L\. Riggs, R\. Huben, and L\. Sharkey \(2023\)Sparse autoencoders find highly interpretable features in language models\.arXiv preprint arXiv:2309\.08600\.Cited by:[§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p4.1),[§2](https://arxiv.org/html/2605.14192#S2.p4.1)\.
- X\. Dai, K\. Guo, C\. Lo, S\. Zeng, J\. Ding, D\. Luo, S\. Mukherjee, and J\. Tang \(2025\)GraphGhost: tracing structures behind large language models\.arXiv preprint arXiv:2510\.08613\.Cited by:[§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p5.1),[§2](https://arxiv.org/html/2605.14192#S2.p5.1)\.
- J\. Dong, B\. Fatemi, B\. Perozzi, L\. F\. Yang, and A\. Tsitsulin \(2024\)Don’t forget to connect\! improving rag with graph\-based reranking\.arXiv preprint arXiv:2405\.18414\.Cited by:[§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p2.1),[§2](https://arxiv.org/html/2605.14192#S2.p2.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan,et al\.\(2024\)The llama 3 herd of models\.arXiv e\-prints,pp\. arXiv–2407\.Cited by:[§4\.1\.2](https://arxiv.org/html/2605.14192#S4.SS1.SSS2.Px1.p1.1)\.
- J\. Dunefsky, P\. Chlenski, and N\. Nanda \(2024\)Transcoders find interpretable llm feature circuits\.Advances in Neural Information Processing Systems37,pp\. 24375–24410\.Cited by:[§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p4.1),[§2](https://arxiv.org/html/2605.14192#S2.p4.1),[§3\.2](https://arxiv.org/html/2605.14192#S3.SS2.SSS0.Px1.p1.2)\.
- D\. Edge, H\. Trinh, N\. Cheng, J\. Bradley, A\. Chao, A\. Mody, S\. Truitt, D\. Metropolitansky, R\. O\. Ness, and J\. Larson \(2024\)From local to global: a graph rag approach to query\-focused summarization\.arXiv preprint arXiv:2404\.16130\.Cited by:[§1](https://arxiv.org/html/2605.14192#S1.p2.1)\.
- N\. Elhage, N\. Nanda, C\. Olsson, T\. Henighan, N\. Joseph, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly, N\. DasSarma, D\. Drain, D\. Ganguli, Z\. Hatfield\-Dodds, D\. Hernandez, A\. Jones, J\. Kernion, L\. Lovitt, K\. Ndousse, D\. Amodei, T\. Brown, J\. Clark, J\. Kaplan, S\. McCandlish, and C\. Olah \(2021\)A mathematical framework for transformer circuits\.Transformer Circuits Thread\.Note:https://transformer\-circuits\.pub/2021/framework/index\.htmlCited by:[§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p4.1),[§2](https://arxiv.org/html/2605.14192#S2.p4.1)\.
- W\. Fan, Y\. Ding, L\. Ning, S\. Wang, H\. Li, D\. Yin, T\. Chua, and Q\. Li \(2024\)A survey on rag meeting llms: towards retrieval\-augmented large language models\.InProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining,pp\. 6491–6501\.Cited by:[§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p1.1),[§2](https://arxiv.org/html/2605.14192#S2.p2.1)\.
- J\. Ferrando, O\. Obeso, S\. Rajamanoharan, and N\. Nanda \(2024\)Do i know this entity? knowledge awareness and hallucinations in language models\.arXiv preprint arXiv:2411\.14257\.Cited by:[§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p4.1),[§2](https://arxiv.org/html/2605.14192#S2.p4.1)\.
- Y\. Gao, Y\. Xiong, X\. Gao, K\. Jia, J\. Pan, Y\. Bi, Y\. Dai, J\. Sun, H\. Wang, and H\. Wang \(2023\)Retrieval\-augmented generation for large language models: a survey\.arXiv preprint arXiv:2312\.10997 2\(1\)\.Cited by:[§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- K\. Guo, H\. Shomer, S\. Zeng, H\. Han, Y\. Wang, and J\. Tang \(2025\)Empowering graphrag with knowledge filtering and integration\.arXiv preprint arXiv:2503\.13804\.Cited by:[§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- S\. Gupta, R\. Ranjan, and S\. N\. Singh \(2024\)A comprehensive survey of retrieval\-augmented generation \(rag\): evolution, current landscape and future directions\.arXiv preprint arXiv:2410\.12837\.Cited by:[§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- H\. Han, Y\. Wang, H\. Shomer, K\. Guo, J\. Ding, Y\. Lei, M\. Halappanavar, R\. A\. Rossi, S\. Mukherjee, X\. Tang, et al\. \(2024\) Retrieval\-augmented generation with graphs \(graphrag\)\. arXiv preprint arXiv:2501\.00309\. Cited by: [§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- X\. Ho, A\. D\. Nguyen, S\. Sugawara, and A\. Aizawa \(2020\) Constructing a multi\-hop qa dataset for comprehensive evaluation of reasoning steps\. In Proceedings of the 28th International Conference on Computational Linguistics, pp\. 6609–6625\. Cited by: [§4\.1\.2](https://arxiv.org/html/2605.14192#S4.SS1.SSS2.Px1.p1.1)\.
- W\. Hu, W\. Zhang, Y\. Jiang, C\. J\. Zhang, X\. Wei, and L\. Qing \(2025\) Removal of hallucination on hallucination: debate\-augmented rag\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 15839–15853\. Cited by: [§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- D\. Lee, Y\. Jo, H\. Park, and M\. Lee \(2025\) Shifting from ranking to set selection for retrieval augmented generation\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 17606–17619\. Cited by: [§1](https://arxiv.org/html/2605.14192#S1.p2.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, et al\. \(2020\) Retrieval\-augmented generation for knowledge\-intensive nlp tasks\. Advances in neural information processing systems 33, pp\. 9459–9474\. Cited by: [§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- Z\. Liu, R\. A\. Amjad, R\. Adkathimar, T\. Wei, and H\. Tong \(2025\) SelfElicit: your language model secretly knows where is the relevant evidence\. arXiv preprint arXiv:2502\.08767\. Cited by: [§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p3.1), [§1](https://arxiv.org/html/2605.14192#S1.p2.1), [§2](https://arxiv.org/html/2605.14192#S2.p3.1)\.
- S\. Marks, C\. Rager, E\. J\. Michaud, Y\. Belinkov, D\. Bau, and A\. Mueller \(2024\) Sparse feature circuits: discovering and editing interpretable causal graphs in language models\. arXiv preprint arXiv:2403\.19647\. Cited by: [§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p4.1), [§2](https://arxiv.org/html/2605.14192#S2.p4.1)\.
- J\. Nian, Z\. Peng, Q\. Wang, and Y\. Fang \(2025\) W\-rag: weakly supervised dense retrieval in rag for open\-domain question answering\. In Proceedings of the 2025 International ACM SIGIR conference on innovative concepts and theories in information retrieval \(ICTIR\), pp\. 136–146\. Cited by: [§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p1.1), [§2](https://arxiv.org/html/2605.14192#S2.p2.1)\.
- C\. Niu, Y\. Wu, J\. Zhu, S\. Xu, K\. Shum, R\. Zhong, J\. Song, and T\. Zhang \(2024\) Ragtruth: a hallucination corpus for developing trustworthy retrieval\-augmented language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 10862–10878\. Cited by: [§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- G\. Paulo, A\. Mallen, C\. Juang, and N\. Belrose \(2024\) Automatically interpreting millions of features in large language models\. arXiv preprint arXiv:2410\.13928\. Cited by: [§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p4.1), [§2](https://arxiv.org/html/2605.14192#S2.p4.1)\.
- B\. Peng, Y\. Zhu, Y\. Liu, X\. Bo, H\. Shi, C\. Hong, Y\. Zhang, and S\. Tang \(2025\) Graph retrieval\-augmented generation: a survey\. ACM Transactions on Information Systems 44\(2\), pp\. 1–52\. Cited by: [§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- Z\. Shao, Y\. Gong, Y\. Shen, M\. Huang, N\. Duan, and W\. Chen \(2023\) Enhancing retrieval\-augmented large language models with iterative retrieval\-generation synergy\. arXiv preprint arXiv:2305\.15294\. Cited by: [§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- W\. Su, Y\. Tang, Q\. Ai, J\. Yan, C\. Wang, H\. Wang, Z\. Ye, Y\. Zhou, and Y\. Liu \(2025\) Parametric retrieval augmented generation\. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp\. 1240–1250\. Cited by: [§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- Y\. Tang and Y\. Yang \(2024\) Multihop\-rag: benchmarking retrieval\-augmented generation for multi\-hop queries\. arXiv preprint arXiv:2401\.15391\. Cited by: [§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p1.1), [§2](https://arxiv.org/html/2605.14192#S2.p2.1)\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2022\) MuSiQue: multihop questions via single\-hop question composition\. Transactions of the Association for Computational Linguistics 10, pp\. 539–554\. Cited by: [§4\.1\.2](https://arxiv.org/html/2605.14192#S4.SS1.SSS2.Px1.p1.1)\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2023\) Interleaving retrieval with chain\-of\-thought reasoning for knowledge\-intensive multi\-step questions\. In Proceedings of the 61st annual meeting of the association for computational linguistics \(volume 1: long papers\), pp\. 10014–10037\. Cited by: [§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p1.1), [§1](https://arxiv.org/html/2605.14192#S1.p2.1), [§2](https://arxiv.org/html/2605.14192#S2.p2.1)\.
- J\. Vig and Y\. Belinkov \(2019\) Analyzing the structure of attention in a transformer language model\. arXiv preprint arXiv:1906\.04284\. Cited by: [§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p4.1), [§2](https://arxiv.org/html/2605.14192#S2.p4.1)\.
- Z\. Wang, J\. Araki, Z\. Jiang, M\. R\. Parvez, and G\. Neubig \(2023\) Learning to filter context for retrieval\-augmented generation\. arXiv preprint arXiv:2311\.08377\. Cited by: [§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- W\. Wu, H\. Wang, B\. Li, P\. Huang, X\. Zhao, and L\. Liang \(2025\) Multirag: a knowledge\-guided framework for mitigating hallucination in multi\-source retrieval augmented generation\. In 2025 IEEE 41st International Conference on Data Engineering \(ICDE\), pp\. 3070–3083\. Cited by: [§1](https://arxiv.org/html/2605.14192#S1.p2.1)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\) HotpotQA: a dataset for diverse, explainable multi\-hop question answering\. In Proceedings of the 2018 conference on empirical methods in natural language processing, pp\. 2369–2380\. Cited by: [§4\.1\.2](https://arxiv.org/html/2605.14192#S4.SS1.SSS2.Px1.p1.1)\.
- Y\. Yu, W\. Ping, Z\. Liu, B\. Wang, J\. You, C\. Zhang, M\. Shoeybi, and B\. Catanzaro \(2024\) Rankrag: unifying context ranking with retrieval\-augmented generation in llms\. Advances in Neural Information Processing Systems 37, pp\. 121156–121184\. Cited by: [§1](https://arxiv.org/html/2605.14192#S1.p2.1)\.
- S\. Zeng, J\. Zhang, B\. Li, Y\. Lin, T\. Zheng, D\. Everaert, H\. Lu, H\. Liu, Y\. Xing, M\. X\. Cheng, et al\. \(2025\) Towards knowledge checking in retrieval\-augmented generation: a representation perspective\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pp\. 2952–2969\. Cited by: [§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p3.1), [§1](https://arxiv.org/html/2605.14192#S1.p2.1), [§2](https://arxiv.org/html/2605.14192#S2.p3.1)\.
- P\. Zhao, H\. Zhang, Q\. Yu, Z\. Wang, Y\. Geng, F\. Fu, L\. Yang, W\. Zhang, J\. Jiang, and B\. Cui \(2026\) Retrieval\-augmented generation for ai\-generated content: a survey\. Data Science and Engineering, pp\. 1–29\. Cited by: [§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p1.1), [§2](https://arxiv.org/html/2605.14192#S2.p2.1)\.
- Z\. Zhao, Y\. Koishekenov, X\. Yang, N\. Murray, and N\. Cancedda \(2025\) Verifying chain\-of\-thought reasoning via its computational graph\. arXiv preprint arXiv:2510\.09312\. Cited by: [§A\.1](https://arxiv.org/html/2605.14192#A1.SS1.p5.1), [§2](https://arxiv.org/html/2605.14192#S2.p5.1)\.
- X\. Zheng, Z\. Weng, Y\. Lyu, L\. Jiang, H\. Xue, B\. Ren, D\. Paudel, N\. Sebe, L\. Van Gool, and X\. Hu \(2025\) Retrieval augmented generation and understanding in vision: a survey and new outlook\. arXiv preprint arXiv:2503\.18016\. Cited by: [§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.
- Y\. Zhou, Y\. Liu, X\. Li, J\. Jin, H\. Qian, Z\. Liu, C\. Li, Z\. Dou, T\. Ho, and P\. S\. Yu \(2024\) Trustworthiness in retrieval\-augmented generation systems: a survey\. arXiv preprint arXiv:2409\.10102\. Cited by: [§1](https://arxiv.org/html/2605.14192#S1.p1.1)\.

## Appendix A

### A\.1\. Related Work

Retrieval\-Augmented Generation\. Retrieval\-Augmented Generation \(RAG\) has become a widely adopted paradigm for improving the factuality and reasoning capabilities of large language models by grounding generation in external knowledge sources \(Zhao et al\., [2026](https://arxiv.org/html/2605.14192#bib.bib28); Fan et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib29)\)\. Prior work has explored a broad range of retrieval strategies, including dense and hybrid retrievers \(Nian et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib30)\), multi\-hop retrieval \(Tang and Yang, [2024](https://arxiv.org/html/2605.14192#bib.bib31)\), iterative retrieval–generation loops \(Trivedi et al\., [2023](https://arxiv.org/html/2605.14192#bib.bib22)\), and query reformulation \(Chan et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib32)\)\. These efforts have demonstrated that retrieval quality and evidence coverage play a critical role in downstream performance, and have led to significant gains on question answering and knowledge\-intensive benchmarks\.

More recent studies have focused on improving RAG robustness through better context selection, reranking, compression, or prompt engineering \(Dong et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib33); Ampazis, [2024](https://arxiv.org/html/2605.14192#bib.bib34)\)\. While these methods reduce failures at the system level, they primarily treat the language model as a black box and do not address how retrieved evidence is internally processed once injected into the context\. As a result, they offer limited insight into why errors persist even when relevant evidence is successfully retrieved\.

In the context of RAG, several works have proposed methods to assess faithfulness, evidence usage, and citation correctness \(Zeng et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib19); Liu et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib54)\)\. However, these approaches typically rely on representations from a single layer and provide only a largely static view of the model’s internal state, lacking a comprehensive and dynamic perspective for analyzing how evidence is integrated during computation\.

Interpretability and Circuit Analysis of LLMs\. Parallel to advances in RAG, a growing body of work has investigated the internal mechanisms of transformer models using interpretability techniques such as attention analysis \(Clark et al\., [2019](https://arxiv.org/html/2605.14192#bib.bib37); Vig and Belinkov, [2019](https://arxiv.org/html/2605.14192#bib.bib38)\), Sparse Autoencoders \(SAEs\) \(Cunningham et al\., [2023](https://arxiv.org/html/2605.14192#bib.bib39)\), transcoders \(Dunefsky et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib36)\), and circuit tracing \(Elhage et al\., [2021](https://arxiv.org/html/2605.14192#bib.bib40)\)\. These studies suggest that specific reasoning behaviors can often be traced to localized subnetworks or circuits spanning multiple layers and attention heads\. A common theme across these methods is the decomposition of dense neural representations into more interpretable feature spaces \(Paulo et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib41); Ferrando et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib42)\), which in turn enables the construction of attribution graphs that model the causal flow of information within the network \(Marks et al\., [2024](https://arxiv.org/html/2605.14192#bib.bib43)\)\.

While circuit\-level analyses have provided valuable insights into question answering \(Dai et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib57); Zhao et al\., [2025](https://arxiv.org/html/2605.14192#bib.bib58)\) by verifying chain\-of\-thought reasoning through computational graphs, most prior work has been limited to standalone language models without retrieval\. As a result, the interaction between external evidence and internal circuits in RAG settings remains insufficiently understood\.

Positioning of Our Work\. Our work bridges these lines of research by providing a mechanistic, circuit\-level analysis of RAG systems\. Unlike prior RAG studies that emphasize retrieval quality or prompt design, we focus on how retrieved evidence propagates through internal model circuits\. By introducing attribution graphs, we move beyond scalar attribution scores and enable a structural comparison of information flow in faithful and hallucinated generations\. This perspective allows us to identify integration failures inside the model as a key source of RAG failures, complementing and extending existing system\-level analyses\.
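To make the contrast between scalar attribution scores and structural analysis concrete, the following is a minimal sketch of two topology features computed on a toy attribution graph: the longest evidence\-to\-answer path \(a proxy for reasoning depth\) and the entropy of attribution mass over evidence nodes \(a proxy for how distributed the evidence flow is\)\. The function name `graph_features`, the edge\-list format, and the two features are illustrative assumptions, not the feature set or implementation used in the paper\. The graph is assumed to be acyclic\.

```python
import math
from collections import defaultdict


def graph_features(edges, sources, target):
    """Toy topology features for an attribution graph (illustrative only).

    edges:   list of (src, dst, weight) tuples forming a DAG (assumption)
    sources: node ids standing for retrieved-evidence tokens
    target:  node id standing for the generated answer token
    """
    succ = defaultdict(list)
    for u, v, w in edges:
        succ[u].append((v, abs(w)))

    memo = {}

    def depth(node):
        # Longest path length (in edges) from node to target,
        # or -1 if the target cannot be reached from node.
        if node == target:
            return 0
        if node in memo:
            return memo[node]
        best = -1
        for nxt, _ in succ[node]:
            d = depth(nxt)
            if d >= 0:
                best = max(best, d + 1)
        memo[node] = best
        return best

    max_depth = max((depth(s) for s in sources), default=-1)

    # Shannon entropy of outgoing attribution mass across evidence nodes:
    # higher means the flow is spread over more evidence.
    mass = [sum(w for _, w in succ[s]) for s in sources]
    total = sum(mass)
    entropy = 0.0
    if total > 0:
        entropy = sum(-(m / total) * math.log(m / total) for m in mass if m > 0)

    return {"max_depth": max_depth, "evidence_entropy": entropy}


# Hypothetical graphs: a deeper, distributed one and a shallow, concentrated one.
deep = [("e1", "f1", 0.5), ("e2", "f1", 0.5), ("f1", "f2", 1.0), ("f2", "ans", 1.0)]
shallow = [("e1", "ans", 1.0)]
deep_feats = graph_features(deep, ["e1", "e2"], "ans")
shallow_feats = graph_features(shallow, ["e1", "e2"], "ans")
```

In this toy setting the first graph yields a larger depth and a higher evidence entropy than the second, mirroring in a simplified way the qualitative contrast the paper reports between correct and failed predictions\. Features of this kind could then be passed to any standard classifier for error detection\.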

### A\.2\. Case Study

Figure [11](https://arxiv.org/html/2605.14192#A1.F11) provides representative examples illustrating how the intervention improves reasoning under mixed\-context conditions\. In both examples, the baseline model fails to complete the full reasoning chain: in one case it stops at a salient intermediate entity without performing the final role mapping, and in the other it fails to bridge a geographic clue to the relevant historical event\. After attention control is applied, the model follows a more complete reasoning path, successfully connects intermediate evidence, and produces the correct final answer\.

These qualitative examples show that the intervention does not merely alter surface\-level outputs\. Instead, it changes how the model integrates evidence across reasoning steps, improving its ability to link intermediate clues to the final answer\.

![Refer to caption](https://arxiv.org/html/2605.14192v1/figure/radar_qwen.png)

Figure 8\. Radar comparison of attribution\-graph structural metrics between correct and incorrect predictions across three QA datasets \(2Wiki, HotpotQA, and MuSiQue\)\.

![Refer to caption](https://arxiv.org/html/2605.14192v1/figure/hotpotqa_layer.png)

Figure 9\. Layer\-wise attribution mass for correct and wrong predictions \(left\) and their difference on HotpotQA \(right\)\.

![Refer to caption](https://arxiv.org/html/2605.14192v1/figure/2wiki_layer.png)

Figure 10\. Layer\-wise attribution mass for correct and wrong predictions \(left\) and their difference on 2Wiki \(right\)\.

Figure 11\. Examples where routing control improves multi\-hop reasoning\. In both cases, the baseline model either stops at an intermediate entity or fails to bridge a geographic clue to a historical event\. After control, the model follows a more complete reasoning path and produces the correct answer\.
