Large Vision-Language Models Get Lost in Attention

arXiv cs.AI 05/08/26, 04:00 AM Papers
Summary
This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.
arXiv:2605.05668v1 Announce Type: new Abstract: Despite the rapid evolution of training paradigms, the decoder backbone of large vision–language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively “get lost in attention” rather than efficiently leveraging visual context.
Cached at: 05/08/26, 08:30 AM
# Large Vision–Language Models Get Lost in Attention
Source: [https://arxiv.org/html/2605.05668](https://arxiv.org/html/2605.05668)
Ye Tian, Mengyu Yang, Huahui Yi, Liang Lin, Xiaoshuai Hao, Kun Wang, Wendong Wang

###### Abstract

Despite the rapid evolution of training paradigms, the decoder backbone of large vision–language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively “get lost in attention” rather than efficiently leveraging visual context. Our code is publicly available at [this link](https://github.com/Lrbomchz/vlms_lost_in_attn).

Machine Learning, ICML

## 1 Introduction

Large vision–language models (LVLMs) have rapidly evolved from large language models (LLMs) by extending Transformer-based sequence modeling to jointly process natural language and visual signals (Vaswani et al., [2017](https://arxiv.org/html/2605.05668#bib.bib1)). Early vision–language representation learning (e.g., contrastive pretraining) established strong image–text alignment that later LVLMs could leverage as a visual grounding interface (Radford et al., [2021](https://arxiv.org/html/2605.05668#bib.bib2)). Subsequent LVLMs increasingly unify pretrained vision encoders with LLM backbones, enabling few-shot multimodal generalization and instruction-following behavior at scale (Alayrac et al., [2022](https://arxiv.org/html/2605.05668#bib.bib3); Li et al., [2023a](https://arxiv.org/html/2605.05668#bib.bib4); Liu et al., [2023](https://arxiv.org/html/2605.05668#bib.bib5); Hao et al., [2025](https://arxiv.org/html/2605.05668#bib.bib101)). In parallel, reasoning-oriented paradigms have further endowed these models with improved deliberation and problem-solving behaviors (Wei et al., [2022](https://arxiv.org/html/2605.05668#bib.bib7); Jaech et al., [2024](https://arxiv.org/html/2605.05668#bib.bib66); Guo et al., [2025](https://arxiv.org/html/2605.05668#bib.bib6); Zhang et al., [2025b](https://arxiv.org/html/2605.05668#bib.bib103); Tan et al., [2025](https://arxiv.org/html/2605.05668#bib.bib104)). Despite the fast pace of architectural and training innovations, the dominant LVLM family remains fundamentally grounded in the Transformer architecture (Vaswani et al., [2017](https://arxiv.org/html/2605.05668#bib.bib1)).

From an interpretability standpoint, the standard Transformer layer is composed of two core submodules, namely multi-head self-attention and a feed-forward network (FFN), and each submodule is wrapped by residual connections, so that every submodule produces an additive update that is written back into a shared residual stream representation (Vaswani et al., [2017](https://arxiv.org/html/2605.05668#bib.bib1); Elhage et al., [2021](https://arxiv.org/html/2605.05668#bib.bib8); Skean et al., [2025](https://arxiv.org/html/2605.05668#bib.bib15)). A common working hypothesis is that attention blocks are the primary substrate for in-context reasoning, implementing context-dependent algorithms such as induction/copy-based mechanisms (Olsson et al., [2022](https://arxiv.org/html/2605.05668#bib.bib9)). In contrast, FFNs are often characterized as storing and retrieving distributional associations, behaving like key–value memories whose activated patterns can induce next-token distributions that resemble shallow n-gram continuations (Geva et al., [2021](https://arxiv.org/html/2605.05668#bib.bib10); Edelman et al., [2024](https://arxiv.org/html/2605.05668#bib.bib11)).

To probe this modularity hypothesis, attention interpretability work has largely taken a statistical perspective that treats attention-related signals as measurable proxies and attributes function via empirical distributions (Zhou et al., [2024](https://arxiv.org/html/2605.05668#bib.bib16); Kahardipraja et al., [2025](https://arxiv.org/html/2605.05668#bib.bib17)), correlations (Jain and Wallace, [2019](https://arxiv.org/html/2605.05668#bib.bib12); Abnar and Zuidema, [2020](https://arxiv.org/html/2605.05668#bib.bib98)), and controlled interventions (Serrano and Smith, [2019](https://arxiv.org/html/2605.05668#bib.bib13); Nam et al., [2025](https://arxiv.org/html/2605.05668#bib.bib18)). More recently, this statistical toolkit has been extended to visual attention in LVLM decoders, where attention links text to visual tokens. Empirical analyses reveal systematic phenomena such as *visual attention sink* (Kang et al., [2025](https://arxiv.org/html/2605.05668#bib.bib27)) and *visual attention drift* (Liu et al., [2025](https://arxiv.org/html/2605.05668#bib.bib28); Guan et al., [2026](https://arxiv.org/html/2605.05668#bib.bib102)), which together indicate that models often under-allocate attention to truly informative visual evidence. Despite these advances, LVLM module-level interpretability still lacks a unifying information-theoretic and geometric framework that can characterize, and explicitly contrast, how different submodules contribute to representation structure in multimodal settings. In contrast, the representation analysis literature for LLMs already uses such lenses to evaluate representation quality across depth (Razzhigaev et al., [2024](https://arxiv.org/html/2605.05668#bib.bib30); Wei et al., [2024](https://arxiv.org/html/2605.05668#bib.bib14)) and to study joint dynamics (Skean et al., [2025](https://arxiv.org/html/2605.05668#bib.bib15); Tian et al., [2023](https://arxiv.org/html/2605.05668#bib.bib29)). This gap motivates bringing these principled lenses into LVLM analysis to address the missing perspective and enable module-specific and modality-grounded comparisons.

To bridge this theoretical gap, we present a unified framework grounded in information theory and differential geometry to *quantify and contrast module-level functional contributions* in LVLM residual-stream computation. By adopting the manifold hypothesis (Bengio et al., [2013](https://arxiv.org/html/2605.05668#bib.bib42)) for representation space, we introduce two complementary metrics: Representation Information Discrepancy (RID) and Mixing Information Gain (MixIG). These metrics decompose the contribution of residual updates into two distinct geometric effects: innovation, which quantifies external information injection that expands the semantic subspace or alters spectral complexity, and reconfiguration, which measures the entropic redistribution of information within the existing support. We conduct extensive experiments across 15 state-of-the-art LVLMs spanning three dominant architectures on a broad suite of multimodal benchmarks. Our analysis reveals two insights. First, we quantitatively validate a sharp functional decoupling in Transformer residual stream computation: attention primarily performs entropic *reconfiguration* that preserves the existing representation support, whereas FFNs dominate *innovation* by introducing new semantic directions. Second, building on this division of labor, we diagnose a systemic pathology in current LVLMs: decoder visual attention often fails to perform meaningful mixing over question-relevant visual evidence, and instead exhibits substantial redundancy, frequently getting lost in interaction patterns with limited contribution to informative updates.

Our main contributions are summarized as follows:

- Theoretical Framework: We propose a rigorous formalism based on the manifold hypothesis to define representational information. We introduce RID and MixIG as dual metrics to quantify the geometric and entropic impact of residual updates, offering a generalized tool for probing representation dynamics.
- Module-level Interpretability: We provide a quantitative explanation of the distinct roles within Transformer blocks. We demonstrate that Attention and FFNs operate in orthogonal regimes, reconfiguration versus innovation, thereby substantiating the modularity hypothesis with geometric evidence.
- Empirical Diagnostics: We uncover critical inefficiencies in LVLM designs. Our results highlight that despite architectural scaling, current models suffer from severe informational redundancy in visual processing, suggesting that the integration of visual tokens is often computationally expensive yet informationally sparse.

## 2 Related work

Interpretability of LLMs. A large body of work studies what information is encoded in LLM representations and where it appears in the network (Belinkov and Glass, [2019](https://arxiv.org/html/2605.05668#bib.bib20)). Early work uses lightweight linear probes on intermediate hidden states (Conneau et al., [2018](https://arxiv.org/html/2605.05668#bib.bib21); Hewitt and Manning, [2019](https://arxiv.org/html/2605.05668#bib.bib22); Belrose et al., [2023](https://arxiv.org/html/2605.05668#bib.bib23)). Subsequent decoding-based efforts, such as the tuned lens, map hidden states to vocabulary distributions (Belrose et al., [2023](https://arxiv.org/html/2605.05668#bib.bib23)). Alongside probing and decoding, sparse feature learning approaches, including transcoders (Dunefsky et al., [2024](https://arxiv.org/html/2605.05668#bib.bib24)) and sparse autoencoders (Cunningham et al., [2023](https://arxiv.org/html/2605.05668#bib.bib32)), map representations into a sparse and more discrete feature space (Ameisen et al., [2025](https://arxiv.org/html/2605.05668#bib.bib31)).

![Refer to caption](https://arxiv.org/html/2605.05668v1/x1.png)

Figure 1: Overview of Our Interpretability Framework: (a) the LVLM residual stream; (b) representation information in $\mathbf{X}$, where SVD yields Spectrum $\mathcal{S}_{\mathbf{X}}$ and semantic support $\mathcal{D}_{\mathbf{X}}$; (c) update-level effects of $\Delta\mathbf{X}$, quantified by RID for innovation and MixIG for reconfiguration; (d) layer-wise functional decomposition, revealing an orthogonal division of labor where attention behaves as a subspace-preserving operator and FFNs act as subspace-expanding operators.

Module Interpretability. Module interpretability asks whether internal Transformer components provide meaningful explanations of model behavior. For attention, foundational studies show that raw attention weights can be an unreliable attribution signal (Jain and Wallace, [2019](https://arxiv.org/html/2605.05668#bib.bib12); Serrano and Smith, [2019](https://arxiv.org/html/2605.05668#bib.bib13); Wiegreffe and Pinter, [2019](https://arxiv.org/html/2605.05668#bib.bib25)). To better capture how attention-mediated influence accumulates, attention rollout and attention flow estimate propagation across layers (Abnar and Zuidema, [2020](https://arxiv.org/html/2605.05668#bib.bib98); Kim et al., [2025](https://arxiv.org/html/2605.05668#bib.bib33)). More recent work moves beyond token-level importance to head-level functionality by combining dataset-grounded attribution with causal validation (Nam et al., [2025](https://arxiv.org/html/2605.05668#bib.bib18); Kahardipraja et al., [2025](https://arxiv.org/html/2605.05668#bib.bib17); Zhou et al., [2024](https://arxiv.org/html/2605.05668#bib.bib16); Du et al., [2025](https://arxiv.org/html/2605.05668#bib.bib34)). Complementarily, parameter-based approaches infer head functionality without per-prompt inference traces (Elhelo and Geva, [2025](https://arxiv.org/html/2605.05668#bib.bib19)). In parallel, module-oriented analyses show that feed-forward layers can act as key–value memories (Geva et al., [2021](https://arxiv.org/html/2605.05668#bib.bib10); Qiu et al., [2024](https://arxiv.org/html/2605.05668#bib.bib67)). By contrast, our work provides a unified information-theoretic and geometric framework that quantifies how different residual-stream updates contribute via innovation versus reconfiguration, enabling direct, module-wise comparison beyond attribution alone.

Information theory in LLM interpretability. Information-theoretic views frame interpretability in terms of information preservation, compression, and redundancy in representations. One line focuses on representation quality evaluation, using information- and geometry-motivated measures such as entropy and rank-based quantities to assess whether embeddings preserve task-relevant structure (Agrawal et al., [2022](https://arxiv.org/html/2605.05668#bib.bib35); Deb and Ogunfunmi, [2025](https://arxiv.org/html/2605.05668#bib.bib26); Li et al., [2025](https://arxiv.org/html/2605.05668#bib.bib80)). A second line uses these measures for layerwise analyses, aiming to characterize how representational properties change across the network (Skean et al., [2025](https://arxiv.org/html/2605.05668#bib.bib15); Ali et al., [2025](https://arxiv.org/html/2605.05668#bib.bib36)). A third line emphasizes compression and redundancy reduction as a model-level capability that can correlate with performance and scaling trends (Wei et al., [2024](https://arxiv.org/html/2605.05668#bib.bib14); Yu et al., [2024](https://arxiv.org/html/2605.05668#bib.bib37); Havrilla and Liao, [2024](https://arxiv.org/html/2605.05668#bib.bib38)). However, existing information-theoretic work rarely provides *module-level interpretability* of the modules themselves (Lai et al., [2021](https://arxiv.org/html/2605.05668#bib.bib39)), especially in the LVLM setting.

Overall, we connect *module-level residual-stream updates* in LVLMs to information theory and geometry by *operationalizing* each update as an observable innovation–reconfiguration decomposition on representations. This framework turns prior statistically grounded module-level functional attributions into measurable information-flow statements, and it reveals that attention scores in current LVLMs contain substantial redundancy. Specifically, we replace part of the learned attention scores with random noise and find that model performance is largely preserved, even though this scoring step is a major computational bottleneck in standard self-attention, whose cost scales quadratically with sequence length.

## 3 A Unified Interpretability Framework for the Residual Stream

In this section, we first introduce the notation and research questions in Sec. [3.1](https://arxiv.org/html/2605.05668#S3.SS1). We then formalize representation information from an information-theoretic and geometric perspective in Sec. [3.2](https://arxiv.org/html/2605.05668#S3.SS2). Finally, in Sec. [3.3](https://arxiv.org/html/2605.05668#S3.SS3), we develop quantitative metrics for evaluating residual-stream updates.

### 3.1 Preliminaries

#### 3.1.1 Motivation and Notation

Consider an input $\mathcal{I}$, for example, a sequence of visual and language tokens. A multi-module neural network maps $\mathcal{I}$ to a hidden-state matrix $\mathbf{X}\in\mathbb{R}^{S\times H}$, where $S$ is the token length and $H$ is the hidden dimension. Throughout the forward pass, the representation is updated via residual connections. At each step, a module produces an additive update $\Delta\mathbf{X}$, yielding

$$\mathbf{X}_{\text{new}}=\mathbf{X}_{\text{old}}+\Delta\mathbf{X}.$$

This residual-update view raises three progressively refined questions:

1. RQ1: How should we quantify the information contained in a representation $\mathbf{X}$?
2. RQ2: How should we quantify what $\Delta\mathbf{X}$ contributes to $\mathbf{X}$?
3. RQ3: How can we use $\Delta\mathbf{X}$ to analyze and contrast the functional roles of different modules?

Answering these questions provides a principled foundation for tracking information flow across layers, characterizing when module updates are informative versus redundant, and understanding how different modules shape multimodal representations during inference. For clarity, we summarize the notation used throughout the paper in Appendix Table [3](https://arxiv.org/html/2605.05668#A1.T3).

#### 3.1.2 Residual Stream and Attention in LVLMs

We next specify the LVLM setting and introduce the residual-stream view of Transformer decoding.

Large Vision–Language Models (LVLMs). We consider decoder-style LVLMs that process multimodal inputs by converting them into a single token sequence. Concretely, an image is encoded by a visual encoder and mapped through a modality projector into a sequence of visual tokens $\mathbf{X}^{(v)}\in\mathbb{R}^{S_{v}\times H}$. Textual inputs are tokenized into system and user tokens $\mathbf{X}^{(s)}\in\mathbb{R}^{S_{s}\times H}$ and $\mathbf{X}^{(q)}\in\mathbb{R}^{S_{q}\times H}$. We denote the concatenated input sequence by

$$\mathbf{X}^{(c)}=\big[\mathbf{X}^{(s)},\,\mathbf{X}^{(v)},\,\mathbf{X}^{(q)}\big]\in\mathbb{R}^{S_{c}\times H},\quad S_{c}=S_{s}+S_{v}+S_{q}.$$

At decoding step $t$, the model generates an output token $y_{t}$ from

$$p(y_{t}\mid\mathbf{X}^{(c)},\mathbf{y}_{<t}),\quad\mathbf{y}_{<t}=\{y_{i}\}_{i=1}^{t-1},$$

where $\mathbf{y}_{<t}$ determines the autoregressive context and $\mathbf{X}^{(c)}$ provides the multimodal conditioning.

Attention in LVLM decoders. Let the decoder have $L$ Transformer layers. At each layer $l$ and decoding step $t$, causal multi-head attention produces a normalized distribution over the *available* tokens, i.e., the concatenation of $S_{c}$ context tokens (system, visual, and question tokens) and the $(t-1)$ previously generated tokens. We denote the total attention domain size by $S_{t}=S_{c}+(t-1)$. The attention distribution at step $t$ is $\mathbf{a}^{l}_{t}\in[0,1]^{S_{t}}$ with $\sum_{i=1}^{S_{t}}a^{l}_{t,i}=1$. Concretely, letting $\mathbf{q}^{l}_{t}\in\mathbb{R}^{d_{k}}$ be the query at step $t$ and $\mathbf{K}^{l}_{t}\in\mathbb{R}^{S_{t}\times d_{k}}$ be the key matrix formed from all available tokens up to step $t$ at layer $l$, we write

$$\mathbf{a}^{l}_{t}=\mathrm{softmax}\!\left(\frac{\mathbf{K}^{l}_{t}\mathbf{q}^{l}_{t}}{\sqrt{d_{k}}}\right),\qquad\mathbf{a}^{l}_{t}\in[0,1]^{S_{t}},$$

which records, for each decoding step and layer, how the decoder allocates attention over *available* tokens.
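As a concrete illustration of the attention distribution above, the following NumPy sketch computes $\mathbf{a}^{l}_{t}$ for a single head at one decoding step. The function name `attention_distribution`, the toy sizes, and the random keys and query are assumptions made for illustration; they are not taken from the paper's released code.

```python
import numpy as np

def attention_distribution(K, q):
    """Softmax attention of one query over all available keys.

    K: (S_t, d_k) key matrix, q: (d_k,) query vector.
    Returns a length-S_t vector of non-negative weights that sum to 1.
    """
    d_k = K.shape[1]
    scores = K @ q / np.sqrt(d_k)
    scores = scores - scores.max()      # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

rng = np.random.default_rng(0)
S_c, t, d_k = 6, 3, 4                   # hypothetical context length, step, head dim
S_t = S_c + (t - 1)                     # context tokens plus previously generated tokens
K = rng.standard_normal((S_t, d_k))
q = rng.standard_normal(d_k)
a = attention_distribution(K, q)
```

Subtracting the maximum score before exponentiating leaves the softmax unchanged and only guards against overflow.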

Residual Stream. Following the mathematical interpretation of the residual stream in Elhage et al. ([2021](https://arxiv.org/html/2605.05668#bib.bib8)), we view the layerwise hidden states as a residual stream that evolves via additive updates from each module. In our notation, the representation matrix at layer $l$ satisfies

$$\mathbf{X}^{l+1}_{\mathrm{in}}=\mathbf{X}^{l}_{\mathrm{in}}+\Delta\mathbf{X}^{l}_{\mathrm{attn}}+\Delta\mathbf{X}^{l}_{\mathrm{ffn}},\qquad\mathbf{X}^{l}\in\mathbb{R}^{S\times H}.$$

#### 3.1.3 Theoretical Foundations

In this subsection, we introduce our foundational assumptions and the mathematical tools used to characterize a representation matrix $\mathbf{X}\in\mathbb{R}^{S\times H}$.

###### Assumption 3.1 (Manifold hypothesis (Bengio et al., [2013](https://arxiv.org/html/2605.05668#bib.bib42))).

Learned representations often concentrate near a low-dimensional structure embedded in a high-dimensional ambient space. This assumption motivates using low-rank spectral structure as a meaningful proxy for the effective degrees of freedom of $\mathbf{X}$. It also underpins a growing body of representation-centric studies in modern deep models (Wang et al., [2024a](https://arxiv.org/html/2605.05668#bib.bib73); Basile et al., [2024](https://arxiv.org/html/2605.05668#bib.bib75); Gardinazzi et al., [2025](https://arxiv.org/html/2605.05668#bib.bib76); Nishi et al., [2025](https://arxiv.org/html/2605.05668#bib.bib77)).

###### Definition 3.2 (Frobenius norm).

For $\mathbf{X}\in\mathbb{R}^{S\times H}$ with singular values $\sigma_{1},\ldots,\sigma_{Q}$, $Q=\min\{S,H\}$,

$$\|\mathbf{X}\|_{F}=\Big(\sum_{s=1}^{S}\sum_{h=1}^{H}\mathbf{X}_{s,h}^{2}\Big)^{\frac{1}{2}}=\sqrt{\mathrm{tr}(\mathbf{X}^{\top}\mathbf{X})}=\Big(\sum_{i=1}^{Q}\sigma_{i}^{2}\Big)^{\frac{1}{2}}.$$

It measures the total energy of $\mathbf{X}$ in the ambient space.
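The three expressions in Definition 3.2 can be checked numerically. The sketch below uses a small random matrix (an arbitrary choice for illustration) and evaluates the entrywise, trace, and singular-value forms of the Frobenius norm with standard NumPy calls.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))               # toy S x H matrix

entrywise = np.sqrt((X ** 2).sum())           # sum of squared entries
via_trace = np.sqrt(np.trace(X.T @ X))        # sqrt(tr(X^T X))
sigma = np.linalg.svd(X, compute_uv=False)    # singular values only
via_spectrum = np.sqrt((sigma ** 2).sum())    # sum of squared singular values
builtin = np.linalg.norm(X)                   # Frobenius norm is the default for 2-D input
```

All four values agree up to floating-point error.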

### 3.2 Geometric Characterization of Representation Information on Matrix Manifolds (RQ1)

In what follows, we progressively answer the three research questions posed in Section [3.1.1](https://arxiv.org/html/2605.05668#S3.SS1.SSS1). RQ1 asks: *How should we quantify the information contained in a representation $\mathbf{X}$?* To quantify the information in $\mathbf{X}$, we adopt a geometric perspective based on the fixed-rank matrix manifold.

From differential geometry, the set of matrices with rank $r$,

$$\mathcal{M}_{r}=\{\mathbf{X}\in\mathbb{R}^{S\times H}:\operatorname{rank}(\mathbf{X})=r\},$$

admits a smooth Riemannian manifold structure (as an embedded submanifold in the ambient Euclidean space of matrices) (Vandereycken, [2013](https://arxiv.org/html/2605.05668#bib.bib47)). For any $\mathbf{X}\in\mathcal{M}_{r}$, a compact singular value decomposition parameterizes $\mathbf{X}$ as $\mathbf{X}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top}$ with $r$ positive singular values:

###### Definition 3.3 (Singular Value Decomposition (Golub and Van Loan, [2013](https://arxiv.org/html/2605.05668#bib.bib40))).

For any $\mathbf{X}\in\mathbb{R}^{S\times H}$, let $Q=\min\{S,H\}$. The SVD of $\mathbf{X}$ is

$$\mathbf{X}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top}=\sum_{i=1}^{Q}\sigma_{i}\mathbf{u}_{i}\mathbf{v}_{i}^{\top},$$

where $\mathbf{U}\in\mathbb{R}^{S\times Q}$ and $\mathbf{V}\in\mathbb{R}^{H\times Q}$ have orthonormal columns, $\mathbf{\Sigma}=\mathrm{diag}(\sigma_{1},\ldots,\sigma_{Q})$ with $\sigma_{1}\geq\cdots\geq\sigma_{Q}\geq 0$, and $(\mathbf{u}_{i},\mathbf{v}_{i})$ are the left and right singular vectors.

Under this parameterization, $\mathbf{X}$ is described by three geometric objects:

- Left singular subspace $\mathcal{C}(\mathbf{X})=\operatorname{span}(\mathbf{U})\in\operatorname{Gr}(r,S)$, capturing association structure in the token space;
- Right singular subspace $\mathcal{R}(\mathbf{X})=\operatorname{span}(\mathbf{V})\in\operatorname{Gr}(r,H)$, capturing semantic directions in the feature space;
- Singular spectrum $\mathbf{\Sigma}\in\mathbb{R}_{+}^{r}$, capturing the energy distribution across principal directions.

Here $\operatorname{Gr}(r,n)$ denotes the Grassmann manifold, the set of all $r$-dimensional linear subspaces of $\mathbb{R}^{n}$ (Absil et al., [2008](https://arxiv.org/html/2605.05668#bib.bib48)).
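The three objects listed above can be read off a thin SVD. The following sketch builds a toy rank-2 matrix and extracts them with `numpy.linalg.svd`; the sizes, the random data, and the relative tolerance used to decide the numerical rank are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
S, H, r = 6, 4, 2                       # hypothetical token length, hidden size, rank
X = rng.standard_normal((S, r)) @ rng.standard_normal((r, H))  # rank r by construction

# Thin SVD: U is (S, Q) and Vt is (Q, H), with Q = min(S, H).
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Numerical rank: singular values above a relative tolerance (assumed threshold).
rank = int((sigma > 1e-10 * sigma[0]).sum())

U_r = U[:, :rank]         # orthonormal basis of the left singular subspace C(X)
V_r = Vt[:rank].T         # orthonormal basis of the right singular subspace R(X)
spectrum = sigma[:rank]   # the positive singular values

X_rebuilt = U_r @ np.diag(spectrum) @ V_r.T   # compact SVD reproduces X
```

Only the leading `rank` singular triplets are kept, which matches the compact parameterization with $r$ positive singular values.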

Motivated by this geometry, we formalize the information contained in $\mathbf{X}$ as a pair

$$\mathcal{I}(\mathbf{X})=\big(\mathcal{S}_{\mathbf{X}},\mathcal{D}_{\mathbf{X}}\big).$$

Here $\mathcal{S}_{\mathbf{X}}$ denotes the *information complexity*, determined by the singular spectrum, and $\mathcal{D}_{\mathbf{X}}$ denotes the *information support*, determined by the left and right subspaces. We detail these two components next.

#### 3.2.1 Information complexity (Spectrum $\mathcal{S}_{\mathbf{X}}$)

Based on Theorem [F.2](https://arxiv.org/html/2605.05668#A6.Thmtheorem2) (Eckart and Young, [1936](https://arxiv.org/html/2605.05668#bib.bib41)), the singular values determine the optimal rank-$k$ approximation error and therefore quantify how much of $\mathbf{X}$ can be captured by its leading principal directions. We thus summarize the concentration versus spread of the singular spectrum into an effective dimensionality using *effective rank* (eRank):

###### Definition 3.4 (Rank and Effective rank (Roy and Vetterli, [2007](https://arxiv.org/html/2605.05668#bib.bib43))).

For $\mathbf{X}$ with singular values $\{\sigma_{i}\}_{i=1}^{Q}$, the rank is

$$\mathrm{rank}(\mathbf{X})=\big|\{i:\sigma_{i}>0\}\big|.$$

Let $p_{i}=\sigma_{i}\big/\sum_{j=1}^{Q}\sigma_{j}$ be the normalized singular spectrum. We define the Spectrum $\mathcal{S}_{\mathbf{X}}$ of the matrix as its effective rank:

$$\mathcal{S}_{\mathbf{X}}=\mathrm{eRank}(\mathbf{X})=\exp\!\Big(-\sum_{i=1}^{Q}p_{i}\log p_{i}\Big).$$

This quantity corresponds to the *scale* component in the SVD-based representation, namely the singular spectrum $\mathbf{\Sigma}$.
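A minimal sketch of the effective rank in Definition 3.4, assuming NumPy. The helper name `effective_rank` and the `eps` cutoff (used to apply the convention $0\log 0=0$) are illustrative choices, not the authors' implementation.

```python
import numpy as np

def effective_rank(X, eps=1e-12):
    """eRank(X): exponential of the Shannon entropy of the normalized singular values."""
    sigma = np.linalg.svd(X, compute_uv=False)
    total = sigma.sum()
    if total == 0:
        raise ValueError("effective rank is undefined for the zero matrix")
    p = sigma / total
    p = p[p > eps]                      # drop zero entries: 0 * log 0 is treated as 0
    return float(np.exp(-(p * np.log(p)).sum()))

flat = effective_rank(np.eye(4))                                # uniform spectrum
peaked = effective_rank(np.outer([1.0, 2.0, 3.0], [1.0, 1.0]))  # rank-one matrix
```

A uniform spectrum gives an effective rank equal to the number of singular values (4 for the identity here), while a rank-one matrix gives an effective rank of 1, the two extremes of spread versus concentration.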

#### 3.2.2 Information support (Support $\mathcal{D}_{\mathbf{X}}$)

This component corresponds to the Grassmann points $\mathcal{C}(\mathbf{X})$ and $\mathcal{R}(\mathbf{X})$ in the manifold parameterization. We view “semantics” as the linear subspaces occupied by the data in the ambient vector spaces; under the manifold hypothesis, high-dimensional semantic structure often concentrates near low-dimensional subspaces. Concretely, consistent with the roles assigned above, the row space $\mathcal{R}(\mathbf{X})\subseteq\mathbb{R}^{H}$ (spanned by $\mathbf{V}$) specifies what semantic directions the layer representation can express, while the column space $\mathcal{C}(\mathbf{X})\subseteq\mathbb{R}^{S}$ (spanned by $\mathbf{U}$) specifies linear dependency structure among tokens. In practice, we parameterize these Grassmann points using the orthonormal bases from SVD via the associated orthogonal projectors:

$$\mathbf{P}_{\mathcal{C}(\mathbf{X})}=\mathbf{U}\mathbf{U}^{\top},\;\mathbf{P}_{\mathcal{R}(\mathbf{X})}=\mathbf{V}\mathbf{V}^{\top},\;\mathcal{D}_{\mathbf{X}}=(\mathbf{P}_{\mathcal{C}(\mathbf{X})},\mathbf{P}_{\mathcal{R}(\mathbf{X})}),$$

which uniquely determine the supporting subspaces of $\mathbf{X}$.
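The projectors above can be sketched as follows. One assumption is made explicit in the code: the bases are truncated to the numerical rank (singular values above a relative tolerance) so that $\mathbf{U}\mathbf{U}^{\top}$ and $\mathbf{V}\mathbf{V}^{\top}$ project onto the column and row spaces only. The helper name `support_projectors` and the toy matrix are hypothetical.

```python
import numpy as np

def support_projectors(X, tol=1e-10):
    """Orthogonal projectors onto the column space C(X) and row space R(X)."""
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    r = int((sigma > tol * sigma[0]).sum())   # numerical rank (assumed threshold)
    U_r, V_r = U[:, :r], Vt[:r].T
    P_C = U_r @ U_r.T                         # (S, S), acts on the token axis
    P_R = V_r @ V_r.T                         # (H, H), acts on the feature axis
    return P_C, P_R

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))  # rank-2 toy matrix
P_C, P_R = support_projectors(X)
```

Because $\mathbf{X}$ already lies in its own support, `P_C @ X` and `X @ P_R` both return `X`, and the trace of each projector equals the rank.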

Discussion. We have thus answered RQ1 by formalizing the information contained in a representation $\mathbf{X}$ as two complementary components: the singular spectrum $\mathbf{\Sigma}$ encodes how energy is distributed across principal directions and thereby quantifies information complexity $\mathcal{S}_{\mathbf{X}}$, while the orthonormal factors $(\mathbf{U},\mathbf{V})$ determine the supporting subspaces $\mathcal{C}(\mathbf{X})=\mathrm{span}(\mathbf{U})$ and $\mathcal{R}(\mathbf{X})=\mathrm{span}(\mathbf{V})$, fixing the geometric orientation of the representation in token and feature spaces and capturing structured semantics $\mathcal{D}_{\mathbf{X}}$.

### 3.3 Quantifying the Contribution of an Update $\Delta\mathbf{X}$ (RQ2)

In Section [3.2](https://arxiv.org/html/2605.05668#S3.SS2), we answered RQ1 by defining the information in a representation as $\mathcal{I}(\mathbf{X})=(\mathcal{S}_{\mathbf{X}},\mathcal{D}_{\mathbf{X}})$. We now address RQ2: *How should we quantify what $\Delta\mathbf{X}$ contributes to $\mathbf{X}$?* Given an additive update $\mathbf{X}^{\prime}=\mathbf{X}+\Delta\mathbf{X}$, its effect on $\mathbf{X}$ admits three complementary and collectively exhaustive categories under our decomposition:

1. Spectrum change (change in $\mathcal{S}_{\mathbf{X}}$): $\Delta\mathbf{X}$ reshapes the singular spectrum, inducing compression or expansion of the effective dimensionality, which reflects how information mass is redistributed across principal directions.
2. Support change (change in $\mathcal{D}_{\mathbf{X}}$): $\Delta\mathbf{X}$ perturbs the column and row subspaces, introducing or removing semantic support directions, namely a geometric shift in what the representation can express and how tokens linearly depend on one another.
3. Internal interaction (no external support): $\Delta\mathbf{X}$ remains within the existing support and acts by *reconfiguration*, namely reorganizing and reallocating information already present in $\mathbf{X}$ without injecting new support directions.

The first two categories reflect external information injection that changes complexity or support\. The third captures*reconfiguration*, since it reflects internal redistribution within the existing information support\. We next define measures for external information injection and reconfiguration\.

#### 3\.3\.1Measuring External Information Injection

Spectrum change. We quantify the spectrum change by the eRank variation, normalized to lie in $[0,1]$:

$$\Delta\mathcal{S}(\mathbf{X}\mid\mathbf{X}')=\frac{\big|\mathrm{eRank}(\mathbf{X}')-\mathrm{eRank}(\mathbf{X})\big|}{\min\{S,H\}}.$$
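A minimal NumPy sketch of the spectrum-change term follows. It assumes that eRank is the common effective rank, that is, the exponential of the Shannon entropy of the sum-normalized singular values; the paper's own definition in Section 3.2 takes precedence if it differs. The function names `erank` and `delta_s` are hypothetical.

```python
import numpy as np


def erank(X):
    """Effective rank: exp of the entropy of sum-normalized singular values.

    Assumes X is a nonzero 2-D array (assumption: standard eRank definition).
    """
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so that 0 * log(0) is treated as 0
    return float(np.exp(-(p * np.log(p)).sum()))


def delta_s(X, X_new):
    """Spectrum change: |eRank(X') - eRank(X)| / min(S, H)."""
    S, H = X.shape
    return abs(erank(X_new) - erank(X)) / min(S, H)


# Toy check: a rank-one matrix versus the identity (flat spectrum).
print(delta_s(np.ones((4, 4)), np.eye(4)))
```

In this toy case the effective rank moves from about 1 to about 4 on a 4-by-4 matrix, so the normalized change is close to 0.75.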
Support innovation. To measure how much new support is introduced by $\Delta\mathbf{X}$, we use the innovation view from least squares, where innovation is the residual after projecting onto a reference subspace:

###### Definition 3.5 (Subspace Innovation).

Let $\mathcal{U}\subseteq\mathbb{R}^{d}$ be a linear subspace with orthogonal projector $\mathbf{P}_{\mathcal{U}}$. For an observation $\mathbf{y}\in\mathbb{R}^{d}$, the least-squares prediction in $\mathcal{U}$ is $\hat{\mathbf{y}}=\mathbf{P}_{\mathcal{U}}\mathbf{y}$. The innovation is the residual (Hassibi et al., [2000](https://arxiv.org/html/2605.05668#bib.bib46))

$$\tilde{\mathbf{y}}=\mathbf{y}-\hat{\mathbf{y}}=(\mathbf{I}-\mathbf{P}_{\mathcal{U}})\mathbf{y}.$$

Analogously, we define the *support innovation* of the update $\Delta\mathbf{X}$ relative to $\mathbf{X}$ as the energy that lies in the orthogonal complements of the column and row spaces of $\mathbf{X}$. Let $\mathbf{P}_{\mathcal{C}(\mathbf{X})}$ and $\mathbf{P}_{\mathcal{R}(\mathbf{X})}$ be the orthogonal projectors onto $\mathcal{C}(\mathbf{X})$ and $\mathcal{R}(\mathbf{X})$. We define

$$\Delta\mathcal{D}(\mathbf{X}\mid\mathbf{X}')=\frac{\big\|(\mathbf{I}-\mathbf{P}_{\mathcal{C}(\mathbf{X})})\mathbf{X}'\big\|_{F}+\big\|\mathbf{X}'(\mathbf{I}-\mathbf{P}_{\mathcal{R}(\mathbf{X})})\big\|_{F}}{2\times\|\mathbf{X}'\|_{F}}.$$

Because $(\mathbf{I}-\mathbf{P}_{\mathcal{C}(\mathbf{X})})\mathbf{X}=\mathbf{0}$ and $\mathbf{X}(\mathbf{I}-\mathbf{P}_{\mathcal{R}(\mathbf{X})})=\mathbf{0}$, both residuals in the numerator depend only on $\Delta\mathbf{X}$, so the measure isolates the part of the update that falls outside the existing support.
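The support-innovation formula can be sketched with projectors built from a truncated SVD. This is an illustration under assumptions, not the authors' implementation: the numerical-rank threshold `rtol` and the function name `support_innovation` are hypothetical, and the paper may determine the supporting subspaces differently (for example, with a different truncation rule).

```python
import numpy as np


def support_innovation(X, X_new, rtol=1e-10):
    """Delta D: normalized energy of X_new outside the column/row spaces of X.

    rtol is an assumed threshold for deciding the numerical rank of X.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int((s > rtol * s[0]).sum())       # numerical rank of X
    U_r, V_r = U[:, :r], Vt[:r].T          # bases of C(X) and R(X)

    # Residuals after projecting onto the column space and the row space.
    col_resid = X_new - U_r @ (U_r.T @ X_new)
    row_resid = X_new - (X_new @ V_r) @ V_r.T

    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    num = np.linalg.norm(col_resid) + np.linalg.norm(row_resid)
    return float(num / (2.0 * np.linalg.norm(X_new)))


X = np.array([[1.0, 0.0], [0.0, 0.0]])    # supported on one direction
print(support_innovation(X, 2.0 * X))     # rescaling adds no new support
print(support_innovation(X, np.eye(2)))   # identity adds a second direction
```

Rescaling $\mathbf{X}$ yields zero innovation, while moving to the identity places half of the squared energy outside the original support, giving $1/\sqrt{2}$.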
##### Two-dimensional innovation vector.

The two terms above capture complementary channels of external information injection. Spectrum change $\Delta\mathcal{S}$ measures variation in effective dimensionality, while support innovation $\Delta\mathcal{D}$ measures novelty in the column and row subspaces. We therefore first represent innovation as a two-dimensional quantity:

$$\Delta\mathcal{I}(\mathbf{X}\mid\mathbf{X}')=\big(\Delta\mathcal{S}(\mathbf{X}\mid\mathbf{X}'),\,\Delta\mathcal{D}(\mathbf{X}\mid\mathbf{X}')\big).$$

Using either component alone may miss complementary cases, such as subspace change with little spectral variation. Since both components are normalized to comparable ranges, we aggregate them into a scalar summary score, defined next.

###### Definition 3.6 (Representation Information Discrepancy (RID)).

Given two representation matrices $\mathbf{X},\mathbf{X}'\in\mathbb{R}^{S\times H}$, we define the *Representation Information Discrepancy* as the sum of the spectrum change and the support innovation:

$$\mathrm{RID}(\mathbf{X}\mid\mathbf{X}')=\Delta\mathcal{S}(\mathbf{X}\mid\mathbf{X}')+\Delta\mathcal{D}(\mathbf{X}\mid\mathbf{X}').$$

RID measures how a representation changes in spectral complexity and subspace novelty, and satisfies $\mathrm{RID}\in[0,2]$ (Lemma [F.1](https://arxiv.org/html/2605.05668#A6.Thmtheorem1)). Since positional encoding and parameterization effects make $\mathrm{RID}$ rarely exactly zero in practice, we introduce a tolerance $\epsilon>0$ and treat $\mathbf{X}'$ as information-preserving relative to $\mathbf{X}$ whenever $\mathrm{RID}(\mathbf{X}\mid\mathbf{X}')\approx\epsilon$; concretely, we set

$$\epsilon_{\text{RoPE}}=\mathrm{RID}\Big(\mathbf{X}^{(\text{RoPE})}_{\mathrm{in}}\;\Big|\;\mathbf{X}^{(\text{no-RoPE})}_{\mathrm{in}}\Big),$$

which calibrates $\epsilon$ to the intrinsic discrepancy induced by Rotary Positional Encoding (RoPE) (Su et al., [2024](https://arxiv.org/html/2605.05668#bib.bib78)).
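The stated range of RID can be checked with a short argument. The sketch below is our own reasoning and is not the proof of Lemma F.1; it assumes that eRank takes values in $[1,\min\{S,H\}]$ for nonzero matrices and that $\mathbf{P}$ denotes either orthogonal projector, which cannot increase the Frobenius norm.

$$
\begin{aligned}
0 \le \Delta\mathcal{S}(\mathbf{X}\mid\mathbf{X}')
  &\le \frac{\min\{S,H\}-1}{\min\{S,H\}} < 1,\\
\big\|(\mathbf{I}-\mathbf{P})\mathbf{X}'\big\|_{F} \le \|\mathbf{X}'\|_{F}
  \;&\Longrightarrow\;
  0 \le \Delta\mathcal{D}(\mathbf{X}\mid\mathbf{X}')
  \le \frac{\|\mathbf{X}'\|_{F}+\|\mathbf{X}'\|_{F}}{2\,\|\mathbf{X}'\|_{F}} = 1,\\
0 \le \mathrm{RID}(\mathbf{X}\mid\mathbf{X}')
  &= \Delta\mathcal{S}+\Delta\mathcal{D} \le 2.
\end{aligned}
$$

Each term is individually bounded by one, which is why the two components live on comparable scales before they are summed.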

#### 3.3.2 Measuring Reconfiguration

Another effect of $\Delta\mathbf{X}$ is *reconfiguration*, namely redistributing information within the existing support. We measure this internal redistribution via a token-to-token mixing entropy.

###### Definition 3.7 (Token Mixing Entropy (TME)).

Given a hidden-state matrix $\mathbf{X}\in\mathbb{R}^{S\times H}$ with row vectors $\mathbf{x}_{t}\in\mathbb{R}^{H}$, define $\tilde{\mathbf{x}}_{t}=\mathbf{x}_{t}/\|\mathbf{x}_{t}\|_{2}$ as the unit direction vector. We form a token-to-token mixing distribution by mapping pairwise token similarities to $[0,1]$ and then row-normalizing:

$$P_{t,j}=\frac{\frac{\tilde{\mathbf{x}}_{t}^{\top}\tilde{\mathbf{x}}_{j}+1}{2}}{\sum_{k=1}^{S}\frac{\tilde{\mathbf{x}}_{t}^{\top}\tilde{\mathbf{x}}_{k}+1}{2}},\qquad t,j\in\{1,\ldots,S\}.$$

The Token Mixing Entropy is the average Shannon entropy of these distributions:

$$\mathrm{TME}(\mathbf{X})=-\frac{1}{S}\sum_{t=1}^{S}\sum_{j=1}^{S}P_{t,j}\log P_{t,j}.$$

$\mathrm{TME}(\mathbf{X})$ provides an operational measure of token-level interaction by summarizing how broadly each token mixes with the rest of the sequence. It constructs a token-to-token mixing distribution from pairwise similarity and quantifies its uncertainty via entropy, so larger $\mathrm{TME}$ indicates more diffuse, globally shared interactions, whereas smaller $\mathrm{TME}$ indicates more concentrated, selective mixing.

###### Definition 3.8 (Mixing Information Gain (MixIG)).

For an updated representation $\mathbf{X}'=\mathbf{X}+\Delta\mathbf{X}$, we define the mixing information gain as the change in token mixing entropy:

$$\mathrm{MixIG}(\mathbf{X}\mid\mathbf{X}')=\mathrm{TME}(\mathbf{X}')-\mathrm{TME}(\mathbf{X}).$$

This quantity captures how strongly the update increases or decreases token-to-token mixing, and thus serves as an operational measure of *reconfiguration* within the existing information support.
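Definitions 3.7 and 3.8 translate almost directly into NumPy. The sketch below assumes the natural logarithm (the base is not specified above) and nonzero token vectors; the function names `tme` and `mixig` are hypothetical.

```python
import numpy as np


def tme(X):
    """Token Mixing Entropy of a hidden-state matrix X with shape (S, H)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit token directions
    sim = np.clip((Xn @ Xn.T + 1.0) / 2.0, 0.0, 1.0)    # cosine mapped to [0, 1]
    P = sim / sim.sum(axis=1, keepdims=True)            # row-normalize
    # Treat 0 * log(0) as 0 by only evaluating the log where P > 0.
    logP = np.log(P, where=P > 0, out=np.zeros_like(P))
    return float(-(P * logP).sum(axis=1).mean())


def mixig(X, X_new):
    """Mixing Information Gain: TME(X') - TME(X)."""
    return tme(X_new) - tme(X)


identical = np.ones((2, 2))                      # both tokens point the same way
antipodal = np.array([[1.0, 0.0], [-1.0, 0.0]])  # tokens point in opposite ways
print(tme(identical), tme(antipodal), mixig(antipodal, identical))
```

Identical tokens give a uniform mixing distribution and hence the maximum entropy $\log S$, while antipodal tokens mix only with themselves and give zero entropy, so moving from the second case to the first yields a MixIG of $\log 2$.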

Discussion. In this section, we answer RQ2 with two complementary metrics: RID and MixIG. RID quantifies *innovation* by measuring how $\Delta\mathbf{X}$ changes the representation through spectral complexity shifts and support novelty, indicating external information injection beyond the current subspace. MixIG quantifies *reconfiguration* by measuring how $\Delta\mathbf{X}$ reshapes token-to-token mixing within the existing support, capturing internal redistribution of information without introducing new support directions.

## 4 Redundancy and Misallocation in LVLM Visual Attention (RQ3)

In this section, we build on our theoretical framework to answer RQ3: *How can we use $\Delta\mathbf{X}$ to analyze and contrast the functional roles of different modules?* Through experiments, we uncover a common pathology in Transformer-based LVLMs: models can get lost in attention. We first describe the experimental setups in Section [4.1](https://arxiv.org/html/2605.05668#S4.SS1). Then, in Section [4.2](https://arxiv.org/html/2605.05668#S4.SS2), we use RID and MixIG to show that different modules exhibit orthogonal functional roles, complementing prior statistically grounded interpretability studies (Kang et al., [2025](https://arxiv.org/html/2605.05668#bib.bib27); Geva et al., [2021](https://arxiv.org/html/2605.05668#bib.bib10)). Finally, in Section [4.3](https://arxiv.org/html/2605.05668#S4.SS3), we replace attention scores with predefined values, and the results indicate substantial redundancy in existing LVLM attention.

### 4.1 Experimental Setups

Model settings. We evaluate 15 open-source LVLM variants spanning three mainstream architectures. Specifically, we consider Qwen-family models (Qwen-2.5-VL (Team, [2025](https://arxiv.org/html/2605.05668#bib.bib49)), CoF (Wei et al., [2022](https://arxiv.org/html/2605.05668#bib.bib7)), Reverse (Wu et al., [2025](https://arxiv.org/html/2605.05668#bib.bib51)), MM-Eureka (Meng et al., [2025](https://arxiv.org/html/2605.05668#bib.bib52)), Orsta (Ma et al., [2025b](https://arxiv.org/html/2605.05668#bib.bib53)), Ocean-R1 (Ming et al., [2025](https://arxiv.org/html/2605.05668#bib.bib54))), LLaVA-1.5-family models (LLaVA-1.5 (Liu et al., [2024a](https://arxiv.org/html/2605.05668#bib.bib55)), Yi-VL (AI et al., [2024](https://arxiv.org/html/2605.05668#bib.bib56))), and LLaVA-NeXT-family models (LLaVA-OneVision (Li et al., [2024a](https://arxiv.org/html/2605.05668#bib.bib57)), Mistral-1.6 and Vicuna-1.6 (Liu et al., [2024b](https://arxiv.org/html/2605.05668#bib.bib58))).

Tasks and benchmarks. Our experiments are conducted on a broad suite of multimodal benchmarks, including POPE (Li et al., [2023b](https://arxiv.org/html/2605.05668#bib.bib59)), 3DSRBench (Ma et al., [2025a](https://arxiv.org/html/2605.05668#bib.bib60)), RealWorldQA (Visheratin, [2024](https://arxiv.org/html/2605.05668#bib.bib61)), MMMU (Yue et al., [2023](https://arxiv.org/html/2605.05668#bib.bib62)), VMC-Bench (Zhang et al., [2025c](https://arxiv.org/html/2605.05668#bib.bib63)), MathVista (Lu et al., [2023](https://arxiv.org/html/2605.05668#bib.bib64)), and HallusionBench (Guan et al., [2024](https://arxiv.org/html/2605.05668#bib.bib65)). Together, these benchmarks evaluate LVLM capabilities from basic visual perception to advanced multimodal reasoning. Details on the benchmarks are provided in Appendix [C.1](https://arxiv.org/html/2605.05668#A3.SS1).

### 4.2 Interpreting the Functional Roles of Attention and FFN

![Refer to caption](https://arxiv.org/html/2605.05668v1/x2.png)

Figure 2: Model-wise $\mathrm{RID}$ and $\mathrm{MixIG}$ for Attention and FFN. Across architectures and training variants, a clear and consistent separation emerges between attention and FFN contributions, indicating that our framework captures an intrinsic functional distinction between the two submodules. Here, $\epsilon_{\text{RoPE}}=0.062$.

To systematically dissect the information dynamics within the residual stream, we track the evolution of $\mathrm{RID}$ and $\mathrm{MixIG}$ across all layers $l$, using a random sample of 1000 instances from each dataset. We design three comparative settings to isolate the contributions of learned architectural components versus stochastic interference:

1. Stochastic Baselines ($\mathbf{X}^{l}_{\mathrm{noise}}$): We introduce two randomization strategies to validate metric sensitivity and isolate learned functional properties: (1) Noise $\mathbf{\Delta}$, where the attention update is replaced by Gaussian noise matching the empirical moments of $\Delta\mathbf{X}_{\mathrm{attn}}$, serving as a negative control to verify the detection of unstructured, off-manifold perturbations; (2) Noise $\mathbf{QKV}$, where learned weight matrices are replaced by Gaussian initializations, serving to demonstrate that the subspace-preserving nature of attention is a learned behavior, as unoptimized linear transformations would otherwise significantly perturb the feature space. In both cases, we match the noise mean to that of $\Delta\mathbf{X}_{\mathrm{attn}}$ (Theorem [F.3](https://arxiv.org/html/2605.05668#A6.Thmtheorem3)).
2. Attention Contribution: We measure the transition from input to post-attention states via $\mathrm{RID}(\mathbf{X}^{l}_{\mathrm{in}}\mid\mathbf{X}^{l}_{\mathrm{attn}})$ and $\mathrm{MixIG}(\mathbf{X}^{l}_{\mathrm{in}}\mid\mathbf{X}^{l}_{\mathrm{attn}})$.
3. FFN Contribution: We measure the transition from post-attention to post-FFN states via $\mathrm{RID}(\mathbf{X}^{l}_{\mathrm{attn}}\mid\mathbf{X}^{l}_{\mathrm{ffn}})$ and $\mathrm{MixIG}(\mathbf{X}^{l}_{\mathrm{attn}}\mid\mathbf{X}^{l}_{\mathrm{ffn}})$.
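The Noise $\mathbf{\Delta}$ control in the first setting can be sketched as follows. This is one plausible reading and not the authors' code: it assumes that "empirical moments" means the global mean and standard deviation over all entries of the attention update, and the names `gaussian_noise_update`, `x_in`, and `delta_attn` are hypothetical stand-ins for tensors captured from a real model.

```python
import numpy as np


def gaussian_noise_update(delta_attn, rng):
    """Gaussian surrogate with the same global mean and std as delta_attn.

    Assumption: moments are matched over all entries, not per token or channel.
    """
    mu, sigma = delta_attn.mean(), delta_attn.std()
    return rng.normal(loc=mu, scale=sigma, size=delta_attn.shape)


rng = np.random.default_rng(0)
x_in = rng.normal(size=(256, 64))                      # stand-in layer input
delta_attn = 0.1 * rng.normal(size=(256, 64)) + 0.02   # stand-in attention update

noise = gaussian_noise_update(delta_attn, rng)
x_noise = x_in + noise   # noisy residual state, to be compared against x_in
```

The pair `(x_in, x_noise)` would then be scored with RID and MixIG in the same way as the attention and FFN transitions in the other two settings.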

The aggregated statistics are shown in Table [1](https://arxiv.org/html/2605.05668#S4.T1) and Figure [2](https://arxiv.org/html/2605.05668#S4.F2), while layer-wise trajectories are illustrated in Figure [3](https://arxiv.org/html/2605.05668#S4.F3).

Table 1: Module-wise RID and MixIG with qualitative signatures.

| Module | RID | MixIG | Characteristic |
| --- | --- | --- | --- |
| Noise $\Delta$ | 0.61 | -0.80 | Very high RID, negative MixIG |
| Noise $\mathbf{QKV}$ | 0.44 | -0.50 | Very high RID, negative MixIG |
| Attention | 0.06 | 0.61 | Low RID, high MixIG |
| Feed-Forward | 0.21 | 0.02 | High RID, low MixIG |

![Refer to caption](https://arxiv.org/html/2605.05668v1/x3.png)

Figure 3: Layer-wise $\mathrm{RID}$ and $\mathrm{MixIG}$ for Attention and FFN. More sample visualizations are provided in Figures [5](https://arxiv.org/html/2605.05668#A6.F5)–[10](https://arxiv.org/html/2605.05668#A6.F10).

Table 2: Benchmark results under different SAP modes. We bold the best results and underline the runner-ups *within each model*.

| Model / Variant | Affected Layers | POPE | RWQA | 3DSRBench | MMMU | VMCBench | HallusionBench | MathVista |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-2.5-VL-3B | / | 86.13 | 59.35 | 53.46 | 47.78 | 72.31 | 66.97 | 61.5 |
| + Vis. Attn. | | 87.58 | 61.38 | 53.94 | 48.29 | 72.67 | 68.66 | 61.6 |
| + Patch Comp. | | 87.47 | 61.62 | 54.14 | 47.88 | 72.59 | 69.19 | 61.7 |
| + Noise | [1, 27] | 87.40 | 60.52 | 53.85 | 48.29 | 72.66 | 69.09 | 61.6 |
| Qwen-2.5-VL-7B | / | 86.54 | 65.75 | 55.63 | 51.77 | 74.34 | 69.19 | 63.3 |
| + Vis. Attn. | | 87.62 | 66.14 | 56.60 | 51.18 | 74.77 | 70.98 | 63.1 |
| + Patch Comp. | | 87.73 | 66.54 | 56.74 | 51.32 | 74.80 | 71.40 | 63.1 |
| + Noise | [1, 27] | 87.51 | 66.54 | 56.56 | 51.76 | 74.76 | 70.35 | 62.9 |
| LLaVA-1.5-7B | / | 74.38 | 47.71 | 47.53 | 34.12 | 48.71 | 41.63 | 21.9 |
| + Vis. Attn. | | 75.79 | 50.20 | 48.65 | 34.71 | 52.23 | 44.29 | 23.2 |
| + Patch Comp. | | 75.30 | 50.85 | 48.96 | 35.18 | 52.29 | 42.42 | 23.6 |
| + Noise | [18, 23] | 75.02 | 47.58 | 48.81 | 35.23 | 50.70 | 42.69 | 22.9 |
| LLaVA-OneVision-7B | / | 86.21 | 56.73 | 55.54 | 41.51 | 66.79 | 46.94 | 63.7 |
| + Patch Comp. | | 87.78 | 60.26 | 57.22 | 42.76 | 68.79 | 47.48 | 63.7 |
| + Noise | [21, 27] | 87.28 | 59.09 | 56.72 | 40.99 | 67.80 | 47.03 | 64.3 |

Our observations are as follows:

Obs ❶. Metric Discriminability and Subspace Sensitivity. Table [1](https://arxiv.org/html/2605.05668#S4.T1) validates our metrics through stochastic baselines. Noise $\Delta$ and Noise $\mathbf{QKV}$ serve as negative controls for testing whether RID and MixIG can distinguish structured module updates from unstructured perturbations. The substantially higher RID and negative MixIG of both baselines show that unstructured perturbations are correctly identified as off-subspace disruptions with reduced token mixing, confirming that the low-RID, positive-MixIG profile of attention reflects a learned structured update rather than a metric artifact.

Obs ❷. The Orthogonal Roles of Attention and FFN. Figure [2](https://arxiv.org/html/2605.05668#S4.F2) shows a consistent separation between attention and FFN across 15 LVLM variants. Attention updates exhibit negligible innovation (on the order of $\epsilon_{\text{RoPE}}$) but strong reconfiguration, acting as a *subspace-preserving operator*. In contrast, FFN updates exhibit substantial innovation with weak reconfiguration, acting as a *subspace-expanding operator*. Together, these results quantify a clear division of labor: attention primarily *contextualizes* existing information via rearrangement, whereas FFNs primarily *compute* new semantic features via subspace expansion.

Obs ❸. Misallocation in visual attention. The layer-wise analysis in Figure [3](https://arxiv.org/html/2605.05668#S4.F3) suggests a heterogeneous role of attention across depth: while some layers exhibit pronounced reconfiguration (e.g., Layer 0 and layers around $40\%$ depth), cross-token interactions remain sparse in most layers. Motivated by this pattern, we further visualize attention-mediated cross-patch interactions in Figure [3](https://arxiv.org/html/2605.05668#S4.F3)(b) by linking patch pairs whose query–key score is $\geq 0.1$. We model patch interactions as a graph and measure the degree share of question-relevant regions: this share is substantially lower for incorrect samples ($4.2\%$) than for correct ones ($13.1\%$), exposing a systematic *misallocation* of visual attention in current LVLM decoders. We further discuss this analysis in Appendix [E](https://arxiv.org/html/2605.05668#A5).
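The degree-share measurement can be sketched as below. The exact construction is given in Appendix E and may differ; this version assumes an undirected graph without self-loops and defines the share as the fraction of total node degree that falls on question-relevant patches. The function name `relevant_degree_share`, the toy score matrix, and the relevance mask are hypothetical.

```python
import numpy as np


def relevant_degree_share(scores, relevant, threshold=0.1):
    """Share of graph degree held by question-relevant patches.

    scores:   (P, P) patch-to-patch attention scores.
    relevant: (P,) boolean mask of question-relevant patches.
    """
    adj = scores >= threshold          # link pairs whose score passes the cut
    adj = adj | adj.T                  # assumption: treat links as undirected
    np.fill_diagonal(adj, False)       # assumption: ignore self-links
    degree = adj.sum(axis=1)
    total = degree.sum()
    return float(degree[relevant].sum() / total) if total > 0 else 0.0


scores = np.array([[0.00, 0.50, 0.00],
                   [0.05, 0.00, 0.00],
                   [0.00, 0.02, 0.00]])
relevant = np.array([True, False, False])
share = relevant_degree_share(scores, relevant)
print(share)
```

In the toy graph only patches 0 and 1 are linked, so the single relevant patch holds half of the total degree.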

Summary. In this section, we validate the discriminability of our metrics (Obs ❶) and confirm a robust module-level functional separation across diverse LVLM variants (Obs ❷). We further find that attention often fails to allocate and reorganize information around question-relevant visual evidence (Obs ❸). This naturally raises a follow-up question: if attention scores exhibit such misallocation, are they largely redundant and replaceable? We answer this question in the next section via targeted interventions.

### 4.3 Replacing Attention Scores with Priors

![Refer to caption](https://arxiv.org/html/2605.05668v1/x4.png)

Figure 4: MHSA Replacement with Shared Attention Prior. Causal masking is still applied after the replacement.

To further validate that a substantial portion of LVLM attention computation is redundant, we intervene on the decoder by replacing attention scores in selected layers with a shared attention prior (SAP). As illustrated in Figure [4](https://arxiv.org/html/2605.05668#S4.F4), we consider three replacement modes: *(i) Visual-encoder attention*, which injects attention maps derived from the visual encoder; *(ii) Patch complexity*, which uses a precomputed patch-wise complexity prior based on within-patch color variance and edge-gradient magnitude; and *(iii) Noise*, which substitutes scores with Gaussian noise. Details of the SAP experiments are provided in Appendix [C.3](https://arxiv.org/html/2605.05668#A3.SS3).
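A single-head sketch of the Noise mode follows. It is a simplification under assumptions and not the procedure of Appendix C.3: the prior is treated as pre-softmax scores, the same prior would be shared across heads, and the names `softmax` and `prior_attention_weights` are hypothetical. The only detail taken from the text is that causal masking is applied after the replacement.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def prior_attention_weights(prior_scores):
    """Turn a predefined (S, S) score matrix into causal attention weights."""
    S = prior_scores.shape[-1]
    causal = np.tril(np.ones((S, S), dtype=bool))   # token t sees tokens <= t
    masked = np.where(causal, prior_scores, -np.inf)
    return softmax(masked, axis=-1)


rng = np.random.default_rng(0)
S, d = 5, 4
V = rng.normal(size=(S, d))          # value vectors of one head (stand-in)
prior = rng.normal(size=(S, S))      # Noise mode: scores are Gaussian noise

weights = prior_attention_weights(prior)
out = weights @ V                    # replaces softmax(QK^T / sqrt(d)) @ V
```

Swapping `prior` for scores derived from visual-encoder attention maps or from a patch-complexity map would give the other two modes under the same masking and normalization.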

Table [2](https://arxiv.org/html/2605.05668#S4.T2) reports the SAP replacement results on three backbone families (Qwen-2.5-VL, LLaVA-1.5, and LLaVA-NeXT). Detailed ablations on affected layers and heads, as well as experiments on larger models and more variants, are provided in Appendix [D](https://arxiv.org/html/2605.05668#A4).

Obs ❹. Substantial redundancy in LVLM visual attention. Across models and benchmarks (Table [2](https://arxiv.org/html/2605.05668#S4.T2)), replacing decoder attention scores with these predefined patterns generally does not degrade performance and often yields improvements; the few drops are small (at most about 0.6 points in Table 2, e.g., MMMU for Qwen-2.5-VL-7B with visual-encoder attention). This indicates that, for current LVLMs, a large fraction of visual-attention scoring is not functionally necessary, revealing substantial redundancy in decoder visual attention. This observation is consistent with recent visual token pruning works (Wen et al., [2025](https://arxiv.org/html/2605.05668#bib.bib70); Zhang et al., [2025a](https://arxiv.org/html/2605.05668#bib.bib69)).

## 5 Discussion and Conclusion

We propose a unified theoretical framework for assessing how residual-stream updates shape representations in large models. Applying it to LVLMs reveals a consistent module-level functional separation, where attention primarily supports token-level reconfiguration while FFNs drive innovation, and further diagnoses a pervasive failure mode in current decoders: visual attention often misallocates interaction away from question-relevant evidence. Motivated by this deficiency, we conduct a proof-of-concept intervention by replacing attention scores in selected layers with simple predefined priors, and observe little to no degradation in capability, suggesting substantial redundancy in learned scoring. Beyond these specific findings, our framework and empirical protocol offer a general tool for evaluating residual-update mechanisms across model families and motivate targeted attention-centric optimization.

In conclusion, our framework turns LVLM residual updates into measurable innovation–reconfiguration dynamics and provides evidence that current Transformer-based LVLMs can *get lost in attention*. Future work includes extending the analysis to training-time dynamics and leveraging the observed redundancy to design more efficient attention mechanisms or regularizers that preserve useful mixing while reducing unnecessary scoring.

## Impact Statement

This paper presents work whose goal is to advance the field of Large Vision–Language Model Interpretability. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

- S. Abnar and W. Zuidema (2020) Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Cited by: [Appendix B](https://arxiv.org/html/2605.05668#A2.p1.1), [Appendix E](https://arxiv.org/html/2605.05668#A5.SS0.SSS0.Px1.p1.7), [§1](https://arxiv.org/html/2605.05668#S1.p3.1), [§2](https://arxiv.org/html/2605.05668#S2.p2.1).
- P.-A. Absil, R. Mahony, and R. Sepulchre (2008) Optimization algorithms on matrix manifolds. Princeton University Press. Cited by: [§3.2](https://arxiv.org/html/2605.05668#S3.SS2.p3.4).
- K. K. Agrawal, A. K. Mondal, A. Ghosh, and B. Richards (2022) $\alpha$-ReQ: assessing representation quality in self-supervised learning by measuring eigenspectrum decay. Advances in Neural Information Processing Systems 35, pp. 17626–17638. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p3.1).
- 01.AI, A. Young, B. Chen, C. Li, C. Huang, G. Zhang, G. Zhang, H. Li, J. Zhu, J. Chen, J. Chang, K. Yu, P. Liu, Q. Liu, S. Yue, S. Yang, S. Yang, T. Yu, W. Xie, W. Huang, X. Hu, X. Ren, X. Niu, P. Nie, Y. Xu, Y. Liu, Y. Wang, Y. Cai, Z. Gu, Z. Liu, and Z. Dai (2024) Yi: open foundation models by 01.ai. External Links: 2403.04652. Cited by: [§4.1](https://arxiv.org/html/2605.05668#S4.SS1.p1.1).
- J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. (2022) Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35, pp. 23716–23736. Cited by: [§1](https://arxiv.org/html/2605.05668#S1.p1.1).
- R. Ali, F. Caso, C. Irwin, and P. Liò (2025) Entropy-lens: the information signature of transformer computations. arXiv preprint arXiv:2502.16570. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p3.1).
- E. Ameisen, J. Lindsey, A. Pearce, W. Gurnee, N. L. Turner, B. Chen, C. Citro, D. Abrahams, S. Carter, B. Hosmer, et al. (2025) Circuit tracing: revealing computational graphs in language models. Transformer Circuits Thread 6. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p1.1).
- L. Basile, S. Acevedo, L. Bortolussi, F. Anselmi, and A. Rodriguez (2024) Intrinsic dimension correlation: uncovering nonlinear connections in multimodal representations. arXiv preprint arXiv:2406.15812. Cited by: [Assumption 3.1](https://arxiv.org/html/2605.05668#S3.Thmtheorem1.p1.1).
- L. Basile, V. Maiorca, D. Doimo, F. Locatello, and A. Cazzaniga (2025) Head pursuit: probing attention specialization in multimodal transformers. arXiv preprint arXiv:2510.21518. Cited by: [§C.3](https://arxiv.org/html/2605.05668#A3.SS3.SSS0.Px2.p1.10).
- S. Basu, M. Grayson, C. Morrison, B. Nushi, S. Feizi, and D. Massiceti (2024) Understanding information storage and transfer in multi-modal large language models. In Advances in Neural Information Processing Systems, Vol. 37, pp. 7400–7426. Cited by: [Appendix B](https://arxiv.org/html/2605.05668#A2.p2.1).
- Y. Belinkov and J. Glass (2019) Analysis methods in neural language processing: a survey. Transactions of the Association for Computational Linguistics 7, pp. 49–72. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p1.1).
- N. Belrose, Z. Furman, L. Smith, D. Halawi, I. Ostrovsky, L. McKinney, S. Biderman, and J. Steinhardt (2023) Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p1.1).
- Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: [§1](https://arxiv.org/html/2605.05668#S1.p4.1), [Assumption 3.1](https://arxiv.org/html/2605.05668#S3.Thmtheorem1).
- R. BT et al. (2011) Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios. International radio consultative committee international telecommunication union, Switzerland, CCIR Rep. Cited by: [§C.3](https://arxiv.org/html/2605.05668#A3.SS3.SSS0.Px1.p3.1).
- L. Chen, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, J. Wang, Y. Qiao, D. Lin, et al. (2024) Are we on the right way for evaluating large vision-language models?. Advances in Neural Information Processing Systems 37, pp. 27056–27087. Cited by: [§C.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1).
- A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni (2018) What you can cram into a single vector: probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p1.1).
- H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey (2023) Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p1.1).
- M. Deb and T. Ogunfunmi (2025) Information-theoretical analysis of a transformer-based generative ai model. Entropy 27 (6), pp. 589. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p3.1).
- X. Du, F. Mo, M. Wen, T. Gu, H. Zheng, H. Jin, and J. Shi (2025) Multi-turn jailbreaking large language models via attention shifting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 23814–23822. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p2.1).
- J. Dunefsky, P. Chlenski, and N. Nanda (2024) Transcoders find interpretable llm feature circuits. Advances in Neural Information Processing Systems 37, pp. 24375–24410. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p1.1).
- C. Eckart and G. Young (1936) The approximation of one matrix by another of lower rank. Psychometrika 1 (3), pp. 211–218. Cited by: [Theorem F.2](https://arxiv.org/html/2605.05668#A6.Thmtheorem2), [§3.2.1](https://arxiv.org/html/2605.05668#S3.SS2.SSS1.p1.2).
- E. Edelman, N. Tsilivis, B. Edelman, E. Malach, and S. Goel (2024) The evolution of statistical induction heads: in-context learning markov chains. Advances in neural information processing systems 37, pp. 64273–64311. Cited by: [§1](https://arxiv.org/html/2605.05668#S1.p2.1).
- N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, et al. (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread 1 (1), pp. 12. Cited by: [§1](https://arxiv.org/html/2605.05668#S1.p2.1), [§3.1.2](https://arxiv.org/html/2605.05668#S3.SS1.SSS2.p4.1).
- A. Elhelo and M. Geva (2025) Inferring functionality of attention heads from their parameters. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 17701–17733. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p2.1).
- J. Fang, H. Jiang, K. Wang, Y. Ma, J. Shi, X. Wang, X. He, and T. Chua (2025) AlphaEdit: null-space constrained knowledge editing for language models. In International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=HvSytvg3Jh). Cited by: [Appendix B](https://arxiv.org/html/2605.05668#A2.p2.1).
- Y. Gardinazzi, K. Viswanathan, G. Panerai, A. Ansuini, A. Cazzaniga, and M. Biagetti (2025) Persistent topological features in large language models. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025, External Links: [Link](https://openreview.net/forum?id=qAHnSkHvsm). Cited by: [Assumption 3.1](https://arxiv.org/html/2605.05668#S3.Thmtheorem1.p1.1).
- M. Geva, R. Schuster, J. Berant, and O. Levy (2021) Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495. Cited by: [Appendix B](https://arxiv.org/html/2605.05668#A2.p1.1), [§1](https://arxiv.org/html/2605.05668#S1.p2.1), [§2](https://arxiv.org/html/2605.05668#S2.p2.1), [§4](https://arxiv.org/html/2605.05668#S4.p1.1).
- G. H. Golub and C. F. Van Loan (2013) Matrix computations. JHU press. Cited by: [Definition 3.3](https://arxiv.org/html/2605.05668#S3.Thmtheorem3).
- Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the v in vqa matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913. Cited by: [§C.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1).
- T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y. Yacoob, D. Manocha, and T. Zhou (2024) HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14375–14385. Cited by: [§C.1](https://arxiv.org/html/2605.05668#A3.SS1.p9.1), [§4.1](https://arxiv.org/html/2605.05668#S4.SS1.p2.1).
- W. Guan, L. Li, J. Liu, B. Li, P. Fu, C. Fang, X. Hao, C. Ma, and W. Wang (2026) Mitigating overthinking in large reasoning language models via reasoning path deviation monitoring. arXiv preprint arXiv:2603.14251. Cited by: [§1](https://arxiv.org/html/2605.05668#S1.p3.1).
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: [§1](https://arxiv.org/html/2605.05668#S1.p1.1).
- D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham (2018) Vizwiz grand challenge: answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3608–3617. Cited by: [§C.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1).
- X. Hao, L. Zhou, Z. Huang, Z. Hou, Y. Tang, L. Zhang, G. Li, Z. Lu, S. Ren, X. Meng, et al. (2025) Mimo-embodied: x-embodied foundation model technical report. arXiv preprint arXiv:2511.16518. Cited by: [§1](https://arxiv.org/html/2605.05668#S1.p1.1).
- B. Hassibi, T. Kailath, and A. H. Sayed (2000) Linear estimation. Prentice Hall, Englewood Cliffs. Cited by: [Definition 3.5](https://arxiv.org/html/2605.05668#S3.Thmtheorem5.p1.5).
- A. Havrilla and W. Liao (2024) Understanding scaling laws with statistical and approximation theory for transformer neural networks on intrinsically low-dimensional data. Advances in Neural Information Processing Systems 37, pp. 42162–42210. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p3.1).
- J. Hewitt and C. D. Manning (2019) A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4129–4138. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p1.1).
- D. A. Hudson and C. D. Manning (2019) Gqa: a new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6700–6709. Cited by: [§C.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1).
- A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. (2024) Openai o1 system card. arXiv preprint arXiv:2412.16720. Cited by: [§1](https://arxiv.org/html/2605.05668#S1.p1.1).
- S. Jain and B. C. Wallace (2019) Attention is not explanation. arXiv preprint arXiv:1902.10186. Cited by: [Appendix B](https://arxiv.org/html/2605.05668#A2.p1.1), [§1](https://arxiv.org/html/2605.05668#S1.p3.1), [§2](https://arxiv.org/html/2605.05668#S2.p2.1).
- P. Kahardipraja, R. Achtibat, T. Wiegand, W. Samek, and S. Lapuschkin (2025) The atlas of in-context learning: how attention heads shape in-context retrieval augmentation. arXiv preprint arXiv:2505.15807. Cited by: [§1](https://arxiv.org/html/2605.05668#S1.p3.1), [§2](https://arxiv.org/html/2605.05668#S2.p2.1).
- S. Kang, J. Kim, J. Kim, and S. J. Hwang (2025) See what you are told: visual attention sink in large multimodal models. arXiv preprint arXiv:2503.03321. Cited by: [§1](https://arxiv.org/html/2605.05668#S1.p3.1), [§4](https://arxiv.org/html/2605.05668#S4.p1.1).
- A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi (2016) A diagram is worth a dozen images. In European conference on computer vision, pp. 235–251. Cited by: [§C.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1).
- J. Kim, S. Kang, J. Park, J. Kim, and S. J. Hwang (2025) Interpreting attention heads for image-to-text information flow in large vision-language models. arXiv preprint arXiv:2509.17588. Cited by: [§2](https://arxiv.org/html/2605.05668#S2.p2.1).
- Y. Kim, M. Yim, and K. Y. Song (2024) Tablevqa-bench: a visual question answering benchmark on multiple table domains. arXiv preprint arXiv:2404.19205. Cited by: [§C.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1).
- G\. Kobayashi, T\. Kuribayashi, S\. Yokoi, and K\. Inui \(2024\)Analyzing feed\-forward blocks in transformers through the lens of attention maps\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=mYWsyTuiRp)Cited by:[Appendix B](https://arxiv.org/html/2605.05668#A2.p2.1)\.
- Q\. Lai, Y\. Li, A\. Zeng, M\. Liu, H\. Sun, and Q\. Xu \(2021\)Information bottleneck approach to spatial attention learning\.arXiv preprint arXiv:2108\.03418\.Cited by:[§2](https://arxiv.org/html/2605.05668#S2.p3.1)\.
- B\. Li, Y\. Zhang, D\. Guo, R\. Zhang, F\. Li, H\. Zhang, K\. Zhang, P\. Zhang, Y\. Li, Z\. Liu,et al\.\(2024a\)Llava\-onevision: easy visual task transfer\.arXiv preprint arXiv:2408\.03326\.Cited by:[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p1.1)\.
- B\. Li, Y\. Ge, Y\. Ge, G\. Wang, R\. Wang, R\. Zhang, and Y\. Shan \(2024b\)Seed\-bench: benchmarking multimodal large language models\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 13299–13308\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1)\.
- J\. Li, D\. Li, S\. Savarese, and S\. Hoi \(2023a\)Blip\-2: bootstrapping language\-image pre\-training with frozen image encoders and large language models\.InInternational conference on machine learning,pp\. 19730–19742\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p1.1)\.
- W\. Li, R\. Tang, C\. Li, C\. Zhang, I\. Vulic, and A\. Søgaard \(2025\)Lost in embeddings: information loss in vision\-language models\.arXiv preprint arXiv:2509\.11986\.Cited by:[§2](https://arxiv.org/html/2605.05668#S2.p3.1)\.
- Y\. Li, Y\. Du, K\. Zhou, J\. Wang, X\. Zhao, and J\. Wen \(2023b\)Evaluating object hallucination in large vision\-language models\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 292–305\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.20),[Link](https://aclanthology.org/2023.emnlp-main.20/)Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p2.1),[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p2.1)\.
- T\. Lin, M\. Maire, S\. Belongie, L\. Bourdev, R\. Girshick, J\. Hays, P\. Perona, D\. Ramanan, P\. Dollár, and C\. L\. Zitnick \(2014\)Microsoft coco: common objects in context\.InComputer Vision – ECCV 2014,D\. J\. Fleet, T\. Pajdla, B\. Schiele, and T\. Tuytelaars \(Eds\.\),Lecture Notes in Computer Science, Vol\.8693,Cham,pp\. 740–755\.External Links:[Document](https://dx.doi.org/10.1007/978-3-319-10602-1%5F48),[Link](https://doi.org/10.1007/978-3-319-10602-1_48)Cited by:[Appendix E](https://arxiv.org/html/2605.05668#A5.SS0.SSS0.Px2.p1.2)\.
- C\. Liu, Z\. Xu, Q\. Wei, J\. Wu, J\. Zou, X\. E\. Wang, Y\. Zhou, and S\. Liu \(2025\)More thinking, less seeing? assessing amplified hallucination in multimodal reasoning models\.arXiv preprint arXiv:2505\.21523\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p3.1)\.
- H\. Liu, C\. Li, Y\. Li, and Y\. J\. Lee \(2024a\)Improved baselines with visual instruction tuning\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 26296–26306\.Cited by:[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p1.1)\.
- H\. Liu, C\. Li, Y\. Li, B\. Li, Y\. Zhang, S\. Shen, and Y\. J\. Lee \(2024b\)LLaVA\-next: improved reasoning, ocr, and world knowledge\.External Links:[Link](https://llava-vl.github.io/blog/2024-01-30-llava-next/)Cited by:[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p1.1)\.
- H\. Liu, C\. Li, Q\. Wu, and Y\. J\. Lee \(2023\)Visual instruction tuning\.Advances in neural information processing systems36,pp\. 34892–34916\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p1.1)\.
- Z\. Liu, Y\. Lin, Y\. Cao, H\. Hu, Y\. Wei, Z\. Zhang, S\. Lin, and B\. Guo \(2021\)Swin transformer: hierarchical vision transformer using shifted windows\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 10012–10022\.Cited by:[§C\.3](https://arxiv.org/html/2605.05668#A3.SS3.SSS0.Px1.p2.2)\.
- P\. Lu, H\. Bansal, T\. Xia, J\. Liu, C\. Li, H\. Hajishirzi, H\. Cheng, K\. Chang, M\. Galley, and J\. Gao \(2023\)MathVista: evaluating mathematical reasoning of foundation models in visual contexts\.arXiv preprint arXiv:2310\.02255\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1),[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p8.1),[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p2.1)\.
- P\. Lu, S\. Mishra, T\. Xia, L\. Qiu, K\. Chang, S\. Zhu, O\. Tafjord, P\. Clark, and A\. Kalyan \(2022\)Learn to explain: multimodal reasoning via thought chains for science question answering\.Advances in Neural Information Processing Systems35,pp\. 2507–2521\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1)\.
- W\. Ma, H\. Chen, G\. Zhang, Y\. Chou, J\. Chen, C\. de Melo, and A\. Yuille \(2025a\)3dsrbench: a comprehensive 3d spatial reasoning benchmark\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 6924–6934\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p3.1),[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p2.1)\.
- Y\. Ma, L\. Du, X\. Shen, S\. Chen, P\. Li, Q\. Ren, L\. Ma, Y\. Dai, P\. Liu, and J\. Yan \(2025b\)One rl to see them all: visual triple unified reinforcement learning\.arXiv preprint arXiv:2505\.18129\.Cited by:[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p1.1)\.
- K\. Marino, M\. Rastegari, A\. Farhadi, and R\. Mottaghi \(2019\)Ok\-vqa: a visual question answering benchmark requiring external knowledge\.InProceedings of the IEEE/cvf conference on computer vision and pattern recognition,pp\. 3195–3204\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1)\.
- A\. Masry, X\. L\. Do, J\. Q\. Tan, S\. Joty, and E\. Hoque \(2022\)Chartqa: a benchmark for question answering about charts with visual and logical reasoning\.InFindings of the association for computational linguistics: ACL 2022,pp\. 2263–2279\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1)\.
- M\. Mathew, V\. Bagal, R\. Tito, D\. Karatzas, E\. Valveny, and C\. Jawahar \(2022\)Infographicvqa\.InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision,pp\. 1697–1706\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1)\.
- M\. Mathew, D\. Karatzas, and C\. Jawahar \(2021\)Docvqa: a dataset for vqa on document images\.InProceedings of the IEEE/CVF winter conference on applications of computer vision,pp\. 2200–2209\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1)\.
- F\. Meng, L\. Du, Z\. Liu, Z\. Zhou, Q\. Lu, D\. Fu, B\. Shi, W\. Wang, J\. He, K\. Zhang, P\. Luo, Y\. Qiao, Q\. Zhang, and W\. Shao \(2025\)MM\-eureka: exploring visual aha moment with rule\-based large\-scale reinforcement learning\.External Links:2503\.07365,[Link](https://arxiv.org/abs/2503.07365)Cited by:[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p1.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\)Locating and editing factual associations in GPT\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 17359–17372\.Cited by:[Appendix B](https://arxiv.org/html/2605.05668#A2.p2.1)\.
- K\. Meng, A\. Sen Sharma, A\. Andonian, Y\. Belinkov, and D\. Bau \(2023\)Mass\-editing memory in a transformer\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=MkbcAHIYgyS)Cited by:[Appendix B](https://arxiv.org/html/2605.05668#A2.p2.1)\.
- L\. Ming, Y\. Li, S\. Chen, J\. Xu, Z\. Zhou, and W\. Chen \(2025\)Ocean\-r1: an open and generalizable large vision\-language model enhanced by reinforcement learning\.Note:[https://github\.com/VLM\-RL/Ocean\-R1](https://github.com/VLM-RL/Ocean-R1)Accessed: 2025\-04\-03Cited by:[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p1.1)\.
- A\. Mishra, S\. Shekhar, A\. K\. Singh, and A\. Chakraborty \(2019\)Ocr\-vqa: visual question answering by reading text in images\.In2019 international conference on document analysis and recognition \(ICDAR\),pp\. 947–952\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1)\.
- A\. Nam, H\. Conklin, Y\. Yang, T\. Griffiths, J\. Cohen, and S\. Leslie \(2025\)Causal head gating: a framework for interpreting roles of attention heads in transformers\.arXiv preprint arXiv:2505\.13737\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p3.1),[§2](https://arxiv.org/html/2605.05668#S2.p2.1)\.
- K\. Nishi, R\. Ramesh, M\. Okawa, M\. Khona, H\. Tanaka, and E\. S\. Lubana \(2025\)Representation shattering in transformers: A synthetic study with knowledge editing\.InForty\-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13\-19, 2025,External Links:[Link](https://openreview.net/forum?id=BKOeyZal0x)Cited by:[Assumption 3\.1](https://arxiv.org/html/2605.05668#S3.Thmtheorem1.p1.1)\.
- C\. Olsson, N\. Elhage, N\. Nanda, N\. Joseph, N\. DasSarma, T\. Henighan, B\. Mann, A\. Askell, Y\. Bai, A\. Chen,et al\.\(2022\)In\-context learning and induction heads\.arXiv preprint arXiv:2209\.11895\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p2.1)\.
- S\. Pertuz, D\. Puig, and M\. A\. Garcia \(2013\)Analysis of focus measure operators for shape\-from\-focus\.Pattern Recognition46\(5\),pp\. 1415–1432\.Cited by:[§C\.3](https://arxiv.org/html/2605.05668#A3.SS3.SSS0.Px1.p3.9)\.
- Z\. Qiu, Z\. Huang, Y\. Huang, and J\. Fu \(2024\)Empirical study on updating key\-value memories in transformer feed\-forward layers\.arXiv preprint arXiv:2402\.12233\.Cited by:[§2](https://arxiv.org/html/2605.05668#S2.p2.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark,et al\.\(2021\)Learning transferable visual models from natural language supervision\.InInternational conference on machine learning,pp\. 8748–8763\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p1.1)\.
- A\. Razzhigaev, M\. Mikhalchuk, E\. Goncharova, I\. Oseledets, D\. Dimitrov, and A\. Kuznetsov \(2024\)The shape of learning: anisotropy and intrinsic dimensions in transformer\-based models\.InFindings of the Association for Computational Linguistics: EACL 2024,pp\. 868–874\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p3.1)\.
- O\. Roy and M\. Vetterli \(2007\)The effective rank: a measure of effective dimensionality\.In2007 15th European signal processing conference,pp\. 606–610\.Cited by:[Definition 3\.4](https://arxiv.org/html/2605.05668#S3.Thmtheorem4)\.
- D\. Schwenk, A\. Khandelwal, C\. Clark, K\. Marino, and R\. Mottaghi \(2022\)A\-okvqa: a benchmark for visual question answering using world knowledge\.InEuropean conference on computer vision,pp\. 146–162\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1)\.
- S\. Serrano and N\. A\. Smith \(2019\)Is attention interpretable?\.arXiv preprint arXiv:1906\.03731\.Cited by:[Appendix B](https://arxiv.org/html/2605.05668#A2.p1.1),[§1](https://arxiv.org/html/2605.05668#S1.p3.1),[§2](https://arxiv.org/html/2605.05668#S2.p2.1)\.
- A\. Singh, V\. Natarajan, M\. Shah, Y\. Jiang, X\. Chen, D\. Batra, D\. Parikh, and M\. Rohrbach \(2019\)Towards vqa models that can read\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 8317–8326\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1)\.
- O\. Skean, M\. R\. Arefin, D\. Zhao, N\. Patel, J\. Naghiyev, Y\. LeCun, and R\. Shwartz\-Ziv \(2025\)Layer by layer: uncovering hidden representations in language models\.arXiv preprint arXiv:2502\.02013\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p2.1),[§1](https://arxiv.org/html/2605.05668#S1.p3.1),[§2](https://arxiv.org/html/2605.05668#S2.p3.1)\.
- J\. Su, M\. Ahmed, Y\. Lu, S\. Pan, W\. Bo, and Y\. Liu \(2024\)Roformer: enhanced transformer with rotary position embedding\.Neurocomputing568,pp\. 127063\.Cited by:[Definition 3\.6](https://arxiv.org/html/2605.05668#S3.Thmtheorem6.p1.9)\.
- H\. Tan, Y\. Ji, X\. Hao, M\. Lin, P\. Wang, Z\. Wang, and S\. Zhang \(2025\)Reason\-rft: reinforcement fine\-tuning for visual reasoning\.arXiv e\-prints,pp\. arXiv–2503\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p1.1)\.
- Q\. Team \(2025\)Qwen2\.5\-vl\.External Links:[Link](https://qwenlm.github.io/blog/qwen2.5-vl/)Cited by:[§C\.3](https://arxiv.org/html/2605.05668#A3.SS3.SSS0.Px1.p2.2),[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p1.1)\.
- Y\. Tian, Y\. Wang, Z\. Zhang, B\. Chen, and S\. Du \(2023\)Joma: demystifying multilayer transformers via joint dynamics of mlp and attention\.arXiv preprint arXiv:2310\.00535\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p3.1)\.
- B\. Vandereycken \(2013\)Low\-rank matrix completion by riemannian optimization\.SIAM Journal on Optimization23\(2\),pp\. 1214–1236\.External Links:[Document](https://dx.doi.org/10.1137/110845768)Cited by:[§3\.2](https://arxiv.org/html/2605.05668#S3.SS2.p2.5)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.Advances in neural information processing systems30\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p1.1),[§1](https://arxiv.org/html/2605.05668#S1.p2.1)\.
- Visheratin \(2024\)RealWorldQA\.Note:[https://huggingface\.co/datasets/visheratin/realworldqa](https://huggingface.co/datasets/visheratin/realworldqa)Accessed: 2025\-11\-21Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p4.1),[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1),[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p2.1)\.
- H\. Wang, J\. Zhang, and Q\. Ma \(2024a\)Exploring intrinsic dimension for vision\-language model pruning\.InForty\-first International Conference on Machine Learning,Cited by:[Assumption 3\.1](https://arxiv.org/html/2605.05668#S3.Thmtheorem1.p1.1)\.
- K\. Wang, J\. Pan, W\. Shi, Z\. Lu, H\. Ren, A\. Zhou, M\. Zhan, and H\. Li \(2024b\)Measuring multimodal mathematical reasoning with math\-vision dataset\.Advances in Neural Information Processing Systems37,pp\. 95095–95169\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p1.1)\.
- L\. Wei, Z\. Tan, C\. Li, J\. Wang, and W\. Huang \(2024\)Diff\-erank: a novel rank\-based metric for evaluating large language models\.Advances in Neural Information Processing Systems37,pp\. 39501–39521\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p3.1),[§2](https://arxiv.org/html/2605.05668#S2.p3.1)\.
- Z\. Wen, Y\. Gao, W\. Li, C\. He, and L\. Zhang \(2025\)Token pruning in multimodal large language models: are we solving the right problem?\.arXiv preprint arXiv:2502\.11501\.Cited by:[§4\.3](https://arxiv.org/html/2605.05668#S4.SS3.p3.1)\.
- S\. Wiegreffe and Y\. Pinter \(2019\)Attention is not not explanation\.arXiv preprint arXiv:1908\.04626\.Cited by:[Appendix B](https://arxiv.org/html/2605.05668#A2.p1.1),[§2](https://arxiv.org/html/2605.05668#S2.p2.1)\.
- T\. Wu, H\. Lee, J\. Ge, J\. E\. Gonzalez, T\. Darrell, and D\. M\. Chan \(2025\)Generate, but verify: reducing hallucination in vision\-language models with retrospective resampling\.arXiv preprint arXiv:2504\.13169\.Cited by:[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p1.1)\.
- Y\. Yao, N\. Zhang, Z\. Xi, M\. Wang, Z\. Xu, S\. Deng, and H\. Chen \(2024\)Knowledge circuits in pretrained transformers\.InAdvances in Neural Information Processing Systems,Vol\.37\.Cited by:[Appendix B](https://arxiv.org/html/2605.05668#A2.p2.1)\.
- K\. Yin and J\. Steinhardt \(2025\)Which attention heads matter for in\-context learning?\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 72428–72461\.External Links:[Link](https://proceedings.mlr.press/v267/yin25e.html)Cited by:[Appendix B](https://arxiv.org/html/2605.05668#A2.p2.1)\.
- W\. Yu, Z\. Yang, L\. Li, J\. Wang, K\. Lin, Z\. Liu, X\. Wang, and L\. Wang \(2023\)Mm\-vet: evaluating large multimodal models for integrated capabilities\.arXiv preprint arXiv:2308\.02490\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1)\.
- Y\. Yu, S\. Buchanan, D\. Pai, T\. Chu, Z\. Wu, S\. Tong, H\. Bai, Y\. Zhai, B\. D\. Haeffele, and Y\. Ma \(2024\)White\-box transformers via sparse rate reduction: compression is all there is?\.Journal of Machine Learning Research25\(300\),pp\. 1–128\.Cited by:[§2](https://arxiv.org/html/2605.05668#S2.p3.1)\.
- X\. Yue, G\. Qu, X\. Chen,et al\.\(2023\)MMMU: a massive multi\-discipline multimodal understanding and reasoning benchmark for expert AGI\.arXiv preprint arXiv:2311\.16502\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p5.1),[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1),[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p2.1)\.
- Q\. Zhang, A\. Cheng, M\. Lu, R\. Zhang, Z\. Zhuo, J\. Cao, S\. Guo, Q\. She, and S\. Zhang \(2025a\)Beyond text\-visual attention: exploiting visual cues for effective token pruning in vlms\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 20857–20867\.Cited by:[§4\.3](https://arxiv.org/html/2605.05668#S4.SS3.p3.1)\.
- S\. Zhang, X\. Hao, Y\. Tang, L\. Zhang, P\. Wang, Z\. Wang, H\. Ma, and S\. Zhang \(2025b\)Video\-cot: a comprehensive dataset for spatiotemporal understanding of videos based on chain\-of\-thought\.InProceedings of the 33rd ACM International Conference on Multimedia,pp\. 12745–12752\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p1.1)\.
- Y\. Zhang, Y\. Su, Y\. Liu, X\. Wang, J\. Burgess, E\. Sui, C\. Wang, J\. Aklilu, A\. Lozano, A\. Wei, L\. Schmidt, and S\. Yeung\-Levy \(2025c\)Automated generation of challenging multiple\-choice questions for vision language model evaluation\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 29580–29590\.Cited by:[§C\.1](https://arxiv.org/html/2605.05668#A3.SS1.p6.1),[Appendix D](https://arxiv.org/html/2605.05668#A4.p1.1),[§4\.1](https://arxiv.org/html/2605.05668#S4.SS1.p2.1)\.
- Z\. Zhou, H\. Yu, X\. Zhang, R\. Xu, F\. Huang, K\. Wang, Y\. Liu, J\. Fang, and Y\. Li \(2024\)On the role of attention heads in large language model safety\.arXiv preprint arXiv:2410\.13708\.Cited by:[§1](https://arxiv.org/html/2605.05668#S1.p3.1),[§2](https://arxiv.org/html/2605.05668#S2.p2.1)\.

## Appendix A Notations

We summarize the notation used throughout this paper in Table [3](https://arxiv.org/html/2605.05668#A1.T3)\.

Table 3: Notations\.

| Notation | Description |
| --- | --- |
| $\mathbf{X}\in\mathbb{R}^{S\times H}$ | Hidden\-state / residual\-stream representation matrix with token length $S$ and hidden size $H$ |
| $\mathbf{X}_{\text{new}},\,\mathbf{X}_{\text{old}},\,\Delta\mathbf{X}$ | Updated representation, pre\-update representation, and the additive residual update, $\mathbf{X}_{\text{new}}=\mathbf{X}_{\text{old}}+\Delta\mathbf{X}$ |
| $\mathbf{X}^{\,l}_{\mathrm{in}},\,\mathbf{X}^{\,l}_{\mathrm{attn}},\,\mathbf{X}^{\,l}_{\mathrm{ffn}}$ | Layer\-$l$ residual\-stream states: layer input, post\-attention state, and post\-FFN state |
| $\Delta\mathbf{X}^{\,l}_{\mathrm{attn}},\,\Delta\mathbf{X}^{\,l}_{\mathrm{ffn}}$ | Module\-wise residual updates at layer $l$: $\Delta\mathbf{X}^{\,l}_{\mathrm{attn}}=\mathbf{X}^{\,l}_{\mathrm{attn}}-\mathbf{X}^{\,l}_{\mathrm{in}}$, $\Delta\mathbf{X}^{\,l}_{\mathrm{ffn}}=\mathbf{X}^{\,l}_{\mathrm{ffn}}-\mathbf{X}^{\,l}_{\mathrm{attn}}$ |
| $\mathbf{X}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top}$ | Singular value decomposition \(SVD\) of $\mathbf{X}$ with orthonormal factors $\mathbf{U},\mathbf{V}$ and singular spectrum $\mathbf{\Sigma}$ |
| $\mathcal{I}(\mathbf{X})=\big(\mathcal{S}_{\mathbf{X}},\,\mathcal{D}_{\mathbf{X}}\big)$ | Representation information, decomposed into spectrum complexity $\mathcal{S}_{\mathbf{X}}$ and support $\mathcal{D}_{\mathbf{X}}$ |
| $\mathcal{C}(\mathbf{X}),\,\mathcal{R}(\mathbf{X})$ | Column space and row space of $\mathbf{X}$ \(Grassmann points\) |
| $\mathrm{span}(\mathbf{U}),\,\mathrm{span}(\mathbf{V})$ | Left and right singular subspaces induced by SVD factors $\mathbf{U}$ and $\mathbf{V}$ |
| $\mathbf{P}_{\mathcal{U}}$ | Orthogonal projector onto a subspace $\mathcal{U}$ |
| $\Delta\mathcal{S}(\mathbf{X}\mid\mathbf{X}^{\prime}),\,\Delta\mathcal{D}(\mathbf{X}\mid\mathbf{X}^{\prime})$ | Spectrum change and support innovation when transitioning from $\mathbf{X}$ to $\mathbf{X}^{\prime}$ |
| $\mathrm{RID}(\mathbf{X}\mid\mathbf{X}^{\prime})$ | Representation Information Discrepancy, measuring update\-induced innovation via spectrum change plus support innovation |
| $\mathrm{TME}(\mathbf{X})$ | Token Mixing Entropy, an entropy\-based measure of token\-to\-token mixing in $\mathbf{X}$ |
| $\mathrm{MixIG}(\mathbf{X}\mid\mathbf{X}^{\prime})$ | Mixing Information Gain, defined as $\mathrm{TME}(\mathbf{X}^{\prime})-\mathrm{TME}(\mathbf{X})$ to quantify reconfiguration |
| $\mathbf{Q},\,\mathbf{K},\,\mathbf{V},\,\mathbf{A}$ | Query, key, value, and attention weights \(attention distribution / matrix\) |
| $\mathbf{X}^{\text{(rope)}}_{\mathrm{in}},\,\mathbf{X}^{\text{(no-rope)}}_{\mathrm{in}}$ | Layer\-input representations with RoPE positional encoding enabled vs\. disabled \(for calibrating intrinsic discrepancy\) |
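As a rough illustration of how the objects in Table 3 fit together, the NumPy sketch below computes the effective rank of a representation matrix \(Roy and Vetterli, 2007\), builds an orthogonal projector onto its row space from the right singular vectors, and measures how much of a residual update $\Delta\mathbf{X}$ falls outside the row space of $\mathbf{X}_{\text{old}}$\. The function names, the tolerance, and the out\-of\-subspace fraction are illustrative assumptions; they are not the definitions of RID or MixIG, which are given in the main text\.

```python
import numpy as np

def effective_rank(X: np.ndarray) -> float:
    """Effective rank: exp of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                      # drop exact zeros so log is defined
    return float(np.exp(-(p * np.log(p)).sum()))

def row_space_projector(X: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Orthogonal projector onto the row space of X, i.e. onto span(V)."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[s > tol * s.max()].T       # H x r orthonormal basis (tol is an assumption)
    return V @ V.T

def out_of_subspace_fraction(X_old: np.ndarray, dX: np.ndarray) -> float:
    """Illustrative proxy only: share of update energy outside the row space of X_old."""
    P = row_space_projector(X_old)
    residual = dX - dX @ P
    return float(np.linalg.norm(residual) ** 2 / np.linalg.norm(dX) ** 2)

rng = np.random.default_rng(0)
S, H, r = 16, 32, 4
X_old = rng.normal(size=(S, r)) @ rng.normal(size=(r, H))   # rank-r residual stream

# A pure token-mixing update (rows are combinations of existing rows)
# stays inside the row space of X_old.
dX_inside = rng.normal(size=(S, S)) @ X_old
# An unconstrained update mostly leaves that subspace.
dX_outside = rng.normal(size=(S, H))

X_new = X_old + dX_outside            # X_new = X_old + dX, as in Table 3
```

In this toy setting the token\-mixing update leaves the support unchanged, whereas the unconstrained update adds new directions, which mirrors the preserving versus expanding distinction only under these simplified assumptions \(value and output projections are ignored\)\.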

## Appendix B Comparison with Prior Work

Prior module\-level interpretability studies have largely relied on attribution, tracing, or component\-specific functional analyses\. For attention, this line of work examines whether attention weights faithfully explain predictions, how attention\-mediated influence propagates across layers, or which heads implement specific functions \(Jain and Wallace, [2019](https://arxiv.org/html/2605.05668#bib.bib12); Serrano and Smith, [2019](https://arxiv.org/html/2605.05668#bib.bib13); Wiegreffe and Pinter, [2019](https://arxiv.org/html/2605.05668#bib.bib25); Abnar and Zuidema, [2020](https://arxiv.org/html/2605.05668#bib.bib98)\)\. For FFNs, prior work shows that feed\-forward layers can behave as key–value memories that associate textual patterns with output distributions \(Geva et al\., [2021](https://arxiv.org/html/2605.05668#bib.bib10)\)\. These approaches are valuable for localizing where a behavior or stored pattern appears\. In contrast, our framework asks a different question: how does each module transform the shared residual stream? We therefore characterize updates at the representation level through innovation and reconfiguration, rather than assigning a behavior to a specific token, head, neuron, or memory slot\.

This difference also changes the diagnostic perspective\. Prior methods are often strongest at identifying what function is present in a model, such as token attribution, head functionality, or stored associations\. For example, causal tracing and editing methods localize factual associations in feed\-forward modules \(Meng et al\., [2022](https://arxiv.org/html/2605.05668#bib.bib106), [2023](https://arxiv.org/html/2605.05668#bib.bib107); Fang et al\., [2025](https://arxiv.org/html/2605.05668#bib.bib111)\), circuit analyses identify knowledge\-related pathways \(Yao et al\., [2024](https://arxiv.org/html/2605.05668#bib.bib110)\), and head\-level studies characterize which attention heads matter for in\-context learning \(Yin and Steinhardt, [2025](https://arxiv.org/html/2605.05668#bib.bib109)\)\. In multimodal settings, related work further studies where visual and textual information is stored and transferred across MLLM components \(Basu et al\., [2024](https://arxiv.org/html/2605.05668#bib.bib105)\), while FFN analyses examine how feed\-forward blocks reshape contextualization patterns \(Kobayashi et al\., [2024](https://arxiv.org/html/2605.05668#bib.bib108)\)\. Our framework instead diagnoses what is insufficient or excessive in a residual update itself\. RID asks whether a module injects new representational structure through spectral or subspace change, while MixIG asks whether a module meaningfully redistributes token\-level information\. Thus, innovation and reconfiguration are not direct substitutes for memory, retrieval, or attribution; they are update\-level properties of the residual stream\. This makes the analysis actionable, because weak innovation or weak reconfiguration can be directly linked to a module, layer, or intervention target\.

The resulting conclusions are therefore complementary to prior work rather than redundant with it\. For FFNs, memory\-based interpretations explain how parameters can store and retrieve patterns, whereas our analysis measures how the FFN update changes representation geometry regardless of whether the source is parametric memory or contextual computation\. For attention, circuit\-level studies explain what algorithms attention can implement, whereas our claim concerns the visual side of current LVLM decoders: many attention updates show limited useful visual reconfiguration, and their score computation can often be replaced by simple priors without harming performance\. In this sense, our work shifts the focus from identifying existing functions to diagnosing residual\-stream deficiencies, revealing that current LVLMs do not consistently convert expensive visual attention scoring into necessary output\-discriminative information flow\.
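To make the idea of replacing score computation with a simple prior concrete, the following single\-head NumPy sketch swaps the learned scores $\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d}$ for Gaussian noise before the causal softmax\. It is a minimal sketch under assumptions: the function names, the unit\-variance noise, and applying the replacement to every token are ours, and the exact intervention protocol \(which layers, heads, and tokens are affected\) is not specified in this appendix\.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(Q, K, V, scores=None):
    """Single-head causal attention.

    If `scores` is None, the usual scaled dot-product scores are used;
    otherwise the supplied (S x S) matrix replaces them (hypothetical hook).
    """
    S, d = Q.shape
    if scores is None:
        scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones((S, S), dtype=bool), k=1)   # positions j > i
    A = softmax(np.where(future, -np.inf, scores))        # attention weights A
    return A @ V, A

rng = np.random.default_rng(0)
S, d = 6, 8
Q, K, V = (rng.normal(size=(S, d)) for _ in range(3))

out_learned, A_learned = causal_attention(Q, K, V)
# Predefined prior: scores drawn from a standard Gaussian (an assumption).
out_prior, A_prior = causal_attention(Q, K, V, scores=rng.normal(size=(S, S)))
```

Both variants produce a valid causal attention distribution; only the source of the scores differs, so the value path and the rest of the residual update are left untouched\.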

## Appendix C Details

### C\.1 Dataset Details

Our experiments are conducted on a suite of benchmarks that probe complementary capabilities, spanning basic visual perception through advanced multimodal reasoning and robustness, including 3D and spatial reasoning, real\-world question answering, multidisciplinary knowledge, general\-purpose multimodal understanding, mathematical reasoning, and hallucination\-related robustness\. Detailed descriptions are provided below\.

POPE \(Li et al\., [2023b](https://arxiv.org/html/2605.05668#bib.bib59)\)\. POPE is a diagnostic benchmark for *object hallucination* in LVLMs\. It contains 9,000 questions split into three complementary subsets \(random, popular, adversarial\) to stress different hallucination modes\. We conduct the experiments in Section [4\.2](https://arxiv.org/html/2605.05668#S4.SS2) on POPE\.
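Because POPE questions are answered with yes or no, scoring reduces to binary classification\. The sketch below computes accuracy, precision, recall, F1, and the yes\-ratio from predicted and gold answers\. The function name and the toy answers are hypothetical, and this appendix does not state which of these metrics is reported, so treat the choice as an assumption\.

```python
def pope_metrics(preds, labels):
    """Binary metrics for yes/no answers, treating "yes" as the positive class."""
    pairs = list(zip(preds, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),   # share of "yes" predictions
    }

# Toy example (made-up answers, not benchmark data).
preds = ["yes", "yes", "no", "no", "yes"]
labels = ["yes", "no", "no", "yes", "yes"]
metrics = pope_metrics(preds, labels)
```

A yes\-ratio far from the share of positive labels is a common sign that a model is biased toward affirming object presence\.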

3DSRBench \(Ma et al\., [2025a](https://arxiv.org/html/2605.05668#bib.bib60)\)\. 3DSRBench targets *3D and spatial reasoning* by evaluating whether a model can infer geometric relations beyond surface\-level recognition\. It includes 1,500 visual QA problems spanning diverse 3D reasoning skills \(e\.g\., relative depth, viewpoint\-dependent relations, and compositional spatial constraints\)\. The dataset is intended to separate “seeing” from “reasoning in 3D space” under multimodal inputs\.

RealWorldQA \(Visheratin, [2024](https://arxiv.org/html/2605.05668#bib.bib61)\)\. RealWorldQA evaluates *real\-world visual question answering* on everyday imagery, emphasizing practical robustness rather than curated or synthetic settings\. It contains 765 real\-world images paired with questions, covering varied scenes and conditions that commonly challenge LVLM grounding\.

MMMU \(Yue et al\., [2023](https://arxiv.org/html/2605.05668#bib.bib62)\)\. MMMU is a large\-scale benchmark for *multidisciplinary multimodal understanding and reasoning*, spanning many academic domains\. It contains 11,500\+ questions across 30 subjects, covering both knowledge\-intensive understanding and higher\-level reasoning with visual inputs\. Because evaluation on the full test set is restricted, we follow the widely adopted protocol in prior work and conduct our experiments on the validation split \(900 samples\)\.

VMC\-Bench \(Zhang et al\., [2025c](https://arxiv.org/html/2605.05668#bib.bib63)\)\. VMC\-Bench evaluates *general multimodal understanding* with an emphasis on challenging, automatically constructed multiple\-choice questions\. It transforms 20 widely used VQA datasets into a unified multiple\-choice benchmark\. These datasets can be broadly grouped by the capability they assess: general capabilities of VLMs \(VQAv2 \(Goyal et al\., [2017](https://arxiv.org/html/2605.05668#bib.bib81)\), OKVQA \(Marino et al\., [2019](https://arxiv.org/html/2605.05668#bib.bib82)\), MMVet \(Yu et al\., [2023](https://arxiv.org/html/2605.05668#bib.bib83)\), VizWiz \(Gurari et al\., [2018](https://arxiv.org/html/2605.05668#bib.bib84)\), A\-OKVQA \(Schwenk et al\., [2022](https://arxiv.org/html/2605.05668#bib.bib85)\), MMStar \(Chen et al\., [2024](https://arxiv.org/html/2605.05668#bib.bib86)\), SEEDBench \(Li et al\., [2024b](https://arxiv.org/html/2605.05668#bib.bib87)\)\), reasoning capabilities \(MathVision \(Wang et al\., [2024b](https://arxiv.org/html/2605.05668#bib.bib88)\), GQA \(Hudson and Manning, [2019](https://arxiv.org/html/2605.05668#bib.bib89)\), MMMU \(Yue et al\., [2023](https://arxiv.org/html/2605.05668#bib.bib62)\), RealWorldQA \(Visheratin, [2024](https://arxiv.org/html/2605.05668#bib.bib61)\), MathVista \(Lu et al\., [2023](https://arxiv.org/html/2605.05668#bib.bib64)\), ScienceQA \(Lu et al\., [2022](https://arxiv.org/html/2605.05668#bib.bib90)\)\), OCR tasks \(OCRVQA \(Mishra et al\., [2019](https://arxiv.org/html/2605.05668#bib.bib91)\), TextVQA \(Singh et al\., [2019](https://arxiv.org/html/2605.05668#bib.bib92)\)\), and document and chart understanding \(DocVQA \(Mathew et al\., [2021](https://arxiv.org/html/2605.05668#bib.bib93)\), InfoVQA \(Mathew et al\., [2022](https://arxiv.org/html/2605.05668#bib.bib94)\), ChartQA \(Masry et al\., [2022](https://arxiv.org/html/2605.05668#bib.bib95)\), TableVQABench \(Kim et al\., [2024](https://arxiv.org/html/2605.05668#bib.bib96)\), AI2D \(Kembhavi et al\., [2016](https://arxiv.org/html/2605.05668#bib.bib97)\)\)\.

VMC\-Bench contains 9,018 questions and is used to stress\-test model discrimination among closely competing options\.

MathVista \(Lu et al\., [2023](https://arxiv.org/html/2605.05668#bib.bib64)\)\. MathVista focuses on *visual mathematical reasoning*, requiring models to combine perception \(reading diagrams, charts, or scenes\) with mathematical problem solving\. It contains 5,141 QA instances covering a wide range of math\-reasoning skills grounded in visual context\. Because the official MathVista test evaluation is not publicly available, we conduct our experiments on the testmini split \(1,000 samples\)\.

HallusionBench \(Guan et al\., [2024](https://arxiv.org/html/2605.05668#bib.bib65)\)\. HallusionBench is a targeted benchmark for *hallucination\-related robustness*, separating failures caused by visual misperception \(illusion\-like cases\) from those caused by language priors\. It contains 1,129 image–question pairs constructed to systematically elicit hallucination behaviors under controlled conditions\.

### C\.2 Experimental Details

Dataset settings\. For each benchmark, we follow a consistent evaluation protocol across all models\. Specifically, we feed every image–question pair from the dataset to the model under the same input formatting and inference configuration, and compute the corresponding task metric using the official evaluation script whenever available\.

Model settings\. Within each model category, we adopt a unified inference setup to ensure fair comparison\. We group the evaluated LVLMs into three categories\.

\(i\) General\-purpose LVLMs\. This category includes Qwen\-2\.5\-VL, LLaVA\-1\.5, Yi, LLaVA\-OneVision, Mistral\-1\.6, and Vicuna\-1\.6\. For these models, we directly input the dataset image–question pair using their default chat templates\.

\(ii\) Vision\-query optimized LVLMs\. This category includes Reverse and CoF\. For these models, we follow the inference and prompting settings specified in their respective papers to reproduce their intended evaluation protocol\.

\(iii\) Reasoning\-oriented LVLMs\. This category includes MM\-Eureka, Orsta, and Ocean\-R1\. For these models, we append an explicit reasoning trigger to encourage open\-ended deliberation, and extract the final prediction from the `<answer>` tags in the generated output\.

Reasoning\-trigger prompt: You FIRST think about the reasoning process as an internal monologue and then provide the final answer\. The reasoning process MUST BE enclosed within `<think>` `</think>` tags\. The final answer MUST BE in `<answer>` `</answer>` tags\.

Generation hyperparameters\. We use the following decoding parameters for all experiments, and keep all unspecified options at their default values:

`max_new_tokens = 1024`, `output_attentions = True`, `return_dict_in_generate = True`\.

Evaluation details\. We follow the official evaluation protocols of each dataset and report *accuracy* as the primary metric\. For open\-ended outputs \(e\.g\., from reasoning\-style models\), we parse the model’s prediction from the content enclosed by the `<answer>` tags and use it as the final answer for scoring\.
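
As a minimal sketch of the answer\-extraction step for the reasoning\-oriented models, the final prediction can be read from the `<answer>` tags with a regular expression\. The function name `extract_answer` and the choice to keep the last tagged span are illustrative assumptions, not details taken from the evaluation code\.

```python
import re
from typing import Optional

# Non-greedy match so that several <answer> spans are captured separately.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_answer(generated: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> span, or None if absent.

    Keeping the last span is an assumption made for this sketch.
    """
    matches = ANSWER_RE.findall(generated)
    if not matches:
        return None
    return matches[-1].strip()
```

Returning `None` when no tag is found leaves the scoring policy for malformed outputs to the caller\.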

### C\.3 SAP Details

This appendix provides implementation details for the SAP intervention in Sec\. [4\.3](https://arxiv.org/html/2605.05668#S4.SS3), including \(i\) the three SAP modes and \(ii\) how we select affected layers and heads for each architecture\.

##### SAP modes\.

Shared Attention Prior \(SAP\) replaces the original attention scores with a lightweight prior that is computed once per input and then shared across selected layers and heads, requiring substantially less computation than per\-layer score estimation\. We instantiate three SAP modes:

*\(i\) Visual\-encoder attention\.* Since the vision encoder is trained with vision\-centric objectives \(e\.g\., hierarchical vision encoders such as Swin Transformer \(Liu et al\., [2021](https://arxiv.org/html/2605.05668#bib.bib71)\)\), we replace decoder attention scores with the last\-layer attention maps from the visual encoder as a natural alignment prior\. Note that the visual tokens used by the decoder may be merged relative to the encoder output \(e\.g\., `spatial_merge_size=2` in Qwen\-style encoders \(Team, [2025](https://arxiv.org/html/2605.05668#bib.bib49)\)\), so we align resolutions by average pooling the encoder attention over each $m \times m$ merged block \(with $m = \texttt{spatial\_merge\_size}$\) before substitution\.

*\(ii\) Patch complexity\.* We compute a low\-cost patch prior from the input image using the decoder patch size\. For each patch $p$, we first convert RGB to grayscale \(BT and others, [2011](https://arxiv.org/html/2605.05668#bib.bib100)\):

$$g(u,v) = 0.299\,R(u,v) + 0.587\,G(u,v) + 0.114\,B(u,v),$$

then define an efficient gradient\-magnitude statistic via mean absolute finite differences:

$$G_{x}(p) = \frac{1}{HW^{\prime}} \sum_{u=1}^{H} \sum_{v=1}^{W-1} \big| g(u,v+1) - g(u,v) \big|, \quad G_{y}(p) = \frac{1}{H^{\prime}W} \sum_{u=1}^{H-1} \sum_{v=1}^{W} \big| g(u+1,v) - g(u,v) \big|,$$

$$\mathrm{grad}(p) = G_{x}(p) + G_{y}(p), \qquad \mathrm{var}(p) = \mathrm{Var}\big(g(u,v)\big), \qquad c(p) = \mathrm{grad}(p) + \mathrm{var}(p).$$

Here $H$ and $W$ denote the patch height and width \(in pixels\), and we set $H^{\prime} = H-1$ and $W^{\prime} = W-1$ to match the valid ranges of the finite differences\. Intuitively, $\mathrm{grad}(p)$ summarizes local edge strength within the patch \(Pertuz et al\., [2013](https://arxiv.org/html/2605.05668#bib.bib99)\), while $\mathrm{var}(p)$ measures within\-patch intensity dispersion; we combine them as $c(p)$ and use $\{c(p)\}$ as a patch\-wise attention prior\.

*\(iii\) Noise\.* We directly sample a Gaussian tensor with the same shape as the target attention scores and substitute it as the prior\.
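
The first two modes can be sketched in plain Python as follows\. This is an illustration under stated assumptions rather than the released implementation: the encoder attention is assumed to be already reshaped into a 2D grid over patches, a patch is given as rows of `(R, G, B)` tuples with at least 2 pixels per side, and `Var` is taken to be the population variance\. The function names `avg_pool_blocks` and `patch_complexity` are hypothetical\.

```python
from statistics import pvariance


def avg_pool_blocks(attn, m):
    """Average-pool a 2D grid (list of rows) over non-overlapping m x m blocks."""
    h, w = len(attn), len(attn[0])
    assert h % m == 0 and w % m == 0, "grid must be divisible by m"
    pooled = []
    for r in range(0, h, m):
        row = []
        for c in range(0, w, m):
            block = [attn[r + dr][c + dc] for dr in range(m) for dc in range(m)]
            row.append(sum(block) / (m * m))
        pooled.append(row)
    return pooled


def patch_complexity(rgb):
    """Return c(p) = grad(p) + var(p) for one patch of (R, G, B) pixels."""
    # Grayscale conversion with the 0.299 / 0.587 / 0.114 weights.
    g = [[0.299 * r + 0.587 * gr + 0.114 * b for (r, gr, b) in row] for row in rgb]
    H, W = len(g), len(g[0])
    # Mean absolute horizontal differences, normalized by H * (W - 1).
    gx = sum(abs(g[u][v + 1] - g[u][v]) for u in range(H) for v in range(W - 1))
    gx /= H * (W - 1)
    # Mean absolute vertical differences, normalized by (H - 1) * W.
    gy = sum(abs(g[u + 1][v] - g[u][v]) for u in range(H - 1) for v in range(W))
    gy /= (H - 1) * W
    # Population variance of the grayscale values (an assumption of this sketch).
    var = pvariance([x for row in g for x in row])
    return gx + gy + var
```

With `m = 2`, `avg_pool_blocks` maps a 4 x 4 encoder grid to the 2 x 2 grid seen by the decoder, and a flat patch yields a complexity of zero while a patch containing an edge yields a positive value\.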

##### Selecting layers and heads\.

We choose affected layers and heads via ablations \(see Appendix [D](https://arxiv.org/html/2605.05668#A4)\) for each architecture\. Layers are selected by depth order \(contiguous ranges\), while heads are selected by ranking their *non\-visual* attention mass\. Concretely, let $A^{l,h}_{b,t,i}$ denote the normalized attention weight at layer $l$, head $h$, batch item $b$, for query position $t$ over key position $i$\. Let the visual\-token span be $[v_{\text{start}}, v_{\text{end}})$\. We define the non\-visual index set

$$\mathcal{N} \;=\; \{1,\dots,v_{\text{start}}-1\} \,\cup\, \{v_{\text{end}},\dots,S_{c}\}.$$

Using the last query position $t=-1$ \(the current decoding step\), we score each head by the negative mean non\-visual mass:

$$s_{h} \;=\; -\,\frac{1}{B\,|\mathcal{N}|} \sum_{b=1}^{B} \sum_{i\in\mathcal{N}} A^{l,h}_{b,-1,i}.$$

We rank heads by $s_{h}$ and select a percentile band \(e\.g\., $[0.0, 0.3]$\) per chosen layer; for heads within this band, we replace their attention scores with the shared SAP prior\. Our head\-selection strategy is motivated by the empirically supported *head specialization* hypothesis in multimodal Transformers: different heads and layers tend to preferentially route modality\-specific signals \(e\.g\., visual vs\. textual attributes\) \(Basile et al\., [2025](https://arxiv.org/html/2605.05668#bib.bib72)\)\. To better decouple visual interactions from text\-dominated routing effects, we rank heads by their *non\-visual* attention mass and intervene on a chosen percentile range, so that the replacement primarily targets heads that allocate relatively less probability to non\-visual tokens\.
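
A rough sketch of this scoring and selection step, for a single layer, is given below\. It makes several assumptions that the text does not fix: key positions are 0\-based, the visual span is the half\-open range `[v_start, v_end)`, heads are ranked in ascending order of $s_{h}$ \(so the lowest percentiles hold the heads with the most non\-visual mass\), and band edges are rounded to the nearest head index\. The names `head_scores` and `select_band` are hypothetical\.

```python
def head_scores(attn, v_start, v_end):
    """Score each head by the negative mean attention mass on non-visual keys.

    attn[b][h][i] is the last-query attention weight for batch item b,
    head h, and key position i (0-based).
    """
    n_batch, n_heads, seq_len = len(attn), len(attn[0]), len(attn[0][0])
    non_visual = [i for i in range(seq_len) if not (v_start <= i < v_end)]
    scores = []
    for h in range(n_heads):
        mass = sum(attn[b][h][i] for b in range(n_batch) for i in non_visual)
        scores.append(-mass / (n_batch * len(non_visual)))
    return scores


def select_band(scores, lo, hi):
    """Return the head indices whose rank falls in the percentile band [lo, hi]."""
    order = sorted(range(len(scores)), key=lambda h: scores[h])  # ascending s_h
    n = len(order)
    return sorted(order[round(lo * n): round(hi * n)])
```

For example, `select_band(head_scores(attn, v_start, v_end), 0.3, 0.6)` would return the heads in the middle band for one layer; the intervention itself \(overwriting their scores with the shared prior\) is not shown\.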

## Appendix D Additional Results

In this section, we use VMC\-Bench \(Zhang et al\., [2025c](https://arxiv.org/html/2605.05668#bib.bib63)\), which provides a comprehensive evaluation of LVLMs along five dimensions: General, Reasoning, OCR, Math, and Doc&Chart\.

### D\.1 Ablation Studies for SAP

We conduct ablations across all models\. Because the mode ablation is already reported in Table [2](https://arxiv.org/html/2605.05668#S4.T2), we focus here on ablating \(i\) the affected layers and \(ii\) the affected heads\.

Table 4: Ablation Study on Attention Heads \(Part I\): Evaluation of General Perception and Reasoning Capabilities across Different Parameter Settings\. The default configuration for each model is highlighted in bold red\. Columns VQAv2 through SEED belong to the General group; SciQA through GQA belong to the Reasoning group\.

| Model | Heads | VQAv2 | VizWiz | OKVQA | MMVet | A-OKVQA | MMStar | SEED | SciQA | RWQA | MMMU | GQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-VL-7B | [0.0, 0.3] | 83.56 | 82.60 | 84.94 | 71.22 | 78.82 | 59.86 | 74.81 | 80.32 | 53.21 | 54.09 | 81.42 |
| | [0.3, 0.6] | 90.51 | 87.99 | 90.12 | 73.38 | 86.35 | 63.18 | 79.01 | 83.48 | 59.40 | 55.77 | 85.57 |
| | [0.6, 0.9] | 89.35 | 87.50 | 89.63 | 71.94 | 84.24 | 62.00 | 78.02 | 84.39 | 57.34 | 55.53 | 83.62 |
| | [0.2, 0.8] | 89.58 | 86.76 | 88.64 | 72.66 | 84.00 | 61.28 | 79.75 | 84.16 | 58.72 | 53.12 | 84.84 |
| | [0.0, 1.0] | 84.72 | 87.50 | 84.94 | 75.54 | 79.76 | 57.48 | 76.05 | 81.00 | 55.96 | 50.96 | 79.71 |
| LLaVA-1.5-7B | [0.0, 0.3] | 67.13 | 64.22 | 74.57 | 46.76 | 67.76 | 34.44 | 56.54 | 56.56 | 37.61 | 36.54 | 64.06 |
| | [0.3, 0.6] | 71.30 | 66.91 | 80.25 | 55.40 | 71.06 | 35.63 | 60.99 | 59.73 | 36.93 | 36.54 | 69.19 |
| | [0.6, 0.9] | 68.52 | 67.16 | 79.01 | 49.64 | 68.24 | 32.78 | 59.01 | 58.60 | 36.70 | 34.62 | 67.24 |
| | [0.2, 0.8] | 71.30 | 72.06 | 81.73 | 53.24 | 73.41 | 38.24 | 63.46 | 56.79 | 36.24 | 35.10 | 70.17 |
| | [0.0, 1.0] | 66.44 | 67.89 | 76.05 | 46.76 | 66.59 | 31.59 | 53.58 | 50.68 | 36.47 | 33.17 | 60.88 |
| LLaVA-OV-7B | [0.0, 0.3] | 51.16 | 57.11 | 62.96 | 47.48 | 58.35 | 41.57 | 51.36 | 51.81 | 42.66 | 35.34 | 55.75 |
| | [0.3, 0.6] | 83.33 | 85.29 | 88.15 | 68.35 | 85.88 | 52.97 | 77.04 | 81.67 | 54.13 | 43.27 | 84.60 |
| | [0.6, 0.9] | 83.80 | 84.56 | 87.16 | 67.63 | 86.82 | 52.02 | 76.54 | 81.00 | 55.96 | 42.31 | 84.60 |
| | [0.2, 0.8] | 84.72 | 83.82 | 84.69 | 64.75 | 86.82 | 49.17 | 79.26 | 77.60 | 56.42 | 41.83 | 82.89 |
| | [0.0, 1.0] | 31.71 | 31.37 | 31.11 | 23.02 | 31.06 | 28.98 | 29.38 | 32.81 | 29.36 | 29.09 | 31.05 |

Table 5: Ablation Study on Attention Heads \(Part II\): Performance Analysis on Document/Chart and OCR Tasks\. The default configurations are highlighted in bold red\. AVG represents the overall average across all benchmarks, including Table [4](https://arxiv.org/html/2605.05668#A4.T4)\. Columns Vista and Vision belong to the Math group; DocVQA through AI2D belong to the Doc & Chart group; TextVQA and OCRVQA belong to the OCR group\.

| Model | Heads | Vista | Vision | DocVQA | Table | ChartQA | InfoVQA | AI2D | TextVQA | OCRVQA | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-VL-7B | [0.0, 0.3] | 51.49 | 32.36 | 72.61 | 66.22 | 74.77 | 55.53 | 72.89 | 93.03 | 83.16 | 70.35 |
| | [0.3, 0.6] | 53.96 | 34.38 | 77.06 | 72.07 | 79.59 | 59.22 | 77.22 | 95.28 | 91.71 | 74.76 |
| | [0.6, 0.9] | 53.47 | 33.48 | 75.95 | 70.72 | 78.67 | 57.83 | 77.68 | 95.73 | 94.04 | 74.06 |
| | [0.2, 0.8] | 53.96 | 33.26 | 72.61 | 70.27 | 79.36 | 58.76 | 77.45 | 93.93 | 94.04 | 73.86 |
| | [0.0, 1.0] | 55.45 | 32.13 | 69.27 | 62.39 | 74.77 | 53.92 | 73.58 | 91.69 | 92.75 | 70.98 |
| LLaVA-1.5-7B | [0.0, 0.3] | 22.77 | 25.39 | 34.30 | 25.68 | 26.61 | 30.41 | 43.28 | 55.96 | 65.03 | 46.78 |
| | [0.3, 0.6] | 25.74 | 28.54 | 37.19 | 29.05 | 32.34 | 29.95 | 41.91 | 61.35 | 67.36 | 49.87 |
| | [0.6, 0.9] | 29.70 | 26.97 | 38.31 | 27.93 | 30.28 | 29.26 | 43.96 | 61.12 | 68.91 | 48.90 |
| | [0.2, 0.8] | 25.25 | 26.97 | 38.75 | 29.50 | 32.80 | 31.80 | 42.82 | 63.37 | 70.98 | 50.70 |
| | [0.0, 1.0] | 26.73 | 30.34 | 33.85 | 27.70 | 28.67 | 32.95 | 41.91 | 60.67 | 68.13 | 47.05 |
| LLaVA-OV-7B | [0.0, 0.3] | 45.05 | 29.66 | 47.88 | 35.81 | 40.37 | 33.41 | 43.51 | 52.36 | 61.66 | 47.26 |
| | [0.3, 0.6] | 51.49 | 30.79 | 71.49 | 47.75 | 60.09 | 46.54 | 65.83 | 87.42 | 89.90 | 67.80 |
| | [0.6, 0.9] | 53.47 | 29.66 | 72.83 | 48.42 | 56.65 | 47.47 | 66.29 | 87.42 | 90.67 | 67.76 |
| | [0.2, 0.8] | 49.01 | 28.99 | 73.05 | 45.50 | 54.13 | 46.77 | 63.55 | 85.17 | 87.56 | 66.29 |
| | [0.0, 1.0] | 31.68 | 22.25 | 33.63 | 26.80 | 32.57 | 31.11 | 27.56 | 31.69 | 36.53 | 30.14 |

The head ablation results are reported in Tables [4](https://arxiv.org/html/2605.05668#A4.T4) and [5](https://arxiv.org/html/2605.05668#A4.T5)\. Overall, intervening on mid\-quantile heads consistently outperforms modifying either tail, while the models are more sensitive to perturbations on the lower\-quantile heads\. Under our head partition criterion, these lower\-quantile heads primarily attend to non\-visual \(text\) tokens; altering them therefore disrupts textual representations and degrades performance\. For reference, we highlight the default affected\-head setting for each architecture in red in the tables\.

Table 6: Ablation Study on Affected Layers \(Part I\): Evaluation of General Perception and Reasoning Capabilities across Different Layer Configurations\. The default settings are highlighted in bold red\. Columns VQAv2 through SEED belong to the General group; SciQA through GQA belong to the Reasoning group\.

| Model | Layers | VQAv2 | VizWiz | OKVQA | MMVet | A-OKVQA | MMStar | SEED | SciQA | RWQA | MMMU | GQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B | [2, 7] | 46.30 | 42.40 | 49.88 | 41.73 | 47.53 | 31.83 | 42.22 | 38.69 | 28.21 | 28.12 | 46.45 |
| | [2, 13] | 36.81 | 34.31 | 39.26 | 28.78 | 32.47 | 26.37 | 30.86 | 36.20 | 30.73 | 23.80 | 35.21 |
| | [6, 11] | 58.33 | 51.96 | 61.23 | 44.60 | 56.24 | 28.27 | 43.21 | 43.89 | 27.29 | 32.93 | 47.19 |
| | [12, 17] | 68.52 | 68.87 | 73.58 | 46.04 | 67.06 | 32.54 | 57.04 | 52.71 | 35.09 | 35.34 | 63.57 |
| | [14, 25] | 71.30 | 66.42 | 77.78 | 49.64 | 70.35 | 34.68 | 60.99 | 58.37 | 38.30 | 36.06 | 71.88 |
| | [18, 23] | 71.30 | 72.06 | 81.73 | 53.24 | 73.41 | 38.24 | 63.46 | 56.79 | 36.24 | 35.10 | 70.17 |
| | [18, 29] | 72.92 | 65.93 | 79.51 | 47.48 | 70.35 | 34.44 | 58.77 | 57.69 | 36.24 | 34.62 | 70.42 |
| | [22, 31] | 67.82 | 68.14 | 77.53 | 47.48 | 67.76 | 31.35 | 59.51 | 57.24 | 36.70 | 36.54 | 66.99 |
| | [24, 29] | 69.91 | 66.18 | 77.28 | 45.32 | 70.82 | 33.02 | 63.21 | 58.60 | 39.68 | 34.62 | 68.22 |
| LLaVA-OV-7B | [0, 6] | 65.97 | 68.14 | 70.12 | 44.60 | 65.41 | 40.86 | 56.05 | 66.06 | 43.12 | 35.58 | 71.64 |
| | [0, 13] | 59.95 | 60.78 | 62.72 | 40.29 | 58.82 | 40.38 | 54.32 | 59.73 | 33.49 | 30.77 | 60.64 |
| | [7, 13] | 82.87 | 81.62 | 83.95 | 61.87 | 86.35 | 47.98 | 77.28 | 78.28 | 54.13 | 40.14 | 80.44 |
| | [14, 20] | 84.03 | 87.01 | 86.91 | 63.31 | 86.59 | 48.93 | 75.80 | 80.09 | 52.29 | 42.07 | 82.15 |
| | [14, 27] | 83.33 | 83.09 | 86.42 | 64.03 | 87.06 | 52.97 | 76.30 | 81.00 | 54.59 | 43.99 | 85.82 |
| | [21, 27] | 83.33 | 85.29 | 88.15 | 68.35 | 85.88 | 52.97 | 77.04 | 81.67 | 54.13 | 43.27 | 84.60 |

Table 7: Ablation Study on Affected Layers \(Part II\): Performance Analysis on Document/Chart Understanding and OCR Tasks\. The default settings are highlighted in bold red\. Columns Vista and Vision belong to the Math group; DocVQA through AI2D belong to the Doc & Chart group; TextVQA and OCRVQA belong to the OCR group\.

| Model | Layers | Vista | Vision | DocVQA | Table | ChartQA | InfoVQA | AI2D | TextVQA | OCRVQA | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B | [2, 7] | 25.25 | 24.27 | 30.51 | 27.25 | 26.83 | 29.72 | 32.57 | 44.04 | 43.01 | 36.34 |
| | [2, 13] | 28.71 | 28.76 | 25.17 | 23.42 | 27.75 | 23.27 | 28.02 | 38.88 | 31.61 | 30.52 |
| | [6, 11] | 27.23 | 26.52 | 35.63 | 24.77 | 24.31 | 34.10 | 37.13 | 48.31 | 52.85 | 40.30 |
| | [12, 17] | 23.76 | 27.42 | 34.52 | 29.50 | 26.15 | 31.80 | 38.95 | 60.45 | 65.54 | 46.92 |
| | [14, 25] | 25.74 | 27.42 | 37.42 | 31.98 | 28.67 | 31.34 | 42.82 | 61.80 | 66.58 | 49.48 |
| | [18, 23] | 25.25 | 26.97 | 38.75 | 29.50 | 32.80 | 31.80 | 42.82 | 63.37 | 70.98 | 50.70 |
| | [18, 29] | 28.71 | 28.54 | 41.43 | 29.50 | 26.15 | 28.80 | 42.14 | 62.47 | 65.80 | 49.10 |
| | [22, 31] | 35.15 | 30.11 | 40.09 | 31.31 | 29.59 | 31.11 | 42.60 | 60.90 | 68.65 | 49.33 |
| | [24, 29] | 28.22 | 24.27 | 37.19 | 27.03 | 33.26 | 30.65 | 45.10 | 59.10 | 67.36 | 48.95 |
| LLaVA-OV-7B | [0, 6] | 37.13 | 27.42 | 49.89 | 38.06 | 44.95 | 31.80 | 51.71 | 64.27 | 68.13 | 52.05 |
| | [0, 13] | 37.62 | 28.09 | 44.99 | 32.21 | 35.55 | 36.87 | 46.01 | 55.51 | 68.39 | 47.36 |
| | [7, 13] | 53.96 | 30.11 | 71.94 | 42.57 | 50.46 | 47.24 | 64.24 | 84.72 | 87.31 | 65.37 |
| | [14, 20] | 54.46 | 28.31 | 73.50 | 46.62 | 52.52 | 43.32 | 64.69 | 84.49 | 88.86 | 66.30 |
| | [14, 27] | 49.50 | 27.42 | 71.94 | 45.72 | 58.94 | 45.16 | 65.15 | 87.42 | 89.12 | 66.95 |
| | [21, 27] | 51.49 | 30.79 | 71.49 | 47.75 | 60.09 | 46.54 | 65.83 | 87.42 | 89.90 | 67.80 |

The layer ablation results are reported in Tables [6](https://arxiv.org/html/2605.05668#A4.T6) and [7](https://arxiv.org/html/2605.05668#A4.T7)\. We observe that LVLMs are highly sensitive to interventions in early layers, whereas perturbing middle or late layers typically causes only minor changes\. For instance, for LLaVA\-1\.5\-7B, intervening on Layers 1–7 reduces accuracy by 13%, while intervening on Layers 22–31 incurs only a 0\.5% drop\. This pattern further supports a pervasive issue in current LVLMs: a substantial fraction of decoder attention computation is redundant\.

### D\.2 Extending SAP to Other Architectures and Larger Variants

Table [8](https://arxiv.org/html/2605.05668#A4.T8) shows that head\-percentile interventions yield consistent, model\-dependent optima across Qwen\-2\.5\-VL variants under the same affected\-layer range \($[1,27]$\)\. In particular, mid\-percentile heads \(e\.g\., $[0.3,0.6]$\) are frequently the best\-performing choice for several variants, while extreme ranges can be substantially less stable for some models\. Overall, these results indicate that the sensitivity of SAP\-style interventions is structured rather than uniform across heads, motivating architecture\-aware head selection in subsequent experiments\.

Table 8: Head ablation on Qwen\-2\.5\-VL architecture variants \(affected layers fixed to $[1,27]$ on the decoder\)\. Each column reports VMC accuracy under a head percentile interval $[h_{\min}, h_{\max}]$; the best setting per model is bolded\.

| Model | [0.0, 0.3] | [0.3, 0.6] | [0.6, 0.9] | [0.2, 0.8] | [0.0, 1.0] |
|---|---|---|---|---|---|
| CoF-rl-model-7b | 0.567 | **0.637** | 0.631 | 0.613 | 0.461 |
| CoF-sft-model-7b | 0.558 | **0.640** | 0.634 | 0.595 | 0.415 |
| MM-Eureka-Qwen-32B | **0.727** | 0.706 | 0.676 | 0.457 | 0.161 |
| MM-Eureka-Qwen-7B | 0.531 | 0.623 | **0.627** | 0.436 | 0.151 |
| Ocean\_R1\_7B\_Instruct | **0.639** | 0.594 | 0.499 | 0.248 | 0.086 |
| Orsta-7B | 0.537 | **0.604** | 0.599 | 0.309 | 0.087 |
| Qwen2.5-VL-32B-Instruct | 0.827 | **0.829** | 0.828 | 0.816 | 0.791 |
| reverse\_qwen25\_vl | 0.001 | 0.001 | 0.003 | 0.001 | **0.007** |

## Appendix E Layer\-wise Attention Tracing

##### Tracing cross\-patch interactions\.

We provide a visualization tool to trace layer\-wise visual interactions from decoder attention\. For each layer $l$, we construct a visual interaction graph $\mathcal{G}^{(l)} = (\mathcal{V}, \mathcal{E}^{(l)})$ over visual patches \(Abnar and Zuidema, [2020](https://arxiv.org/html/2605.05668#bib.bib98)\), where $\mathcal{V} = \{1,\dots,S_{v}\}$ indexes visual tokens and edges are induced by thresholded visual\-to\-visual attention\. Let $\mathbf{A}^{(l)} \in [0,1]^{S\times S}$ denote the head\-averaged attention matrix at layer $l$\. Restricting to the visual block yields $\mathbf{A}^{(l)}_{vv} \in [0,1]^{S_{v}\times S_{v}}$\. We include a directed edge $j \to i$ whenever

$$A^{(l)}_{vv}(i,j) \;\geq\; \tau, \qquad \tau = 0.1,$$

interpreting $A^{(l)}_{vv}(i,j)$ as patch $i$ attending to patch $j$\.
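
The graph construction can be sketched as below\. This is an illustration only: matrices are plain nested lists, indices are 0\-based, the visual tokens are assumed to occupy one contiguous span `[v_start, v_end)`, and self\-loops are kept because the thresholding rule does not exclude them\. All function names are hypothetical\.

```python
def head_average(per_head):
    """Average a list of square per-head attention matrices into one matrix."""
    n = len(per_head)
    size = len(per_head[0])
    return [[sum(h[i][j] for h in per_head) / n for j in range(size)]
            for i in range(size)]


def visual_block(A, v_start, v_end):
    """Restrict a full S x S matrix to the visual-to-visual block."""
    return [row[v_start:v_end] for row in A[v_start:v_end]]


def interaction_edges(A_vv, tau=0.1):
    """Return directed edges (j, i), read as j -> i, wherever A_vv[i][j] >= tau."""
    return [(j, i)
            for i, row in enumerate(A_vv)
            for j, weight in enumerate(row)
            if weight >= tau]
```

Each returned pair `(j, i)` records that patch `i` attends to patch `j` with weight at least `tau`\.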

##### Constructing key regions from COCO instance annotations\.

To operationalize question\-relevant visual evidence, we leverage the fact that POPE samples are drawn from MSCOCO images and thus inherit COCO instance\-level object annotations with localization information \(e\.g\., bounding boxes\) \(Lin et al\., [2014](https://arxiv.org/html/2605.05668#bib.bib79)\)\. For each POPE query, we identify the referenced object category and retrieve its annotated bounding box\(es\)\. After applying the same image preprocessing as the LVLM \(e\.g\., resizing and patchification into a $t_{h} \times t_{w}$ visual grid\), we map each bounding box to a set of visual patch indices by marking all patches whose spatial support intersects the box\. The union of these patches forms the key\-patch set $\mathcal{K} \subseteq \mathcal{V} = \{1,\dots,S_{v}\}$, which we use below to quantify how much of the layer\-wise visual interaction graph is routed through question\-relevant regions\.
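
A minimal sketch of the box\-to\-patch mapping follows\. It assumes that boxes use the COCO `(x, y, width, height)` pixel convention and have already been rescaled to the resized image, that patches form a uniform grid, and that patch indices are 0\-based in row\-major order\. The function name `bbox_to_patches` is hypothetical\.

```python
def bbox_to_patches(box, img_w, img_h, t_w, t_h):
    """Return the set of patch indices whose cell intersects the bounding box.

    box is (x, y, w, h) in pixels on the resized image; the grid has
    t_h rows and t_w columns, indexed row-major from 0.
    """
    x, y, w, h = box
    cell_w, cell_h = img_w / t_w, img_h / t_h
    key_patches = set()
    for r in range(t_h):
        for c in range(t_w):
            # Open-interval overlap test along each axis.
            overlaps_x = c * cell_w < x + w and (c + 1) * cell_w > x
            overlaps_y = r * cell_h < y + h and (r + 1) * cell_h > y
            if overlaps_x and overlaps_y:
                key_patches.add(r * t_w + c)
    return key_patches
```

Taking the union of the returned sets over all boxes of the referenced category gives the key\-patch set for one query\.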

##### Key\-region degree ratio\.

For each layer, we treat $\mathcal{G}^{(l)}$ as the visual interaction graph and quantify how much interaction mass is routed through question\-relevant regions\. Let $\mathcal{K} \subseteq \mathcal{V}$ be the set of key patches that correspond to question\-relevant visual evidence\. Define the key\-region degree ratio as

$$\rho^{(l)} \;=\; \frac{\big|\{(j \to i) \in \mathcal{E}^{(l)} :\; i \in \mathcal{K}\ \text{or}\ j \in \mathcal{K}\}\big|}{|\mathcal{E}^{(l)}|}.$$

We randomly sampled 100 correctly answered cases and 100 incorrectly answered cases, and computed $\rho^{(l)}$ for each case; see Figures [5](https://arxiv.org/html/2605.05668#A6.F5)–[10](https://arxiv.org/html/2605.05668#A6.F10) for case studies\. Averaged across samples, the key\-region degree ratio is 4\.2% for incorrect answers versus 13\.1% for correct answers, indicating that failures are associated with substantially weaker attention\-mediated interaction around question\-relevant visual evidence, consistent with systematic misallocation of visual attention\.
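
The ratio itself reduces to a simple count over edges, as in the sketch below\. Edges are assumed to be `(j, i)` pairs and the key patches a Python set; returning 0\.0 for a layer with no edges is a convention chosen for this sketch, since the ratio is undefined in that case\. The function name is hypothetical\.

```python
def key_region_degree_ratio(edges, key_patches):
    """Fraction of edges (j, i) with at least one endpoint in key_patches."""
    if not edges:
        return 0.0  # convention for an empty graph (assumption)
    touching = sum(1 for (j, i) in edges if i in key_patches or j in key_patches)
    return touching / len(edges)
```

Averaging this value over layers or samples is left to the caller, because the text does not specify the aggregation order\.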

## Appendix F Theorem and Proofs

###### Lemma F\.1 \(Range of $\Delta\mathcal{S}$, $\Delta\mathcal{D}$, and RID\)\.

For $\mathbf{X}, \mathbf{X}^{\prime} \in \mathbb{R}^{S\times H}$, we have $\Delta\mathcal{S}(\mathbf{X} \mid \mathbf{X}^{\prime}) \in [0,1]$ and $\Delta\mathcal{D}(\mathbf{X} \mid \mathbf{X}^{\prime}) \in [0,1]$\. Consequently, $\mathrm{RID}(\mathbf{X} \mid \mathbf{X}^{\prime}) \in [0,2]$\.

###### Proof\.

Since $\mathrm{eRank}(\mathbf{Z}) \in [1, \min\{S,H\}]$ for any $\mathbf{Z}$, we have $0 \leq |\mathrm{eRank}(\mathbf{X}^{\prime}) - \mathrm{eRank}(\mathbf{X})| \leq \min\{S,H\}$, hence

$$\Delta\mathcal{S}(\mathbf{X} \mid \mathbf{X}^{\prime}) = \frac{\big|\mathrm{eRank}(\mathbf{X}^{\prime}) - \mathrm{eRank}(\mathbf{X})\big|}{\min\{S,H\}} \in [0,1].$$

Let $\mathbf{P}$ be any orthogonal projector\. Then $\mathbf{I} - \mathbf{P}$ is also an orthogonal projector and is non\-expansive: $\|(\mathbf{I} - \mathbf{P})\mathbf{Z}\|_{F} \leq \|\mathbf{Z}\|_{F}$ and $\|\mathbf{Z}(\mathbf{I} - \mathbf{P})\|_{F} \leq \|\mathbf{Z}\|_{F}$\. Applying this with $\mathbf{P} = \mathbf{P}_{\mathcal{C}(\mathbf{X})}$ and $\mathbf{P} = \mathbf{P}_{\mathcal{R}(\mathbf{X})}$ yields

$$\big\|(\mathbf{I} - \mathbf{P}_{\mathcal{C}(\mathbf{X})})\mathbf{X}^{\prime}\big\|_{F} + \big\|\mathbf{X}^{\prime}(\mathbf{I} - \mathbf{P}_{\mathcal{R}(\mathbf{X})})\big\|_{F} \leq 2\|\mathbf{X}^{\prime}\|_{F}.$$

Therefore, under the normalization by $2\|\mathbf{X}^{\prime}\|_{F}$,

$$\Delta\mathcal{D}(\mathbf{X} \mid \mathbf{X}^{\prime}) = \frac{\big\|(\mathbf{I} - \mathbf{P}_{\mathcal{C}(\mathbf{X})})\mathbf{X}^{\prime}\big\|_{F} + \big\|\mathbf{X}^{\prime}(\mathbf{I} - \mathbf{P}_{\mathcal{R}(\mathbf{X})})\big\|_{F}}{2\|\mathbf{X}^{\prime}\|_{F}} \in [0,1].$$

Finally, $\mathrm{RID} = \Delta\mathcal{S} + \Delta\mathcal{D}$ gives $\mathrm{RID}(\mathbf{X} \mid \mathbf{X}^{\prime}) \in [0,2]$\. ∎

###### Theorem F\.2 \(Eckart–Young–Mirsky Theorem \(Eckart and Young, [1936](https://arxiv.org/html/2605.05668#bib.bib41)\)\)\.

Let $\mathbf{X}$ have SVD as in Definition [3\.3](https://arxiv.org/html/2605.05668#S3.Thmtheorem3)\. For any $k \leq Q$, define the rank\-$k$ truncation

$$\mathbf{X}_{k} = \sum_{i=1}^{k} \sigma_{i} \mathbf{u}_{i} \mathbf{v}_{i}^{\top}.$$

Then $\mathbf{X}_{k}$ solves the best rank\-$k$ approximation problem under the Frobenius norm:

$$\mathbf{X}_{k} \in \arg\min_{\mathrm{rank}(\mathbf{Y}) \leq k} \|\mathbf{X} - \mathbf{Y}\|_{F}.$$
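
As a brief illustration that is not part of the original statement, the minimum attained by the truncation has a standard closed form\. Here we assume that $Q$ counts the singular values $\sigma_{1} \geq \dots \geq \sigma_{Q}$ of $\mathbf{X}$ in the referenced definition:

$$\|\mathbf{X} - \mathbf{X}_{k}\|_{F}^{2} \;=\; \sum_{i=k+1}^{Q} \sigma_{i}^{2}.$$

For example, for $\mathbf{X} = \mathrm{diag}(3, 2, 1)$ and $k = 1$, the truncation is $\mathbf{X}_{1} = \mathrm{diag}(3, 0, 0)$ and the approximation error is $\sqrt{2^{2} + 1^{2}} = \sqrt{5} \approx 2.236$\.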

###### Theorem F\.3 \(Expectation Equivalence under Attention Noise Injection\)\.

Scenario 1: random $\mathbf{QKV}$\. Consider an attention head with $N$ key\-value pairs\. Let $Q_{\text{noise}}, K_{\text{noise}}, V_{\text{noise}}$ be the random Gaussian replacements for the original query, key, and value matrices, where each has the same mean and variance as the original $Q, K, V$, respectively\. The attention output in scenario \(1\) \(replacing $Q, K, V$ by noise\) for a single query can be written as a weighted sum of the value vectors:

$$Y_{\text{noise}} \;=\; \sum_{i=1}^{N} a_{i}\, v_{i}^{\text{(noise)}},$$

where $v_{i}^{\text{(noise)}}$ is the $i$\-th row of $V_{\text{noise}}$ and $a_{i}$ is the attention weight for key $i$ given by the softmax:

$$a_{i} \;=\; \frac{\exp\!\big((q^{\text{(noise)}})^{\top} k_{i}^{\text{(noise)}} / \sqrt{d}\big)}{\sum_{j=1}^{N} \exp\!\big((q^{\text{(noise)}})^{\top} k_{j}^{\text{(noise)}} / \sqrt{d}\big)},$$

with $q^{\text{(noise)}}$ the query vector and $k_{i}^{\text{(noise)}}$ the $i$\-th key \(row of $K_{\text{noise}}$\)\. By construction of the softmax, $\sum_{i=1}^{N} a_{i} = 1$ for any realization\. Under the assumption that the random keys $(k_{1}^{\text{(noise)}}, \dots, k_{N}^{\text{(noise)}})$ are i\.i\.d\. \(making all key positions statistically symmetric\), the attention weights $\{a_{i}\}$ are an exchangeable set\. In particular, by symmetry we have $\mathbb{E}[a_{i}] = \frac{1}{N}$ for each $i$\. Now taking the expectation of $Y_{\text{noise}}$ \(over the random $Q_{\text{noise}}, K_{\text{noise}}, V_{\text{noise}}$\) and using the law of total expectation, we get:

$$\mathbb{E}[Y_{\text{noise}}] \;=\; \mathbb{E}\Big[\sum_{i=1}^{N} a_{i}\, v_{i}^{\text{(noise)}}\Big] \;=\; \mathbb{E}\Big[\mathbb{E}\big[\sum_{i=1}^{N} a_{i}\, v_{i}^{\text{(noise)}} \mid V_{\text{noise}}\big]\Big].$$

Conditioning on the random values $V_{\text{noise}} = \{v_{i}^{\text{(noise)}}\}_{i=1}^{N}$, the attention weights are independent of $V_{\text{noise}}$ and still satisfy $\mathbb{E}[a_{i} \mid V_{\text{noise}}] = \frac{1}{N}$\. Thus

$$\mathbb{E}\Big[\sum_{i=1}^{N} a_{i}\, v_{i}^{\text{(noise)}} \,\Big|\, V_{\text{noise}}\Big] \;=\; \sum_{i=1}^{N} \mathbb{E}[a_{i} \mid V_{\text{noise}}]\, v_{i}^{\text{(noise)}} \;=\; \frac{1}{N} \sum_{i=1}^{N} v_{i}^{\text{(noise)}}.$$

The right\-hand side is simply the average of the $N$ i\.i\.d\. random value vectors\. Therefore, its expectation is the mean of the $V_{\text{noise}}$ distribution:

$$\mathbb{E}[Y_{\text{noise}}] \;=\; \mathbb{E}\Big[\frac{1}{N} \sum_{i=1}^{N} v_{i}^{\text{(noise)}}\Big] \;=\; \mathbb{E}[v_{i}^{\text{(noise)}}] \;=\; \mu_{V},$$

where $\mu_{V}$ denotes the mean of the original $V$ \(and $V_{\text{noise}}$\) distribution\.
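
The symmetry step, namely that each attention weight has expectation $1/N$ when the keys are i\.i\.d\., can be checked numerically with a small Monte Carlo sketch\. The setup below is an assumption made for illustration: queries and keys are drawn from a standard normal distribution, and the sizes, trial count, and seed are arbitrary\. The function names are hypothetical\.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_attention_weights(n_keys=4, dim=8, trials=10000, seed=0):
    """Estimate E[a_i] for one random query attending over i.i.d. Gaussian keys."""
    rng = random.Random(seed)
    totals = [0.0] * n_keys
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        keys = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_keys)]
        # Scaled dot-product logits, as in the softmax expression above.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        for i, weight in enumerate(softmax(logits)):
            totals[i] += weight
    return [t / trials for t in totals]
```

With `n_keys = 4`, every estimated mean should lie close to 0\.25, even though the weights in any single draw are far from uniform\.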

Scenario 2: random $\mathbf{\Delta}$\. In scenario \(2\), where we directly replace the final attention output with Gaussian noise of the same distribution as the true $Y = AV$ \(with $A$ the attention matrix\), the injected output $Y_{\text{direct}}$ is a Gaussian random vector with mean set to $\mu_{Y}$, the mean of the original attention output\. Typically, if the original model’s parameters are approximately zero\-mean \(as is common in weight initialization\), the distribution of the true attention output $Y$ will have mean $\mu_{Y} \approx 0$\. In our case above, we found $\mu_{Y} = \mu_{V}$, since the attention mechanism produces a convex combination of the values\. Under the assumption that the original attention output’s mean $\mu_{Y}$ equals $\mu_{V}$ \(which holds, for example, if weights are zero\-centered so that queries and keys induce no bias in attention, or more generally under the symmetry argument given\), we have

$$\mathbb{E}[Y_{\text{direct}}] = \mu_{Y} = \mu_{V} = \mathbb{E}[Y_{\text{noise}}].$$

Thus, the mean of the noise\-injected output in scenario \(1\) is the same as the mean of the direct noise output in scenario \(2\)\. In other words, both replacement strategies produce outputs with the same expected mean\.

###### Theorem F\.4\(Manifold Coincidence Theorem for RID\)\.

We aim to show that ifRID\(X∣X′\)=0\\mathrm\{RID\}\(X\\mid X^\{\\prime\}\)=0, thenXXandX′X^\{\\prime\}share the same manifold structure – in particular,X′X^\{\\prime\}lies in the same underlying subspace asXXwith equivalent spectral complexity\. By the definition ofRepresentation Information Discrepancy \(RID\), we have

RID\(X∣X′\)=Δ𝒮\(X∣X′\)\+Δ𝒟\(X∣X′\)\.\\mathrm\{RID\}\(X\\mid X^\{\\prime\}\)\\;=\\;\\Delta\\mathcal\{S\}\(X\\mid X^\{\\prime\}\)\\;\+\\;\\Delta\\mathcal\{D\}\(X\\mid X^\{\\prime\}\)\.The conditionRID\(X∣X′\)=0\\mathrm\{RID\}\(X\\mid X^\{\\prime\}\)=0necessitates that both non\-negative components vanish:Δ𝒮\(X∣X′\)=0\\Delta\\mathcal\{S\}\(X\\mid X^\{\\prime\}\)=0andΔ𝒟\(X∣X′\)=0\\Delta\\mathcal\{D\}\(X\\mid X^\{\\prime\}\)=0\.

Firstly, the conditionΔ𝒮\(X∣X′\)=0\\Delta\\mathcal\{S\}\(X\\mid X^\{\\prime\}\)=0implies the invariance of the spectral complexity as measured by the effective rank\. Since the effective rank serves as a continuous proxy for the number of active degrees of freedom, its conservation indicates that the intrinsic dimensionality of the representation remains unchanged\. Under the manifold hypothesis characterizing𝐗\\mathbf\{X\}, this implies that the algebraic rank is preserved, i\.e\.,rank⁡\(X′\)=rank⁡\(X\)=r\\operatorname\{rank\}\(X^\{\\prime\}\)=\\operatorname\{rank\}\(X\)=r\. Consequently, both matrices reside within the same fixed\-rank manifold geometryℳr\\mathcal\{M\}\_\{r\}\.

Secondly,Δ𝒟\(X∣X′\)=0\\Delta\\mathcal\{D\}\(X\\mid X^\{\\prime\}\)=0signifies thatX′X^\{\\prime\}introduces no new*information support*relative toXX\. By the definition of support innovation, the projection residuals must be zero:

‖\(I−𝐏𝒞\(X\)\)X′‖F=0,‖X′\(I−𝐏ℛ\(X\)\)‖F=0,\\big\\\|\(I\-\\mathbf\{P\}\_\{\\mathcal\{C\}\(X\)\}\)\\,X^\{\\prime\}\\big\\\|\_\{F\}=0,\\qquad\\big\\\|X^\{\\prime\}\\,\(I\-\\mathbf\{P\}\_\{\\mathcal\{R\}\(X\)\}\)\\big\\\|\_\{F\}=0,where𝐏𝒞\(X\)\\mathbf\{P\}\_\{\\mathcal\{C\}\(X\)\}and𝐏ℛ\(X\)\\mathbf\{P\}\_\{\\mathcal\{R\}\(X\)\}are the orthogonal projectors onto the column space𝒞\(X\)\\mathcal\{C\}\(X\)and row spaceℛ\(X\)\\mathcal\{R\}\(X\)ofXX, respectively\. These conditions are algebraically equivalent to:

𝒞\(X′\)⊆𝒞\(X\),ℛ\(X′\)⊆ℛ\(X\)\.\\mathcal\{C\}\(X^\{\\prime\}\)\\subseteq\\mathcal\{C\}\(X\),\\qquad\\mathcal\{R\}\(X^\{\\prime\}\)\\subseteq\\mathcal\{R\}\(X\)\.Having established thatrank⁡\(X′\)=rank⁡\(X\)=r\\operatorname\{rank\}\(X^\{\\prime\}\)=\\operatorname\{rank\}\(X\)=r, it follows thatdim\(𝒞\(X′\)\)=dim\(𝒞\(X\)\)=r\\dim\(\\mathcal\{C\}\(X^\{\\prime\}\)\)=\\dim\(\\mathcal\{C\}\(X\)\)=r\. A fundamental result in linear algebra states that if a subspace𝒱\\mathcal\{V\}is contained in a subspace𝒲\\mathcal\{W\}of the same finite dimension, then𝒱=𝒲\\mathcal\{V\}=\\mathcal\{W\}\. Therefore, we conclude:

𝒞\(X′\)=𝒞\(X\),ℛ\(X′\)=ℛ\(X\)\.\\mathcal\{C\}\(X^\{\\prime\}\)=\\mathcal\{C\}\(X\),\\qquad\\mathcal\{R\}\(X^\{\\prime\}\)=\\mathcal\{R\}\(X\)\.This proves thatX′X^\{\\prime\}shares exactly the same left and right singular vector subspaces asXX, meaning the*information support*is identical:𝒟X′=𝒟X\\mathcal\{D\}\_\{X^\{\\prime\}\}=\\mathcal\{D\}\_\{X\}\. Combined with the unchanged spectrum \(𝒮X′=𝒮X\\mathcal\{S\}\_\{X^\{\\prime\}\}=\\mathcal\{S\}\_\{X\}\), we have

$$\mathcal{I}(X')\;=\;(\mathcal{S}_{X'},\,\mathcal{D}_{X'})\;=\;(\mathcal{S}_{X},\,\mathcal{D}_{X})\;=\;\mathcal{I}(X).$$

In conclusion, when $\mathrm{RID}(X\mid X')=0$, $X'$ contains no new representation information compared to $X$. Geometrically, $X$ and $X'$ coincide in the manifold parameterization: they possess the same rank and occupy the same supporting subspaces. Thus, $X$ and $X'$ share one manifold space, differing only by an internal reconfiguration of information within that shared subspace.
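The support argument above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it builds the orthogonal projectors onto the column and row spaces of $X$ from its SVD and evaluates the two Frobenius-norm residuals from the proof. The function name `support_residuals`, the tolerance, and the test matrices are assumptions introduced for illustration.

```python
import numpy as np

def support_residuals(X, X_prime, tol=1e-10):
    """Return (||(I - P_C(X)) X'||_F, ||X' (I - P_R(X))||_F)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int((s > tol * s[0]).sum())   # numerical rank of X
    P_col = U[:, :r] @ U[:, :r].T     # projector onto the column space of X
    P_row = Vt[:r].T @ Vt[:r]         # projector onto the row space of X
    col_res = np.linalg.norm((np.eye(X.shape[0]) - P_col) @ X_prime)
    row_res = np.linalg.norm(X_prime @ (np.eye(X.shape[1]) - P_row))
    return col_res, row_res

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((8, 3)))
V, _ = np.linalg.qr(rng.standard_normal((6, 3)))
S = np.diag([3.0, 2.0, 1.0])
X = U @ S @ V.T                       # rank-3 reference matrix

# Internal reconfiguration: same spectrum, same column and row spaces.
Q1, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))
X_same = (U @ Q1) @ S @ (V @ Q2).T

# Support innovation: add a rank-one term along random new directions.
X_new = X + np.outer(rng.standard_normal(8), rng.standard_normal(6))

print(support_residuals(X, X_same))   # both residuals near zero
print(support_residuals(X, X_new))    # both residuals clearly nonzero
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X_same))
```

Here `X_same` plays the role of an update that only reconfigures information inside the shared subspaces, while `X_new` leaves them and therefore yields nonzero residuals.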

![Refer to caption](https://arxiv.org/html/2605.05668v1/x5.png)

Figure 5: Case 1. Layer-wise visual attention tracing. Only layer 23 exhibits cross-patch interactions within the key region (cows).

![Refer to caption](https://arxiv.org/html/2605.05668v1/x6.png)

Figure 6: Case 2. Layer-wise visual attention tracing. Layers 23 and 26 exhibit cross-patch interactions within the key region (surfboard).

![Refer to caption](https://arxiv.org/html/2605.05668v1/x7.png)

Figure 7: Case 3. Layer-wise visual attention tracing. Layers 16 and 17 exhibit cross-patch interactions within the key region (person).

![Refer to caption](https://arxiv.org/html/2605.05668v1/x8.png)

Figure 8: Case 4. Layer-wise visual attention tracing. Layers 20-24 exhibit cross-patch interactions within the key region (laptop).

![Refer to caption](https://arxiv.org/html/2605.05668v1/x9.png)

Figure 9: Case 5. Layer-wise visual attention tracing. Layers 23 and 24 exhibit cross-patch interactions within the key region (traffic light).

![Refer to caption](https://arxiv.org/html/2605.05668v1/x10.png)

Figure 10: Case 6. Layer-wise visual attention tracing. Layers 20, 22, and 23 exhibit cross-patch interactions within the key region (suitcase).