Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

arXiv cs.CL Papers

Summary

This paper identifies a specialized subset of attention heads called CoRe heads in multimodal LLMs that exhibit functional sparsity in cross-modal retrieval. Causal interventions show these heads are crucial for multimodal reasoning, and leveraging this sparsity can accelerate inference.

arXiv:2606.05843v1 Announce Type: new


# Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads
Source: [https://arxiv.org/html/2606.05843](https://arxiv.org/html/2606.05843)
Ruoxi Sun¹, Quantong Qiu¹, Juntao Li¹, Zecheng Tang¹, Yihang Lou², Min Zhang¹ (¹Soochow University, ²Peking University) {ljt}@suda.edu.cn

###### Abstract

While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. In this paper, we present an in-depth interpretability study that uncovers a profound structural property within MLLMs: functional sparsity in cross-modal retrieval. Leveraging a token-level metric termed Retrieval Attention Mass (RAM), we identify and characterize a highly specialized subset of attention heads, referred to as Context-aware Retrieval (CoRe) heads. Across diverse visual domains and model scales, we observe a clear functional division: CoRe heads act as dedicated information extractors, while most other heads distribute attention over broader contextual regions. Causal interventions further demonstrate the necessity of these specialized heads. Ablating only the top 5% of CoRe heads causes significant degradation in multimodal reasoning performance, whereas ablating lower-ranked heads has minimal effect. Moreover, acceleration experiments validate the utility of CoRe heads, showing that leveraging this localized sparsity significantly accelerates inference while maintaining robust task performance. Our findings reveal a structural principle of functional sparsity within MLLMs, refining the current understanding of mechanistic interpretability and laying a theoretical foundation that can inspire future architecture design and model optimization.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.05843v1/x1.png) Figure 1: Functional specialization in MLLM attention heads on RefCOCOg. Left (CoRe Heads): High-attention regions correspond to context-relevant objects. Right (Bottom Heads): High-attention regions show weak context relevance.

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in complex vision-language tasks \[[11](https://arxiv.org/html/2606.05843#bib.bib2), [18](https://arxiv.org/html/2606.05843#bib.bib1), [26](https://arxiv.org/html/2606.05843#bib.bib3)\]. These models map high-dimensional visual signals into the semantic space of large language models \[[15](https://arxiv.org/html/2606.05843#bib.bib4)\]. However, real-world visual inputs are often cluttered with redundant information. Consequently, robust multimodal reasoning depends on the model's capacity to selectively retrieve and isolate sparse, task-relevant visual cues from these cluttered or complex spatiotemporal scenes. Yet, despite the empirical success of MLLMs, the precise internal mechanisms governing this critical cross-modal information retrieval remain unexplored, presenting a significant gap in our understanding of multimodal mechanistic interpretability.

![Refer to caption](https://arxiv.org/html/2606.05843v1/x2.png) Figure 2: Mechanistic evidence of functional specialization in MLLM attention heads on VidSTG. Left (CoRe Heads): A sparse subset of specialized heads acts as precise information extractors, surgically isolating context-relevant entities (e.g., "red car", "adult in white") by filtering background noise across key frames. Right (Bottom Heads): The vast majority of heads exhibit semantic dispersion, scattering attention across irrelevant regions and failing to ground the instruction.

While recent interpretability studies have identified specialized attention heads in MLLMs \[[3](https://arxiv.org/html/2606.05843#bib.bib9), [16](https://arxiv.org/html/2606.05843#bib.bib10), [2](https://arxiv.org/html/2606.05843#bib.bib18)\], how these models perform cross-modal information retrieval remains poorly understood. Current frameworks rely on coarse statistical heuristics such as spatial entropy \[[17](https://arxiv.org/html/2606.05843#bib.bib15)\] and attention aggregation \[[12](https://arxiv.org/html/2606.05843#bib.bib17)\], typically evaluated on simple static images. Consequently, these approaches struggle to capture fine-grained, query-conditioned token-level attention in dense or complex spatiotemporal scenes. Without rigorous quantitative methods to characterize these retrieval mechanisms, understanding MLLM failure modes and internal efficiency remains limited.

We find that within MLLMs, a specialized subset of attention heads consistently captures task-relevant visual information during multimodal understanding; we call these CoRe heads. To identify these heads, we quantify the attention mass from context tokens to semantically relevant visual tokens. We evaluate this across multiple complementary datasets covering different visual domains, including video-based spatio-temporal reasoning (VidSTG \[[29](https://arxiv.org/html/2606.05843#bib.bib24)\]), document layout understanding (MMDocIR \[[6](https://arxiv.org/html/2606.05843#bib.bib5)\]), object-level visual grounding (RefCOCOg \[[9](https://arxiv.org/html/2606.05843#bib.bib23)\]), and long-context multimodal reasoning (MMLongBench \[[22](https://arxiv.org/html/2606.05843#bib.bib25)\]). We further analyze a range of MLLMs, including Llava-onevision \[[10](https://arxiv.org/html/2606.05843#bib.bib6)\], InternVL3.5 \[[21](https://arxiv.org/html/2606.05843#bib.bib7)\], and the Qwen3-VL family \[[1](https://arxiv.org/html/2606.05843#bib.bib8)\], across different parameter scales.

Our analysis reveals a distinct functional dichotomy where CoRe heads execute localized visual extraction while standard vision heads facilitate global feature aggregation. Furthermore, we note that as model capacity scales, these CoRe heads progressively concentrate in middle-to-late layers. Evaluations across diverse multimodal datasets reveal a globally consistent yet locally variant paradigm: approximately 30 specific heads remain universally activated, while others dynamically adapt to specific data distributions. To rigorously ascertain their functional necessity, we conduct causal interventions via attention masking. Crucially, ablating just the top 5% of CoRe heads causes a significant performance drop, confirming their indispensable role in cross-modal visual retrieval.

Spatial visualizations of attention weights further reveal a stark functional dichotomy during vision-language reasoning. As shown in Figures [1](https://arxiv.org/html/2606.05843#S1.F1) and [2](https://arxiv.org/html/2606.05843#S1.F2), across both image and video modalities, the top-ranked CoRe heads precisely localize context-relevant visual entities. In contrast, lower-ranked heads exhibit diffuse activation patterns, predominantly covering non-salient background regions. Corroborating our quantitative ablations, these observations confirm that fine-grained, context-aware visual selection is exclusively governed by a sparse subset of highly specialized heads.

The discovery of CoRe heads offers profound implications for the mechanistic interpretability and optimization of MLLMs. Our findings reveal a structured organization in multimodal processing, where a sparse, consistent subset of heads governs cross-modal extraction across diverse modalities and scales. As evidenced by our causal interventions, multimodal reasoning is highly sensitive to these specific heads, highlighting their non-redundant necessity. Crucially, this localized functional sparsity paves the way for efficient model optimization, suggesting that operating on a small fraction of critical heads can significantly accelerate inference while maintaining precise semantic selection.

## 2 Related Work

### 2.1 Mechanistic Interpretability of MLLMs

Recent mechanistic interpretability research on MLLMs shows that their perceptual and reasoning capabilities rely on specialized and sparsely activated attention heads. Statistical and structural analyses of cross-modal attention suggest that certain heads exhibit functional preferences. Recent work suggests that certain attention heads in MLLMs are associated with functional roles such as visual perception and spatial reasoning \[[16](https://arxiv.org/html/2606.05843#bib.bib10)\]. Quantitative analyses based on response scoring \[[19](https://arxiv.org/html/2606.05843#bib.bib11)\], signal-based methods \[[2](https://arxiv.org/html/2606.05843#bib.bib18)\], and entropy measures \[[17](https://arxiv.org/html/2606.05843#bib.bib15)\] suggest that a subset of heads contributes disproportionately to task-specific adaptation and visual representation encoding. Furthermore, causal intervention techniques, including activation patching and representation editing, have been used to investigate candidate functional circuits (e.g., visual counting \[[4](https://arxiv.org/html/2606.05843#bib.bib16)\]) and hierarchical interactions between attention heads and feed-forward networks \[[12](https://arxiv.org/html/2606.05843#bib.bib17)\]. These mechanistic insights also enable training-free interventions to improve model reliability and efficiency. For instance, prior studies have investigated interventions such as causal head modulation \[[20](https://arxiv.org/html/2606.05843#bib.bib12)\], reweighting attention from visual sinks to informative heads \[[7](https://arxiv.org/html/2606.05843#bib.bib13)\], and exploiting abnormal attention patterns for zero-shot hallucination detection \[[28](https://arxiv.org/html/2606.05843#bib.bib14)\]. These methods demonstrate potential benefits in improving VQA performance, mitigating hallucinations, and reducing inference cost. Despite these advances, existing analyses are largely limited to static datasets, which limits their ability to capture fine-grained attention patterns in dynamic or densely structured visual environments.

### 2.2 Cross-Modal Information Retrieval

Current MLLMs exhibit strong performance in complex multimodal reasoning tasks such as multi-hop question answering \[[14](https://arxiv.org/html/2606.05843#bib.bib19)\]. However, the mechanisms underlying their ability to locate and extract task-relevant visual features in dense visual contexts remain poorly understood. Existing literature primarily investigates this cross-modal information retrieval through two distinct analytical lenses: head-level structural sparsity and token-level interpretability. For the former, inspired by text-only LLMs \[[23](https://arxiv.org/html/2606.05843#bib.bib20)\], researchers have identified dedicated subsets of attention heads that naturally govern visual grounding and localization behaviors \[[8](https://arxiv.org/html/2606.05843#bib.bib21), [25](https://arxiv.org/html/2606.05843#bib.bib22)\]. Parallel to these macro-level insights, finer-grained token-level approaches attempt to trace autoregressive generation directly to specific visual regions \[[5](https://arxiv.org/html/2606.05843#bib.bib29), [13](https://arxiv.org/html/2606.05843#bib.bib30)\] and employ explicit mechanisms like Vision-Guided Attention to anchor generated text to tangible visual cues \[[30](https://arxiv.org/html/2606.05843#bib.bib31)\]. Specifically, these token-level strategies typically compute semantic alignment scores between textual instructions and visual patches, forcing the model to assign higher importance to relevant objects. While effective for standard visual contexts, these techniques predominantly rely on static structural priors or post-hoc attributions. Consequently, existing approaches struggle to perform robust cross-modal information retrieval in modern, dense, and dynamic datasets, where target features are deeply embedded and heavily obscured by complex, evolving background distractors.

## 3 Isolating CoRe Heads via Retrieval Attention Mass

To systematically isolate the attention heads responsible for precise visual retrieval and filtering background noise, we extend the text-only probing framework of QRhead \[[27](https://arxiv.org/html/2606.05843#bib.bib26)\] to multimodal settings. Our key idea is simple: a useful head should place more attention from the query tokens onto the *relevant visual regions*, not just anywhere in the sequence. Based on this intuition, we define a token-level metric called Retrieval Attention Mass (RAM) to quantify how much attention a head allocates to the target visual content. For each attention head $h$, we define its RAM score as:

$$\mathcal{M}_{\text{RAM}}^{(h)}(q \rightarrow V^{*}) = \mathbb{E}_{x \in q}\left[\sum_{y \in \Omega(V^{*})} \mathbf{A}^{(h)}_{x \to y}\right] \tag{1}$$
where $q$ represents the set of instructional query tokens, and $\Omega(V^{*})$ denotes the tokenized spatial or temporal span corresponding to the key visual entities $V^{*}$. The term $\mathbf{A}^{(h)}_{x \to y}$ is the attention weight from token $x$ to token $y$ in head $h$. Intuitively, this metric measures how strongly a head directs attention from the query tokens to the relevant visual regions. By averaging over all query tokens, RAM reflects the expected retrieval strength of a head. Importantly, since we restrict the summation to the ground-truth region $\Omega(V^{*})$, high RAM values indicate targeted retrieval, rather than generic attention accumulation. This allows us to distinguish true retrieval heads from attention sinks that absorb probability mass without contributing meaningful cross-modal grounding.
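
To make Equation (1) concrete, the following is a minimal plain-Python sketch, not the authors' implementation (their pseudocode is in Appendix D.2). It assumes one head's attention is available as a row-stochastic matrix `attn[x][y]`; the function name `retrieval_attention_mass` and the toy values are hypothetical.

```python
from typing import Sequence


def retrieval_attention_mass(
    attn: Sequence[Sequence[float]],
    query_idx: Sequence[int],
    target_idx: Sequence[int],
) -> float:
    """RAM of one head: mean over query tokens of the attention mass on target tokens.

    attn[x][y] is the softmax attention weight from token x to token y.
    query_idx plays the role of q, target_idx the role of Omega(V*).
    """
    if not query_idx:
        raise ValueError("query_idx must not be empty")
    per_query_mass = [sum(attn[x][y] for y in target_idx) for x in query_idx]
    return sum(per_query_mass) / len(per_query_mass)


# Toy 4-token sequence (assumed layout): tokens 0-1 are visual, tokens 2-3 are query tokens.
# Each row is a hypothetical causal attention distribution that sums to 1.
toy_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.1, 0.7, 0.2, 0.0],
    [0.2, 0.4, 0.2, 0.2],
]

# Treat visual token 1 as the annotated target region.
ram = retrieval_attention_mass(toy_attn, query_idx=[2, 3], target_idx=[1])
print(round(ram, 3))  # (0.7 + 0.4) / 2 = 0.55
```

Because the sum runs only over the target indices, a head that spreads the same total mass over all visual tokens would receive a lower score than one that concentrates it on the annotated region.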

![Refer to caption](https://arxiv.org/html/2606.05843v1/x3.png) Figure 3: Overview of the CoRe head probing pipeline. The input multimodal sequence is partitioned into general text tokens, key target visual tokens ($t_v$), and instructional query tokens ($t_q$). During MLLM inference, we extract the internal self-attention maps. By aggregating the attention weights directed from $t_q$ to $t_v$, we compute a routing metric ($\mathcal{M}_{\text{RAM}}^{(h)}$) to isolate the sparse CoRe heads responsible for cross-modal information retrieval.

As shown in Figure [3](https://arxiv.org/html/2606.05843#S3.F3), we first partition the multimodal input into three parts: text context, query tokens, and key visual tokens. During a standard forward pass, we extract the full self-attention maps $\mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)$ across all layers and heads. We then focus only on the *text-to-vision* attention, i.e., the sub-matrix where query tokens attend to visual tokens. This step removes irrelevant intra-modal interactions (e.g., text-to-text or vision-to-vision attention), significantly reducing noise. By applying Equation ([1](https://arxiv.org/html/2606.05843#S3.E1)), we aggregate these attention values into a scalar RAM score for each head. Collectively, this produces a global $\mathcal{M}_{\text{RAM}}^{(h)}$ distribution, where only a small subset of heads exhibit high scores. We define these high-scoring heads as CoRe Heads, as they are responsible for focused cross-modal retrieval. Detailed implementations and pseudocode are provided in Appendix [D.2](https://arxiv.org/html/2606.05843#A4.SS2).
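
The final ranking step of this pipeline can be sketched as follows. This is illustrative only: the helper `select_core_heads`, the nested-list layout `ram_scores[layer][head]`, and the `top_fraction` cutoff are assumptions, since this section does not state the exact threshold that separates CoRe heads from the rest.

```python
import math


def select_core_heads(ram_scores, top_fraction=0.05):
    """Return the (layer, head) pairs with the highest RAM scores.

    ram_scores[layer][head] is a scalar RAM score, e.g. averaged over a dataset.
    top_fraction is an assumed cutoff, not a value taken from the paper.
    """
    flat = [
        (score, layer, head)
        for layer, row in enumerate(ram_scores)
        for head, score in enumerate(row)
    ]
    k = max(1, math.ceil(top_fraction * len(flat)))
    # Sort by score only, highest first; ties keep their layer-major order.
    flat.sort(key=lambda item: item[0], reverse=True)
    return [(layer, head) for _, layer, head in flat[:k]]


# Hypothetical scores for a 3-layer, 4-head model: two heads stand out.
toy_scores = [
    [0.01, 0.02, 0.01, 0.03],
    [0.02, 0.40, 0.01, 0.02],
    [0.35, 0.02, 0.03, 0.01],
]
core = select_core_heads(toy_scores, top_fraction=0.15)
print(core)  # [(1, 1), (2, 0)]
```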

Table 1: CoRe Heads Detection Experimental Configuration

#### Unified Evaluation Protocol

To test whether CoRe heads generalize across tasks, we design a unified evaluation protocol covering four diverse multimodal datasets (see Table [1](https://arxiv.org/html/2606.05843#S3.T1) and Appendix [B](https://arxiv.org/html/2606.05843#A2) for detailed descriptions). To capture different types of retrieval, we define the target $V^{*}$ differently for each task: bounding boxes for spatial grounding (RefCOCOg), spatio-temporal tubes for video grounding (VidSTG), multiple evidence regions for multi-hop reasoning (MMLongBench), and document regions for multimodal retrieval (MMDocIR). Since these annotations are continuous (in space or time) but the model operates over discrete tokens, we map each target region to its corresponding token indices $\Omega(V^{*})$. The details of this mapping are provided in Appendix [C](https://arxiv.org/html/2606.05843#A3). This unified formulation allows us to evaluate whether CoRe heads implement a shared retrieval mechanism across modalities and tasks.
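
The mapping itself is specified in Appendix C and is not reproduced here. As a rough illustration of the idea for a single image, the sketch below maps a pixel bounding box to visual token indices under strong assumptions: a row-major patch grid, no token merging or compression, and a known offset of the first visual token. The function `bbox_to_token_indices` and all parameters are hypothetical.

```python
def bbox_to_token_indices(bbox, image_size, grid_size, offset=0):
    """Map a pixel bounding box to the indices of the visual tokens it overlaps.

    bbox       = (x0, y0, x1, y1) in pixels
    image_size = (width, height) in pixels
    grid_size  = (cols, rows) of the assumed patch grid
    offset     = position of the first visual token in the full sequence
    Tokens are assumed to be laid out in row-major order.
    """
    x0, y0, x1, y1 = bbox
    width, height = image_size
    cols, rows = grid_size
    cell_w, cell_h = width / cols, height / rows
    col_start = max(0, int(x0 // cell_w))
    row_start = max(0, int(y0 // cell_h))
    # Subtract a tiny epsilon so a box that ends exactly on a cell border
    # does not spill into the next cell.
    col_end = min(cols - 1, int((x1 - 1e-9) // cell_w))
    row_end = min(rows - 1, int((y1 - 1e-9) // cell_h))
    return [
        offset + r * cols + c
        for r in range(row_start, row_end + 1)
        for c in range(col_start, col_end + 1)
    ]


# An 80x40 image split into a 4x2 grid; visual tokens start at sequence position 5.
tokens = bbox_to_token_indices((10, 10, 50, 30), (80, 40), (4, 2), offset=5)
print(tokens)  # [5, 6, 7, 9, 10, 11]
```

For video, the same idea would be repeated per frame over the frames covered by a spatio-temporal tube, again subject to how the model actually tokenizes frames.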

#### Generalization Across Architectures and Scales

Finally, to verify that CoRe heads are not tied to a specific architecture, we conduct experiments across multiple model families, including Qwen3-VL, LLaVA-OneVision, and InternVL3.5. We also perform a scaling study using different sizes of Qwen3-VL (4B, 8B, 32B). This combined analysis enables us to examine how the sparsity and stability of CoRe heads evolve with model design and scale, and whether they reflect an intrinsic property of multimodal transformers.

## 4 Mechanistic Analysis of CoRe heads

![Refer to caption](https://arxiv.org/html/2606.05843v1/x4.png) Figure 4: Evolution and structural divergence of CoRe heads on the MMDocIR dataset. As model scale increases, attention patterns shift from broadly distributed activations (4B) to a pronounced deep-layer bottleneck (32B). Cross-architecture comparisons further reveal distinct attention topologies: Qwen3-VL exhibits sparse localization, Llava-onevision shows moderate dispersion, and InternVL3.5 presents dense, widespread activations.

### 4.1 Distribution Dynamics across Architectures and Scales

#### Scaling Patterns within Model Families

As illustrated in Figure [4](https://arxiv.org/html/2606.05843#S4.F4), model capacity expansion triggers a transition from uniform dispersion to pronounced structural localization. In smaller variants (4B), active CoRe heads are relatively scattered across the middle layers. However, as the scale reaches 32B, these critical heads converge into a highly concentrated, contiguous block within the deep layers (approximately layers 44–56). This evolution indicates that larger models spontaneously develop a more specialized "bottleneck" for cross-modal semantic integration, delegating complex visual-linguistic retrieval to a compact subset of deep-layer heads.

#### Architectural Divergence across Model Families

Figure [4](https://arxiv.org/html/2606.05843#S4.F4) reveals stark contrasts in attention allocation across models. Qwen3-VL exhibits a highly sparse and localized activation structure, whereas Llava-onevision demonstrates a moderately dense and vertically dispersed pattern. Conversely, InternVL3.5 displays pervasive, high-density activations spanning nearly the entire network depth. We attribute these structural discrepancies to three primary factors: (1) Vision Encoder Capability: InternVL3.5 employs a massive encoder to achieve deep semantic pre-alignment before projection, allowing the LLM to process visual tokens as "native" semantic units with high-density, layer-wise interaction. Lighter encoders shift the modality-bridging burden to the LLM, forcing the emergence of specialized deep-layer hubs. (2) Feature Alignment Paradigm: The multi-stage tuning of Llava-onevision fosters progressive fusion, while the native joint optimization in Qwen3-VL induces sparse specialization to preserve linguistic reasoning capacity. (3) Base Model Adaptation: Notably, despite sharing the Qwen backbone, Qwen3-VL and Llava-onevision exhibit distinct topologies. This confirms that the structural sparsity of multimodal integration is not an intrinsic property of the base LLM, but a dynamic allocation strategy responsive to the density of visual representations and the specific alignment objectives.

### 4.2 Functional Decoupling: Semantic vs. Global Processing

#### Experimental Setting

To further elucidate the functional heterogeneity among internal attention components, we evaluate and contrast two distinct categories of attention heads: (1) CoRe heads, which are systematically isolated using our proposed cross-modal attention allocation metric, and (2) Vision Heads, which are heuristically identified through the direct statistical aggregation of the macroscopic attention mass directed from linguistic query tokens to the entire set of visual tokens.
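
The difference between the two scoring rules can be illustrated with a small sketch. It assumes that the Vision Head score is the same attention-mass average as RAM but summed over all visual tokens instead of only the annotated target; the helper `attention_mass` and the toy attention rows are hypothetical.

```python
def attention_mass(attn, query_idx, key_idx):
    """Mean over query tokens of the attention weight summed over key_idx."""
    return sum(sum(attn[x][y] for y in key_idx) for x in query_idx) / len(query_idx)


# Toy layout (assumed): tokens 0-3 are visual, token 2 is the annotated target,
# token 4 is the single query token. Only the query row of each head is needed.
visual_idx, target_idx, query_idx = [0, 1, 2, 3], [2], [4]
diffuse_head = {4: [0.25, 0.25, 0.20, 0.25, 0.05]}  # spreads mass over the image
focused_head = {4: [0.05, 0.05, 0.60, 0.05, 0.25]}  # concentrates on the target

# Vision Head style score: mass on the entire set of visual tokens.
diffuse_vision = attention_mass(diffuse_head, query_idx, visual_idx)
focused_vision = attention_mass(focused_head, query_idx, visual_idx)

# RAM style score: mass on the target region only.
diffuse_ram = attention_mass(diffuse_head, query_idx, target_idx)
focused_ram = attention_mass(focused_head, query_idx, target_idx)

print(diffuse_vision > focused_vision)  # True: the diffuse head wins on total visual mass
print(focused_ram > diffuse_ram)        # True: the focused head wins on RAM
```

In this toy case the two criteria rank the heads in opposite order, which is the kind of divergence the comparison in this subsection is designed to expose.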

#### Result analysis

From a distributional perspective, the two categories exhibit markedly distinct structural characteristics. As illustrated in the top-left panel of Figure [4](https://arxiv.org/html/2606.05843#S4.F4), the activation profiles of CoRe Heads are highly sparse and localized. These heads govern crucial cross-modal information selection, selectively executing the extraction of key visual features within specific layers. In contrast, the attention distribution of the baseline Vision Heads (depicted in the lower panel) is continuous and diffuse. It exhibits robust responses across the middle-to-late layers with substantially broader coverage along the head dimension. This indicates that their functionality is predominantly oriented toward global visual information aggregation, facilitating the holistic encoding and propagation of visual signals. This phenomenon reveals the spontaneous emergence of a structured functional decomposition mechanism within the network. Rather than relying on homogeneously distributed attention computations, MLLMs achieve efficient cross-modal retrieval through a critical minority of specialized heads, while leveraging a vast ensemble of auxiliary heads to ensure the stable propagation and integration of multimodal information.

### 4.3 Cross-Domain Stability of CoRe heads

![Refer to caption](https://arxiv.org/html/2606.05843v1/x5.png) (a) Spearman correlation of head activations.
![Refer to caption](https://arxiv.org/html/2606.05843v1/x6.png) (b) Spatial distribution of highly activated heads.

Figure 5: Stability of attention heads across multi-modal tasks. (a) The Spearman rank correlation across distinct datasets exhibits high consistency, indicating a shared information retrieval mechanism. (b) The layer-head stability matrix illustrates the distribution of task-agnostic anchor heads. Heads consistently ranked in the top 5% across all tasks are localized in the middle-to-late layers.

#### Experimental Setting

To investigate the stability of CoRe heads across heterogeneous multimodal distributions, we extract head-level activation scores from Qwen3-VL-4B on RefCOCOg, MMLongBench, VidSTG, and MMDocIR. We measure cross-task consistency by computing pairwise Spearman rank correlations between flattened head activation vectors, capturing the agreement in relative head importance across tasks. For each task, we also identify heads within the top 5% of activation scores and aggregate their occurrence frequency across all benchmarks. These frequencies are then mapped onto a layer–head grid to visualize the cross-task persistence of salient heads. Detailed implementations and pseudocode are provided in Appendix [E.1](https://arxiv.org/html/2606.05843#A5.SS1).
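
Both consistency measures can be sketched in a few lines of plain Python. This is a hedged illustration, not the procedure from Appendix E.1: the helper names, the tie handling (average ranks), and the toy scores are assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


def top_fraction_counts(scores_by_task, fraction=0.05):
    """For each head index, count the tasks in which it ranks in the top fraction."""
    counts = {}
    for scores in scores_by_task.values():
        k = max(1, math.ceil(fraction * len(scores)))
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        for i in top:
            counts[i] = counts.get(i, 0) + 1
    return counts


# Hypothetical flattened head scores (layer-major order) for two tasks.
scores = {
    "task_a": [0.10, 0.40, 0.05, 0.30, 0.02],
    "task_b": [0.12, 0.35, 0.01, 0.50, 0.03],
}
rho = spearman(scores["task_a"], scores["task_b"])
counts = top_fraction_counts(scores, fraction=0.4)
print(round(rho, 3))  # 0.8
print(counts)         # {1: 2, 3: 2}
```

A flat index `i` maps back to the layer–head grid as `(i // heads_per_layer, i % heads_per_layer)`, so heads whose count equals the number of tasks correspond to the task-agnostic anchor heads.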

#### Result analysis

As illustrated in Figure [5](https://arxiv.org/html/2606.05843#S4.F5), the identified CoRe heads exhibit consistent global structural characteristics across diverse datasets. As depicted in Figure [5(a)](https://arxiv.org/html/2606.05843#S4.F5.sf1), strong positive correlations are observed across all task pairs, with coefficients ranging from 0.79 to 0.92. Notably, tasks demanding complex structural understanding or long-context reasoning (e.g., MMLongBench and VidSTG) demonstrate an exceptionally high correlation (0.92). This pronounced correlation suggests that the model universally repurposes a highly overlapping set of specialized attention heads to execute cross-modal alignment across varied downstream applications. Furthermore, as shown in Figure [5(b)](https://arxiv.org/html/2606.05843#S4.F5.sf2), the majority of CoRe heads are stably concentrated within the middle layers of the architecture, predominantly distributed between layer 13 and layer 24. Approximately 30 specific heads remain highly activated across all four tasks, with only a marginal fraction of heads exhibiting dataset-specific activation variance. Consequently, the CoRe heads exhibit a "globally consistent, locally variant" distributional paradigm. While the core CoRe head topology demonstrates robust cross-domain generalizability, its precise activation patterns undergo adaptive modulation contingent on the target data distribution. Detailed heatmaps for various configurations are provided in Appendix [F](https://arxiv.org/html/2606.05843#A6).

### 4.4 Causal Impact and Information Sparsity

#### Experimental Setting

To evaluate the functional role and visual information extraction efficiency of the CoRe heads, we conduct both intervention and quantitative analyses on the Qwen3-VL-4B and Llava-onevision architectures. For causal validation, we perform attention head masking during inference by ablating $k \in \{5, 10, 20, 30\}$ heads and measuring the resulting performance degradation on MMLongBench. We consider three ablation strategies: masking the Top-$k$ heads ranked by $\mathcal{M}_{\text{RAM}}^{(h)}$, masking the Bottom-$k$ heads, and randomly masking $k$ heads. In parallel, to assess visual information extraction efficiency, we introduce two metrics: Key Token Ratio, which evaluates the precision of individual heads by measuring the overlap between their top 5% attended tokens and ground-truth critical visual patches, and Key Token Coverage, which measures the collective coverage of such patches by aggregating the top attended tokens across head groups. Detailed implementations and pseudocode are provided in Appendices [E.2](https://arxiv.org/html/2606.05843#A5.SS2), [E.3](https://arxiv.org/html/2606.05843#A5.SS3), and [E.4](https://arxiv.org/html/2606.05843#A5.SS4).
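
The ablation sets and the two efficiency metrics can be sketched as below. This is one plausible reading of the definitions above rather than the code from Appendices E.2 to E.4: the function names are hypothetical, the Key Token Ratio is taken as the fraction of a head's top attended tokens that are key tokens, and the Key Token Coverage as the fraction of key tokens hit by the union of top attended tokens in a head group.

```python
import math
import random


def ablation_sets(ram_by_head, k, seed=0):
    """Head ids to mask under the Top-k, Bottom-k and Random-k strategies."""
    ranked = sorted(ram_by_head, key=ram_by_head.get, reverse=True)
    rng = random.Random(seed)  # fixed seed so the random baseline is reproducible
    return {
        "top": ranked[:k],
        "bottom": ranked[-k:],
        "random": rng.sample(ranked, k),
    }


def top_attended(attn_to_visual, fraction=0.05):
    """Indices of the most attended visual tokens for one head."""
    k = max(1, math.ceil(fraction * len(attn_to_visual)))
    order = sorted(
        range(len(attn_to_visual)), key=lambda i: attn_to_visual[i], reverse=True
    )
    return set(order[:k])


def key_token_ratio(attn_to_visual, key_tokens, fraction=0.05):
    """Share of a head's top attended tokens that are ground-truth key tokens."""
    top = top_attended(attn_to_visual, fraction)
    return len(top & set(key_tokens)) / len(top)


def key_token_coverage(attn_by_head, key_tokens, fraction=0.05):
    """Share of key tokens covered by the union of top tokens over a head group."""
    covered = set()
    for attn in attn_by_head:
        covered |= top_attended(attn, fraction)
    keys = set(key_tokens)
    return len(covered & keys) / len(keys)


# Hypothetical RAM scores keyed by (layer, head).
ram = {(0, 0): 0.40, (0, 1): 0.01, (1, 0): 0.30, (1, 1): 0.02}
sets = ablation_sets(ram, k=2)
print(sets["top"])     # [(0, 0), (1, 0)]
print(sets["bottom"])  # [(1, 1), (0, 1)]

# Hypothetical query-to-visual attention for two heads over 4 visual tokens.
head_a = [0.05, 0.60, 0.30, 0.05]
head_b = [0.50, 0.10, 0.10, 0.30]
key_tokens = [1, 2]
print(key_token_ratio(head_a, key_tokens, fraction=0.5))       # 1.0
print(key_token_ratio(head_b, key_tokens, fraction=0.5))       # 0.0
print(key_token_coverage([head_a, head_b], key_tokens, 0.5))   # 1.0
```

Masking itself (for example zeroing the output of the selected heads during inference) depends on the model implementation and is intentionally left out of this sketch.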

![Refer to caption](https://arxiv.org/html/2606.05843v1/x7.png) (a) Causal impact of head masking.
![Refer to caption](https://arxiv.org/html/2606.05843v1/x8.png) (b) Key visual token concentration.
![Refer to caption](https://arxiv.org/html/2606.05843v1/x9.png) (c) Rapid token coverage saturation.

Figure 6: Quantitative analysis of the causal impact and structural sparsity of CoRe heads in MLLMs. (a) Performance degradation under head masking reveals that top-ranked CoRe heads are causally indispensable for complex multimodal reasoning tasks, whereas bottom-ranked heads yield minimal impact. (b) Key token ratios demonstrate that critical visual semantics are highly concentrated within a compact subset of elite heads. (c) The rapid saturation of cumulative token coverage confirms the structural sparsity of localized cross-modal feature extraction.

#### Assessing Causal Impact via Attention Head Intervention

As illustrated in Figure [6(a)](https://arxiv.org/html/2606.05843#S4.F6.sf1), a pronounced performance divergence among the different ablation strategies emerges as the number of masked heads increases. Masking the Top-$k$ CoRe heads (red lines) induces a precipitous degradation in model performance. For instance, ablating merely the top 5 heads causes the accuracy of the Qwen model to plummet from approximately 45.3 to 27.0, and that of the Llava model from 16.4 to 10.6. When the mask size expands to $k=30$, the multimodal comprehension capabilities of both models effectively collapse, with accuracy scores dropping below 7. This phenomenon strongly validates that the identified CoRe heads govern critical information aggregation and cross-modal retrieval mechanisms during inference. Conversely, masking the Bottom-$k$ heads (blue lines) yields only a marginal performance drop, exhibiting an ablation trajectory that is more robust than the random masking baseline. This performance gap demonstrates that the top-ranked CoRe heads are not merely correlated with multimodal semantic integration; rather, they are causally indispensable for complex vision-language understanding tasks. They function as a highly concentrated informational bottleneck within the network architecture. In contrast, heads with lower importance scores contribute minimally to cross-modal feature interaction, indicating a significant degree of functional redundancy.

#### Emergent Sparsity in Multimodal Models

As illustrated in Figure [6(b)](https://arxiv.org/html/2606.05843#S4.F6.sf2), the Top 15 CoRe heads concentrate a remarkably high proportion of critical visual tokens, reaching 14.4% and 25.6% for the Llava and Qwen models, respectively, whereas the Bottom 15 heads capture a negligible fraction of approximately 2%. This contrast validates that the top-ranked CoRe heads function as highly efficient, localized hubs for cross-modal feature interaction, selectively distilling salient visual semantics. Figure [6(c)](https://arxiv.org/html/2606.05843#S4.F6.sf3) further reveals that the cumulative coverage trajectories for both models exhibit a rapid initial ascent followed by swift saturation. Notably, aggregating merely the top 50 to 100 heads is sufficient to encompass 70% to 80% of the critical visual tokens. This rapid saturation phenomenon provides compelling quantitative evidence for the structural sparsity of multimodal retrieval. It indicates that MLLMs avoid distributing critical visual processing uniformly; instead, they delegate core feature extraction mechanisms to a remarkably compact subset of elite CoRe heads.

Table 2: Performance comparison of different models across multiple tasks. Best results for each model are highlighted in bold. The subscript indicates the absolute difference compared to the corresponding Dense baseline. Overall performance is reported on the MLVU dev set, including the holistic LVU tasks (TR: Topic Reasoning), the single-detail LVU tasks (NQA: Needle QA, ER: Ego Reasoning, PQA: Plot QA), and the multi-detail LVU tasks (AO: Action Order, AC: Action Count).

## 5 System\-Level Acceleration via CoRe head\-Guided Sparsity

### 5\.1 Methodology and Experimental Setup

#### CoRe head\-Guided Hybrid Attention Paradigm

To empirically validate the structural sparsity of CoRe heads and their potential for accelerating inference, we implement a head\-level hybrid attention strategy\. Conventional MLLMs suffer from quadratic computational complexity during the prefill phase of long visual contexts\. Motivated by our finding that cross\-modal semantic integration is highly concentrated within a sparse subset of CoRe heads, we apply a deterministic attention allocation mask to mitigate this bottleneck\. Specifically, during the prefill stage, we inject a static head configuration into the attention forward pass\. For the top\-$k$ critical CoRe heads, we retain the standard Full Attention formulation to preserve their capacity for dense, global multimodal semantic extraction\. Conversely, for the remaining non\-essential vision heads, we fall back to a Stream Sparse Attention mechanism \[[24](https://arxiv.org/html/2606.05843#bib.bib27)\], restricting their computations strictly within localized sliding windows\. Detailed implementations and pseudocode are provided in Appendices [G\.1](https://arxiv.org/html/2606.05843#A7.SS1) and [G\.2](https://arxiv.org/html/2606.05843#A7.SS2)\.
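The head\-level allocation described above can be illustrated with boolean attention masks\. The following is a minimal NumPy sketch under stated assumptions: the names `streaming_mask` and `hybrid_head_masks`, the window size, and the number of sink tokens are all hypothetical, and the sparse pattern \(a causal sliding window plus optional always\-visible initial tokens, in the spirit of \[24\]\) is one reading of Stream Sparse Attention rather than the kernel described in the appendices\.

```python
import numpy as np

def causal_mask(seq_len):
    """Full causal attention: query i may attend to every key j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def streaming_mask(seq_len, window, num_sink=0):
    """Causal sliding window of width `window`, plus `num_sink` initial keys
    that stay visible to every later query (both values are assumptions)."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i
    local = (i - j) < window
    sink = j < num_sink
    return causal & (local | sink)

def hybrid_head_masks(num_heads, dense_heads, seq_len, window, num_sink=0):
    """Per-head masks: full causal for `dense_heads`, streaming for the rest."""
    sparse = streaming_mask(seq_len, window, num_sink)
    masks = np.broadcast_to(sparse, (num_heads, seq_len, seq_len)).copy()
    masks[np.asarray(list(dense_heads), dtype=int)] = causal_mask(seq_len)
    return masks

# Head 1 plays the role of a retained CoRe head; heads 0, 2, 3 are sparsified.
masks = hybrid_head_masks(num_heads=4, dense_heads=[1], seq_len=6, window=2, num_sink=1)
```

Dense boolean masks only show which query\-key pairs each head may use\. An actual prefill speedup requires an attention kernel that skips the masked entries instead of computing and discarding them\.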

#### Baselines and Configurations

We systematically ablate the proportion of attention heads retained for Full Attention computation\. Specifically, we vary the proportion of top\-ranked CoRe heads that are retained in a dense attention state, while all remaining heads are constrained to operate under Stream Sparse Attention\. Our evaluations are conducted across the Qwen3\-VL \(8B and 32B\) and Llava\-onevision architectures, with the standard, unmodified models \(utilizing 100% Full Attention\) serving as our Dense baselines\. To monitor the performance\-efficiency trade\-offs across different cognitive granularities, we evaluate the models on diverse subsets of the MLVU benchmark \[[31](https://arxiv.org/html/2606.05843#bib.bib28)\], which decouples multimodal comprehension into holistic tasks, single\-detail reasoning, and multi\-detail spatiotemporal perception\. Due to space constraints, the evaluation results on the VideoMME benchmark \[[18](https://arxiv.org/html/2606.05843#bib.bib1)\] are deferred to Appendix [G\.3](https://arxiv.org/html/2606.05843#A7.SS3); they exhibit a trend consistent with that observed on MLVU\.

### 5\.2 Results and Analysis

#### Efficacy of System\-Level Acceleration

Table [2](https://arxiv.org/html/2606.05843#S4.T2) presents the quantitative evaluation of our CoRe head\-guided hybrid attention paradigm across multiple model architectures\. The empirical results demonstrate that dense global attention is highly redundant for the majority of attention heads\. Across all evaluated models, our proposed structural sparsity mechanism consistently achieves significant prefill speedups\. Remarkably, under specific configurations such as retaining $19.1\%$ of the CoRe heads in Llava\-onevision and $24.4\%$ in Qwen3\-VL\-8B, this hybrid paradigm not only realizes a $1.8\times$ acceleration but also marginally outperforms the fully Dense baselines in overall average performance\. These findings substantiate our hypothesis that cross\-modal information retrieval is highly concentrated within a critical subset of CoRe heads; strictly pruning the receptive fields of non\-essential attention heads effectively circumvents computational bottlenecks without compromising the representational integrity of the model\. As shown in Figure [7](https://arxiv.org/html/2606.05843#S5.F7), the latency advantage becomes increasingly pronounced as sequence length grows, highlighting the scalability of our approach\.

![Refer to caption](https://arxiv.org/html/2606.05843v1/x10.png)

Figure 7: Our CoRe\-Guided Hybrid approach consistently achieves lower latency compared to the dense baseline \(Qwen3\-VL\-8B\), with the gap widening as sequence length increases, demonstrating better scalability for long sequences\. The inset highlights performance in the short\-sequence regime\.

#### Granular Impact on Multimodal Comprehension

A detailed analysis across tasks of varying cognitive granularities within the MLVU benchmark reveals task\-specific model behaviors under the sparse paradigm\. For holistic reasoning and broad detail extraction tasks, introducing the hybrid attention mechanism typically keeps the performance drop within 1% to 3%, even under aggressive sparsity settings \(e\.g\., retaining fewer than 5% of attention heads\)\. Conversely, for precise single\-detail reasoning \(ER\) and complex multi\-detail spatiotemporal perception tasks \(AO, AC\), the introduction of structural sparsity frequently yields performance improvements\. We attribute this counterintuitive phenomenon to an implicit regularization effect: restricting the computation of non\-essential attention heads to localized sliding windows filters out cross\-modal noise in long sequences, thereby enhancing the model’s spatiotemporal focus on precise visual cues\.

#### Robustness Across Model Scales

Furthermore, as model capacity increases, CoRe head\-guided attention exhibits improved robustness under sparsity constraints\. Comparing the 8B and 32B variants of Qwen3\-VL, we observe that larger models can sustain substantially higher levels of head sparsification with only minor performance degradation\. In particular, Qwen3\-VL\-32B remains effective even when dense computation is reduced to 4\.9% of attention heads, achieving a 2\.1× inference speedup with an average performance drop of only 0\.7 points\. These results suggest that scaling enhances redundancy in attention allocation, thereby allowing more aggressive yet stable removal of non\-critical heads\. Consequently, our deterministic head masking strategy enables controlled isolation of task\-relevant CoRe heads and provides a simple and effective mechanism for accelerating inference in large multimodal models without significant loss in performance\.

## 6 Conclusion

In this work, we investigate the mechanistic basis of cross\-modal information retrieval in Multimodal Large Language Models by identifying CoRe heads, a sparse subset of attention heads responsible for query\-relevant visual extraction\. Our analyses reveal a clear functional dichotomy: CoRe heads precisely localize relevant entities, while the vast majority of heads exhibit diffuse, global attention patterns\. Through causal interventions, we establish the non\-redundant necessity of these specific heads for robust multimodal reasoning\. Furthermore, our acceleration experiments validate the practical utility of this phenomenon, demonstrating that selectively preserving CoRe heads significantly expedites inference without compromising task performance\. While preliminary, we hope these findings provide insights into the mechanistic interpretability of multimodal models and inspire further work on controllable efficiency\.

## References

- \[1\] S\. Bai, Y\. Cai, R\. Chen, K\. Chen, X\. Chen, Z\. Cheng, L\. Deng, W\. Ding, C\. Gao, C\. Ge, W\. Ge, Z\. Guo, Q\. Huang, J\. Huang, F\. Huang, B\. Hui, S\. Jiang, Z\. Li, M\. Li, M\. Li, K\. Li, Z\. Lin, J\. Lin, X\. Liu, J\. Liu, C\. Liu, Y\. Liu, D\. Liu, S\. Liu, D\. Lu, R\. Luo, C\. Lv, R\. Men, L\. Meng, X\. Ren, X\. Ren, S\. Song, Y\. Sun, J\. Tang, J\. Tu, J\. Wan, P\. Wang, P\. Wang, Q\. Wang, Y\. Wang, T\. Xie, Y\. Xu, H\. Xu, J\. Xu, Z\. Yang, M\. Yang, J\. Yang, A\. Yang, B\. Yu, F\. Zhang, H\. Zhang, X\. Zhang, B\. Zheng, H\. Zhong, J\. Zhou, F\. Zhou, J\. Zhou, Y\. Zhu, and K\. Zhu \(2025\) Qwen3\-vl technical report\. arXiv preprint arXiv:2511\.21631\.
- \[2\] L\. Basile, V\. Maiorca, D\. Doimo, F\. Locatello, and A\. Cazzaniga \(2025\) Head pursuit: probing attention specialization in multimodal transformers\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems\. External Links: [Link](https://openreview.net/forum?id=WQ9rnkaUWm)\.
- \[3\] J\. Bi, J\. Guo, Y\. Tang, L\. B\. Wen, Z\. Liu, B\. Wang, and C\. Xu \(2025\) Unveiling visual perception in language models: an attention head analysis approach\. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp\. 4135–4144\.
- \[4\] L\. Che, Z\. Xue, Y\. Quan, B\. Liu, Z\. Shi, M\. Hurst, J\. Feldman, R\. Tang, R\. Krishna, and V\. Pavlovic \(2026\) Counting circuits: mechanistic interpretability of visual reasoning in large vision\-language models\. arXiv preprint arXiv:2603\.18523\.
- \[5\] R\. Chen, X\. Guo, K\. Liu, S\. Liang, S\. Liu, Q\. Zhang, L\. Wang, H\. Zhang, and X\. Cao \(2026\) Where mllms attend and what they rely on: explaining autoregressive token generation\. External Links: 2509\.22496, [Link](https://arxiv.org/abs/2509.22496)\.
- \[6\] K\. Dong, Y\. Chang, D\. G\. X\. Deik, D\. Li, R\. Tang, and Y\. Liu \(2025\) MMDocIR: benchmarking multimodal retrieval for long documents\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 30959–30993\.
- \[7\] S\. Kang, J\. Kim, J\. Kim, and S\. J\. Hwang \(2025\) See what you are told: visual attention sink in large multimodal models\. arXiv preprint arXiv:2503\.03321\.
- \[8\] S\. Kang, J\. Kim, J\. Kim, and S\. J\. Hwang \(2025\) Your large vision\-language model only needs a few attention heads for visual grounding\. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp\. 9339–9350\.
- \[9\] S\. Kazemzadeh, V\. Ordonez, M\. Matten, and T\. Berg \(2014\-10\) ReferItGame: referring to objects in photographs of natural scenes\. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), A\. Moschitti, B\. Pang, and W\. Daelemans \(Eds\.\), Doha, Qatar, pp\. 787–798\. External Links: [Link](https://aclanthology.org/D14-1086), [Document](https://dx.doi.org/10.3115/v1/D14-1086)\.
- \[10\] B\. Li, Y\. Zhang, D\. Guo, R\. Zhang, F\. Li, H\. Zhang, K\. Zhang, P\. Zhang, Y\. Li, Z\. Liu, and C\. Li \(2025\) LLaVA\-onevision: easy visual task transfer\. Transactions on Machine Learning Research\. External Links: ISSN 2835\-8856, [Link](https://openreview.net/forum?id=zKv8qULV6n)\.
- \[11\] J\. Li, D\. Li, S\. Savarese, and S\. Hoi \(2023\) Blip\-2: bootstrapping language\-image pre\-training with frozen image encoders and large language models\. In International conference on machine learning, pp\. 19730–19742\.
- \[12\] Q\. Li, Z\. Ye, X\. Feng, W\. Zhong, W\. Ma, and X\. Feng \(2026\) Causal tracing of object representations in large vision language models: mechanistic interpretability and hallucination mitigation\. In Proceedings of the AAAI Conference on Artificial Intelligence, pp\. 31645–31653\.
- \[13\] J\. Liang, R\. Chen, X\. Jiao, S\. Liang, S\. Liu, Q\. Zhang, Z\. Hu, and X\. Cao \(2025\) Explaining multimodal llms via intra\-modal token interactions\. arXiv preprint arXiv:2509\.22415\.
- \[14\] Q\. Z\. Lim, C\. P\. Lee, K\. M\. Lim, and K\. S\. M\. Anbananthen \(2025\) VLMT: vision\-language multimodal transformer for multimodal multi\-hop question answering\. ArXiv abs/2504\.08269\. External Links: [Link](https://api.semanticscholar.org/CorpusID:277741397)\.
- \[15\] H\. Liu, C\. Li, Q\. Wu, and Y\. J\. Lee \(2023\) Visual instruction tuning\. In Thirty\-seventh Conference on Neural Information Processing Systems\. External Links: [Link](https://openreview.net/forum?id=w0H2xGHlkw)\.
- \[16\] X\. Ma, S\. Yang, Y\. Jiang, S\. Liu, Z\. Liu, J\. Ao, X\. Ma, S\. M\. Erfani, and J\. Bailey \(2026\) Attention in space: functional roles of vlm heads for spatial reasoning\. arXiv preprint arXiv:2603\.20662\.
- \[17\] Y\. Ma, H\. Yang, L\. Z\. Wang, B\. Chen, W\. Xian, and J\. Teng \(2026\) DeAR: fine\-grained vlm adaptation by decomposing attention head roles\. arXiv preprint arXiv:2603\.01111\.
- \[18\] Y\. Tang, J\. Bi, S\. Xu, L\. Song, S\. Liang, T\. Wang, D\. Zhang, J\. An, J\. Lin, R\. Zhu, et al\. \(2025\) Video understanding with large language models: a survey\. IEEE Transactions on Circuits and Systems for Video Technology\.
- \[19\] J\. Wang, Z\. Liu, Y\. Rao, and J\. Lu \(2025\) Sparsemm: head sparsity emerges from visual concept responses in mllms\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 23177–23187\.
- \[20\] Q\. Wang, J\. Hu, and M\. Jiang \(2025\) V\-seam: visual semantic editing and attention modulating for causal interpretability of vision\-language models\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 17407–17431\.
- \[21\] W\. Wang, Z\. Gao, L\. Gu, H\. Pu, L\. Cui, X\. Wei, Z\. Liu, L\. Jing, S\. Ye, J\. Shao, et al\. \(2025\) InternVL3\.5: advancing open\-source multimodal models in versatility, reasoning, and efficiency\. arXiv preprint arXiv:2508\.18265\.
- \[22\] Z\. Wang, W\. Yu, X\. Ren, J\. Zhang, Y\. Zhao, R\. Saxena, L\. Cheng, G\. Wong, S\. See, P\. Minervini, Y\. Song, and M\. Steedman \(2025\) MMLongBench: benchmarking long\-context vision\-language models effectively and thoroughly\. In The 39th \(2025\) Annual Conference on Neural Information Processing Systems\. External Links: 2505\.10610, [Link](https://arxiv.org/abs/2505.10610)\.
- \[23\] W\. Wu, Y\. Wang, G\. Xiao, H\. Peng, and Y\. Fu \(2025\) RETRIEVAL head mechanistically explains long\-context factuality\. In 13th International Conference on Learning Representations, ICLR 2025, pp\. 33762–33775\.
- \[24\] G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2024\) Efficient streaming language models with attention sinks\. In The Twelfth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=NG7sS51zVF)\.
- \[25\] J\. Xie, P\. Pan, and X\. Zhang \(2026\) Head\-aware visual cropping: enhancing fine\-grained vqa with attention\-guided subimage\. arXiv preprint arXiv:2601\.22483\.
- \[26\] S\. Yin, C\. Fu, S\. Zhao, K\. Li, X\. Sun, T\. Xu, and E\. Chen \(2024\) A survey on multimodal large language models\. National Science Review 11\(12\), pp\. nwae403\.
- \[27\] W\. Zhang, F\. Yin, H\. Yen, D\. Chen, and X\. Ye \(2025\) Query\-focused retrieval heads improve long\-context reasoning and re\-ranking\. In Proceedings of EMNLP\.
- \[28\] Y\. Zhang, R\. Xie, X\. Sun, Y\. Huang, J\. Chen, Z\. Kang, D\. Wang, and Y\. Wang \(2025\) Dhcp: detecting hallucinations by cross\-modal attention pattern in large vision\-language models\. In Proceedings of the 33rd ACM International Conference on Multimedia, pp\. 3555–3564\.
- \[29\] Z\. Zhang, Z\. Zhao, Y\. Zhao, Q\. Wang, H\. Liu, and L\. Gao \(2020\) Where does it exist: spatio\-temporal video grounding for multi\-form sentences\. In CVPR\.
- \[30\] J\. Zhao, F\. Zhang, X\. Sun, C\. Feng, and Z\. Tan \(2025\) Tell model where to look: mitigating hallucinations in mllms by vision\-guided attention\. arXiv preprint arXiv:2511\.20032\.
- \[31\] J\. Zhou, Y\. Shu, B\. Zhao, B\. Wu, S\. Xiao, X\. Yang, Y\. Xiong, B\. Zhang, T\. Huang, and Z\. Liu \(2024\) MLVU: benchmarking multi\-task long video understanding\. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), pp\. 13691–13701\. External Links: [Link](https://api.semanticscholar.org/CorpusID:270286192)\.

## Appendix A Code Availability

## Appendix B Dataset Statistics

To evaluate the cross\-modal retrieval capabilities of CoRe heads across diverse scenarios, we select four datasets that cover static images, video sequences, dense documents, and long\-context multimodal inputs\.

#### RefCOCOg

RefCOCOg \[[9](https://arxiv.org/html/2606.05843#bib.bib23)\] is a large\-scale benchmark for complex referring expression comprehension in static images\. \(1\) Providing high\-precision visual anchors: Its exhaustive bounding box annotations establish a reliable baseline to precisely compute attention distributions from complex queries to target regions\. \(2\) Introducing static spatial robustness validation: The presence of similar distracting entities severely challenges target disambiguation, proving that CoRe heads maintain precise filtering in crowded static scenes\. \(3\) Encompassing highly complex semantic interactions: Queries with rich attributes and spatial relationships validate that these heads execute fine\-grained spatial reasoning rather than shallow noun matching\.

#### MMDocIR

MMDocIR \[[6](https://arxiv.org/html/2606.05843#bib.bib5)\] targets multimodal document information retrieval in highly dense visual environments\. \(1\) Providing high\-precision visual anchors: Annotations of specific layout regions help isolate core evidence from massive text and chart noise\. \(2\) Introducing dense layout robustness validation: The extreme visual clutter of long documents elevates the evaluation rigor, demonstrating the heads’ robust anti\-noise mechanisms\. \(3\) Encompassing highly complex semantic interactions: Structured queries requiring cross\-chart parsing verify that these heads possess deep semantic alignment capabilities for complex document topologies, moving beyond simple OCR\.

#### MMLongBench

MMLongBench \[[22](https://arxiv.org/html/2606.05843#bib.bib25)\] focuses on extreme long\-context multimodal reasoning across massive visual sequences\. \(1\) Providing high\-precision visual anchors: Annotations of sparsely distributed evidence enable the precise computation of attention distributions while stripping away extensive contextual noise\. \(2\) Introducing long\-range dependency robustness validation: Severe attention dilution in massive sequences rigorously proves that CoRe heads sustain stable, high\-recall retrieval mechanisms\. \(3\) Encompassing highly complex semantic interactions: Queries demanding multi\-hop reasoning confirm that these heads execute profound multi\-evidence cross\-modal routing rather than localized feature matching\.

#### VidSTG

VidSTG \[[29](https://arxiv.org/html/2606.05843#bib.bib24)\] is a large\-scale benchmark for video spatio\-temporal grounding, requiring dual localization within video streams\. \(1\) Providing high\-precision visual anchors: Its frame\-level bounding box annotations provide a robust baseline to isolate background noise and precisely compute query\-to\-target attention\. \(2\) Introducing dynamic robustness validation: The inherent spatio\-temporal dynamics of videos increase evaluation rigor, proving that CoRe heads maintain stable alignment amidst evolving visual inputs\. \(3\) Encompassing highly complex semantic interactions: Queries involving actions and relationships verify that these heads execute fine\-grained cross\-modal routing beyond simple entity matching\.

## Appendix C Token Mapping Rules

To measure the attention allocated to target visual regions by the CoRe heads, we map continuous spatial annotations to discrete 1D token indices within the language model sequence\.

### C\.1 Mapping Strategy for Spatial Visual Entities in RefCOCOg

For the RefCOCOg dataset, the ground\-truth visual entity $V^{*}$ is provided as a continuous bounding box\.

Let the original image $\mathcal{I}$ have a resolution of $W\times H$, and the target object be localized by a bounding box defined by its top\-left and bottom\-right coordinates: $B=[x_{min},y_{min},x_{max},y_{max}]$\. The vision encoder partitions the image into a 2D patch grid of dimensions $W_{grid}\times H_{grid}$\. We compute the scaling factors along the spatial axes as $s_{x}=\frac{W_{grid}}{W}$ and $s_{y}=\frac{H_{grid}}{H}$\. The continuous coordinates of $B$ are then projected onto the discrete token grid\. To ensure complete coverage of the target region without exceeding the image boundaries, the projected grid coordinates are defined as:

$$x^{\prime}_{min}=\max(0,\lfloor x_{min}\cdot s_{x}\rfloor),\quad y^{\prime}_{min}=\max(0,\lfloor y_{min}\cdot s_{y}\rfloor) \tag{2}$$

$$x^{\prime}_{max}=\min(W_{grid}-1,\lfloor x_{max}\cdot s_{x}\rfloor),\quad y^{\prime}_{max}=\min(H_{grid}-1,\lfloor y_{max}\cdot s_{y}\rfloor) \tag{3}$$

These 2D visual patches are then flattened into a 1D sequence in row\-major order\. For any target patch located at grid coordinates $(x^{\prime},y^{\prime})$ where $x^{\prime}\in[x^{\prime}_{min},x^{\prime}_{max}]$ and $y^{\prime}\in[y^{\prime}_{min},y^{\prime}_{max}]$, its relative 1D index within the visual sequence is:

$$i_{rel}=y^{\prime}\cdot W_{grid}+x^{\prime} \tag{4}$$

Multimodal LLMs like Qwen\-VL use a vision\-language adapter to compress the visual sequence, often merging multiple adjacent patches into a single visual token\. Let $c$ denote this downsampling factor \(e\.g\., $c=4$ for a $2\times 2$ pooling mechanism\)\. The effective relative index of the visual token becomes $\lfloor\frac{i_{rel}}{c}\rfloor$\. Assuming the visual tokens are injected into the full `input_ids` sequence starting at an absolute positional offset $\mathcal{O}_{vis}$, the final set of target visual token indices $V^{*}$ corresponding to the bounding box $B$ is:

$$V^{*}=\left\{\mathcal{O}_{vis}+\left\lfloor\frac{y^{\prime}\cdot W_{grid}+x^{\prime}}{c}\right\rfloor\;\middle|\;x^{\prime}\in[x^{\prime}_{min},x^{\prime}_{max}],\;y^{\prime}\in[y^{\prime}_{min},y^{\prime}_{max}]\right\} \tag{5}$$
This mapping ensures that the Retrieval Attention Mass \(RAM\) metric aggregates attention weights only over tokens representing the target entity, minimizing background noise and reflecting the spatial retrieval capability of the CoRe heads\.
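The mapping in Equations \(2\)–\(5\) can be sketched in a few lines of Python\. This is an illustrative transcription rather than the authors' code: the function name `bbox_to_token_indices` and its argument layout are hypothetical, and the grid size, downsampling factor `c`, and offset are assumed to be known for the model at hand\.

```python
import math

def bbox_to_token_indices(bbox, image_size, grid_size, c=1, offset=0):
    """Map a pixel-space bounding box to absolute visual-token indices.

    bbox       = (x_min, y_min, x_max, y_max) in pixels
    image_size = (W, H); grid_size = (W_grid, H_grid)
    c          = adapter downsampling factor; offset = O_vis
    """
    x_min, y_min, x_max, y_max = bbox
    W, H = image_size
    W_grid, H_grid = grid_size
    # x * W_grid / W equals x * s_x; multiplying first limits rounding error.
    gx_min = max(0, math.floor(x_min * W_grid / W))
    gy_min = max(0, math.floor(y_min * H_grid / H))
    gx_max = min(W_grid - 1, math.floor(x_max * W_grid / W))
    gy_max = min(H_grid - 1, math.floor(y_max * H_grid / H))
    indices = {
        offset + (gy * W_grid + gx) // c  # row-major index, then pooling
        for gy in range(gy_min, gy_max + 1)
        for gx in range(gx_min, gx_max + 1)
    }
    return sorted(indices)

tokens = bbox_to_token_indices((20, 30, 45, 50), (100, 100), (10, 10), c=1, offset=5)
```

Because several patches can collapse onto the same pooled token when `c > 1`, the indices are collected in a set before sorting\.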

### C\.2 Mapping Strategies for Structural Visual Entities in MMDocIR

Unlike static single\-image tasks, the MMDocIR dataset presents an extreme long\-context challenge characterized by highly dense, interleaved multimodal sequences \(e\.g\., dozens of document pages containing interspersed textual paragraphs, figures, and tables\)\. Consequently, traditional spatial coordinate projection is inadequate\. To precisely isolate the token indices of the ground\-truth visual entity $V^{*}$ \(e\.g\., a specific target figure\) within massive sequences \(often exceeding 100K tokens\), our mapping protocols must be explicitly tailored to the MLLM’s underlying visual encoding architecture\. We introduce three distinct strategies: the Boundary\-Tagging Protocol for adapter\-based models, Uniform Sequence Slicing for fixed\-expansion models, and Dynamic Cumulative Slicing for variable\-resolution models\.

#### Boundary\-Tagging Protocol \(e\.g\., the Qwen\-VL Paradigm\)

For models that dynamically compress visual sequences using Vision\-Language Adapters, we structurally intervene in the context construction\. Let the entire multimodal document context be represented as an interleaved sequence of elements $C=\{e_{1},e_{2},\dots,e_{n}\}$, where each element $e_{i}$ can be either a text block or a visual block\. Given a complex query $Q$, assume the oracle evidence is located at a specific visual element $e^{*}\in C$\.

1\. Context Tagging and Reconstruction: During the preprocessing phase, rather than altering the native tokenization alignment, we introduce two auxiliary boundary markers, denoted as $\mathcal{T}_{start}$ and $\mathcal{T}_{end}$ \(corresponding to `START_IDS` and `END_IDS`\)\. The original context $C$ is reconstructed into a tagged sequence $C_{tag}$, where the target element $e^{*}$ is explicitly enveloped:

$$C_{tag}=\{e_{1},\dots,\mathcal{T}_{start},e^{*},\mathcal{T}_{end},\dots,e_{n}\} \tag{6}$$

2\. Tokenization and Target Extraction: The tagged sequence $C_{tag}$ is processed by the tokenizer to generate a discrete 1D sequence $S=[s_{1},s_{2},\dots,s_{L}]$\. By linearly scanning $S$, we identify the absolute sequence indices of the boundary markers:

$$idx_{start}=\arg\max_{j}(s_{j}=\mathcal{T}_{start}),\quad idx_{end}=\arg\max_{j}(s_{j}=\mathcal{T}_{end}) \tag{7}$$

The final set of target visual token indices $V^{*}$ corresponding to $e^{*}$ is extracted as the enclosed sequence:

$$V^{*}=\{j\mid idx_{start}<j<idx_{end}\} \tag{8}$$
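The extraction step in Equations \(7\) and \(8\) amounts to locating the two markers and returning the positions strictly between them\. A minimal sketch follows, assuming each marker is tokenized to a single ID that occurs exactly once; the function name `tagged_span` and the marker IDs 100 and 101 are hypothetical\.

```python
def tagged_span(token_ids, start_id, end_id):
    """Return the indices strictly between the start and end markers.

    Assumes each marker maps to one token ID and appears once; `list.index`
    raises ValueError if a marker is missing.
    """
    ids = list(token_ids)
    idx_start = ids.index(start_id)
    idx_end = ids.index(end_id, idx_start + 1)  # end marker must follow start
    return list(range(idx_start + 1, idx_end))

# 100 and 101 stand in for the start/end marker IDs; 7 is a visual token.
target = tagged_span([5, 9, 100, 7, 7, 7, 101, 3], start_id=100, end_id=101)
```

If a tokenizer splits a marker into several tokens, the scan would need to match the full ID subsequence instead of a single ID\.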

#### Uniform Sequence Slicing \(e\.g\., the LLaVA Paradigm\)

Conversely, models like LLaVA circumvent complex spatial downsampling by deterministically expanding visual inputs into fixed\-length token sequences marked by specific placeholders \(e\.g\., the `<image>` token\)\. For these architectures, tagging is unnecessary; instead, we rely on exact deterministic slicing\.

1\. Global Visual Token Identification: Let the ordered set of all visual inputs \(e\.g\., document pages\) in a sample be $\mathcal{I}=\{I_{0},I_{1},\dots,I_{N-1}\}$\. We execute a global scan across the complete `input_ids` sequence to locate the absolute sequence indices of all visual tokens\. Let this ordered array be $\mathcal{P}_{all}=[p_{0},p_{1},\dots,p_{M-1}]$, where $M$ is the total number of visual tokens\.

2\. Target Sequence Slicing: Since the vision processor encodes each image into a fixed\-length continuous chunk, the number of visual tokens allocated per image is derived as $P=\frac{M}{N}$\. If the ground\-truth evidence is located in a specific subset of images whose index set is $\mathcal{K}_{gt}\subset\{0,1,\dots,N-1\}$, the target token indices $V^{*}$ are extracted through exact array slicing:

$$V^{*}=\bigcup_{k\in\mathcal{K}_{gt}}\left\{\mathcal{P}_{all}[j]\;\middle|\;j\in[k\cdot P,(k+1)\cdot P-1]\right\} \tag{9}$$
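Equation \(9\) translates directly into list slicing\. The sketch below is illustrative only: the function name `uniform_slice` is hypothetical, and the placeholder ID 32 in the usage line is an arbitrary stand\-in for the model's image token ID\.

```python
def uniform_slice(input_ids, image_token_id, num_images, gt_images):
    """Collect the positions of visual tokens belonging to the target images,
    assuming every image expands to the same number of tokens."""
    # P_all: absolute positions of all visual tokens, in sequence order.
    positions = [i for i, tok in enumerate(input_ids) if tok == image_token_id]
    if num_images <= 0 or len(positions) % num_images != 0:
        raise ValueError("visual tokens do not divide evenly across images")
    per_image = len(positions) // num_images  # P = M / N
    target = []
    for k in sorted(gt_images):
        target.extend(positions[k * per_image:(k + 1) * per_image])
    return target

# Three images of two tokens each, separated by text tokens.
target = uniform_slice([1, 32, 32, 2, 32, 32, 3, 32, 32], 32, num_images=3, gt_images={0, 2})
```

The divisibility check guards the fixed\-length assumption; it fails loudly for models whose images expand to different token counts\.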

#### Dynamic Cumulative Slicing \(e\.g\., the InternVL Paradigm\)

For advanced architectures like InternVL that employ dynamic resolution preprocessing, images are adaptively partitioned into a variable number of patches based on their native aspect ratios\. Consequently, the uniform slicing assumption \($P=M/N$\) no longer holds\.

1\. Dynamic Patch Allocation Tracking: During the dynamic preprocessing phase, each image $I_{i}\in\mathcal{I}$ is partitioned into $c_{i}$ visual blocks\. Let $T_{block}$ denote the fixed token length per block \(e\.g\., 256 tokens\)\. The specific number of visual tokens allocated for image $I_{i}$ is $P_{i}=c_{i}\times T_{block}$\. We maintain an ordered array of these dynamic token lengths: $\mathcal{V}_{counts}=[P_{0},P_{1},\dots,P_{N-1}]$\.

2\. Cumulative Offset Alignment: To precisely isolate the tokens for a target ground\-truth image $k\in\mathcal{K}_{gt}$, we must compute the cumulative offset of all preceding visual tokens within the global visual token array $\mathcal{P}_{all}$\. The start offset index is defined as $O_{k}=\sum_{i=0}^{k-1}P_{i}$ \(where $O_{0}=0$\)\. The exact subset of target token indices $V^{*}$ is then extracted using this dynamic offset:

$$V^{*}=\bigcup_{k\in\mathcal{K}_{gt}}\left\{\mathcal{P}_{all}[j]\;\middle|\;j\in\left[O_{k},O_{k}+P_{k}-1\right]\right\} \tag{10}$$
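The cumulative offsets $O_{k}$ in Equation \(10\) are a prefix sum over the per\-image token counts\. A minimal sketch, with the hypothetical function name `dynamic_slice` and with the per\-image block counts assumed to be recorded during preprocessing:

```python
from itertools import accumulate

def dynamic_slice(visual_positions, blocks_per_image, tokens_per_block, gt_images):
    """Slice the global visual-token array when images use varying token counts.

    visual_positions : P_all, absolute positions of all visual tokens
    blocks_per_image : c_i for each image; tokens_per_block : T_block
    """
    counts = [c * tokens_per_block for c in blocks_per_image]  # P_i
    if sum(counts) != len(visual_positions):
        raise ValueError("block counts do not match the number of visual tokens")
    offsets = [0] + list(accumulate(counts))  # offsets[k] = O_k
    target = []
    for k in sorted(gt_images):
        target.extend(visual_positions[offsets[k]:offsets[k] + counts[k]])
    return target

# Three images with 1, 3, and 2 blocks of 2 tokens each.
target = dynamic_slice(list(range(10, 22)), [1, 3, 2], tokens_per_block=2, gt_images=[1])
```

With equal block counts for every image this reduces to the uniform slicing case\.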

#### Unified Attention Masking and Verification\.

Once the architecture\-specific subset $V^{*}$ is accurately obtained using tagging, uniform slicing, or dynamic cumulative slicing, we utilize a dynamic key\-value cache mechanism during the model’s forward pass to capture the attention probability matrix\. By slicing the global attention matrix exclusively at the indices belonging to $V^{*}$, we ensure that the Retrieval Attention Mass \(RAM\) metric precisely reflects the attention allocated to the target figure\. These strategies prevent index shifting, so that our quantitative probing remains robust to the extreme visual clutter and structural noise pervasive in long\-document contexts\.
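The slicing step can be illustrated as summing each head's attention probabilities over the columns in $V^{*}$\. The sketch below is a simplified stand\-in, not the paper's RAM definition \(which is given in the main text and may aggregate differently\): the function name `attention_mass_on_targets`, the choice of the last query position as default, and the averaging over query positions are all assumptions\.

```python
import numpy as np

def attention_mass_on_targets(attn, target_indices, query_positions=None):
    """Per-head share of attention that lands on the target key positions.

    attn : array of shape (num_heads, q_len, k_len), each row summing to 1
    target_indices : key positions in V*
    query_positions : query rows to average over (default: the last one)
    """
    attn = np.asarray(attn, dtype=float)
    if query_positions is None:
        query_positions = [attn.shape[1] - 1]
    rows = attn[:, np.asarray(query_positions, dtype=int), :]               # (H, Q, K)
    mass = rows[:, :, np.asarray(target_indices, dtype=int)].sum(axis=-1)   # (H, Q)
    return mass.mean(axis=-1)                                               # (H,)

attn = np.array([[[0.1, 0.2, 0.3, 0.4]],
                 [[0.7, 0.1, 0.1, 0.1]]])
per_head = attention_mass_on_targets(attn, target_indices=[2, 3])
```

Heads can then be ranked by this per\-head value, which is the role the RAM score plays when CoRe heads are selected\.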

### C\.3 Mapping Strategies for Page\-Level Evidence in MMLongBench

Unlike fine\-grained spatial grounding or interleaved layout retrieval, MMLongBench evaluates the "needle\-in\-a\-haystack" retrieval capabilities of MLLMs across extensive document collections\. The input consists of a long sequence of high\-resolution, full\-page images \(e\.g\., multi\-page PDF documents\) concatenated with complex user queries\. The primary challenge here shifts from spatial localization to page\-level sequential isolation, as the context length frequently exceeds 100K tokens\. To accurately compute the Retrieval Attention Mass \(RAM\) directed at specific ground\-truth pages, we implement architecture\-specific mapping protocols tailored to the underlying visual tokenization mechanisms: the Page\-Level Boundary\-Tagging Protocol for adapter\-based models and Uniform Page\-Level Slicing for token\-expansion models\.

Let the long document be represented as a chronologically ordered sequence of full-page visual elements $\mathcal{I} = \{I_0, I_1, \dots, I_{K-1}\}$, where $K$ denotes the total page capacity (e.g., $K = 20$). Based on the natural language query $Q$, the oracle visual evidence is localized within a specific subset of target pages $\mathcal{E}^{*} \subset \{0, 1, \dots, K-1\}$.

#### Page\-Level Boundary\-Tagging Protocol \(e\.g\., the Qwen\-VL Paradigm\)\.

For models that employ complex vision\-language pooling, we structurally intervene during the multi\-turn message construction to prevent catastrophic token shifting over extreme sequence lengths\.

1\. Dynamic Prompt Tagging: We inject two specialized text tokens, $\mathcal{T}_{start}$ (`<GT_START>`) and $\mathcal{T}_{end}$ (`<GT_END>`), acting as explicit deterministic boundaries. The visual sequence is reconstructed such that any target evidence page $I_e$ ($e \in \mathcal{E}^{*}$) is tightly enveloped:

$$\mathcal{M}_{input} = \Big[\dots, I_{e-1},\; \mathcal{T}_{start},\; I_e,\; \mathcal{T}_{end},\; I_{e+1}, \dots\Big] \oplus Q \tag{11}$$

2\. Global Sequence Scanning: The continuous page images are flattened into a 1D discrete token array $S = [s_1, s_2, \dots, s_L]$. We execute a linear scan across $S$ to capture the absolute index boundaries for each evidence page:

$$idx_{start}^{(e)} = \arg\max_{j}\,(s_j = \mathcal{T}_{start}), \quad idx_{end}^{(e)} = \arg\max_{j}\,(s_j = \mathcal{T}_{end}) \tag{12}$$

The set of target visual token indices is defined as the union of all enclosed sequence blocks:

$$V^{*} = \bigcup_{e \in \mathcal{E}^{*}} \left\{ j \mid idx_{start}^{(e)} < j < idx_{end}^{(e)} \right\} \tag{13}$$
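The scan in Eqs. (12) and (13) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `tagged_span_indices` and the tag ids `900`/`901` are hypothetical, and it assumes each boundary tag maps to a single token id (a real tokenizer may split `<GT_START>` into several ids, in which case a sub-sequence match would be needed).

```python
from typing import List, Optional, Set


def tagged_span_indices(token_ids: List[int], start_id: int, end_id: int) -> Set[int]:
    """Return every index j lying strictly between a start tag and its end tag."""
    targets: Set[int] = set()
    open_at: Optional[int] = None  # position of the most recent unmatched start tag
    for j, tok in enumerate(token_ids):
        if tok == start_id:
            open_at = j
        elif tok == end_id and open_at is not None:
            # Strict inequalities of Eq. (13): idx_start < j < idx_end
            targets.update(range(open_at + 1, j))
            open_at = None
    return targets


# Hypothetical ids: 900 = start tag, 901 = end tag, 7 = visual token, 5 = other text
example = [5, 900, 7, 7, 901, 5, 900, 7, 901, 5]
target_indices = tagged_span_indices(example, start_id=900, end_id=901)
```

Pairing each start tag with the next end tag lets one linear pass handle several evidence pages, and the tags themselves are excluded from the target set.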

#### Uniform Page\-Level Slicing \(e\.g\., the LLaVA Paradigm\)\.

For token-expansion models like LLaVA, structural tagging is unnecessary. Instead, these models allocate a fixed, deterministic number of tokens for each input image placeholder, allowing for precise mathematical slicing based on the model's native `image_token_id`.

1\. Global Vision Token Extraction: By scanning the complete sequence of input IDs, we locate the absolute sequence positions of all visual tokens, forming an ordered array $\mathcal{P}_{all} = [p_0, p_1, \dots, p_{M-1}]$, where $M$ is the total count of visual tokens in the entire document.

2\. Deterministic Index Slicing: Given the uniform expansion property, the fixed number of tokens allocated per document page is derived as $T_{page} = \frac{M}{K}$. To isolate the exact token indices representing the target multi-page evidence, we project the page indices $e \in \mathcal{E}^{*}$ onto the global vision token array:

$$V^{*} = \bigcup_{e \in \mathcal{E}^{*}} \left\{ \mathcal{P}_{all}[j] \;\middle|\; j \in \left[e \cdot T_{page},\, (e+1) \cdot T_{page} - 1\right] \right\} \tag{14}$$
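As a hedged illustration of Eq. (14), the sketch below slices a flat list of input ids. The function name `uniform_page_slice` and the toy token ids are assumptions made for this example; the only behavior taken from the text is that every page expands to the same number of visual tokens, $T_{page} = M/K$.

```python
from typing import List, Sequence


def uniform_page_slice(
    input_ids: Sequence[int],
    image_token_id: int,
    num_pages: int,
    evidence_pages: Sequence[int],
) -> List[int]:
    """Map page indices to absolute sequence positions of their visual tokens."""
    # P_all: absolute positions of all visual tokens, in order
    visual_positions = [pos for pos, tok in enumerate(input_ids) if tok == image_token_id]
    total = len(visual_positions)
    if num_pages <= 0 or total % num_pages != 0:
        raise ValueError("visual token count is not a multiple of the page count")
    per_page = total // num_pages  # T_page = M / K
    targets: List[int] = []
    for e in sorted(set(evidence_pages)):
        # Python slice [e*T, (e+1)*T) equals the closed range [e*T, (e+1)*T - 1]
        targets.extend(visual_positions[e * per_page:(e + 1) * per_page])
    return targets


# Toy sequence: 9 is the (hypothetical) image token id, three pages of two tokens each
ids = [1, 9, 9, 2, 9, 9, 9, 9, 3]
page_one = uniform_page_slice(ids, image_token_id=9, num_pages=3, evidence_pages=[1])
```

The divisibility check surfaces a violated uniformity assumption early, which is the case the cumulative-offset protocol is designed for.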

#### Unified Target Verification\.

By extracting the architecture-specific subset $V^{*}$ and dynamically masking the global attention probability matrix during the model's forward pass, we isolate the query-to-evidence attention flows. Because these page-level tracking paradigms are deterministic, our quantitative probing is unaffected by the macro-level structural noise generated by dozens of irrelevant document pages, which supports the evaluation of the CoRe heads' capacity for long-range cross-modal routing.

### C.4 Mapping Strategies for Temporal Visual Entities in VidSTG

Unlike static spatial grounding or document layout analysis, the VidSTG benchmark evaluates the spatio\-temporal retrieval capabilities of MLLMs in dynamic, unconstrained video streams\. The input comprises a lengthy chronological sequence of sampled video frames, seamlessly concatenated with natural language queries\. As the temporal context expands \(e\.g\., up to 128 frames per video\), the model encounters massive temporal background noise\. To accurately compute the Retrieval Attention Mass \(RAM\) directed exclusively at the ground\-truth action or object frames, we adapt our architecture\-specific mapping protocols to the temporal domain, transitioning from page\-level extraction to fine\-grained temporal sequence isolation\.

Let the input video be represented as a chronologically ordered set of sampled frames $\mathcal{V} = \{F_0, F_1, \dots, F_{N-1}\}$, where $N$ denotes the total number of extracted frames. Based on the temporal annotations, the oracle visual evidence corresponds to a specific continuous or discrete subset of target frames, whose index set is defined as $\mathcal{K}_{gt} \subset \{0, 1, \dots, N-1\}$.

#### Temporal Boundary\-Tagging Protocol \(e\.g\., the Qwen\-VL Paradigm\)\.

For models employing Vision\-Language Adapters that dynamically compress temporal visual tokens, we intervene at the text\-prompt construction phase to prevent temporal index shifting\.

1\. Dynamic Frame Tagging: We inject explicit temporal boundaries, $\mathcal{T}_{start}$ and $\mathcal{T}_{end}$ (corresponding to `START_IDS` and `END_IDS`), surrounding the specific target frames. The temporal context is reconstructed such that any target evidence frame $F_k$ ($k \in \mathcal{K}_{gt}$) is tightly enveloped:

$$\mathcal{M}_{input} = \Big[\dots, F_{k-1},\; \mathcal{T}_{start},\; F_k,\; \mathcal{T}_{end},\; F_{k+1}, \dots\Big] \oplus Q \tag{15}$$

2\. Global Sequence Scanning: The entire video-text context is flattened into a 1D discrete token array $S = [s_1, s_2, \dots, s_L]$. A linear scan captures the absolute index boundaries for each tagged target frame:

$$idx_{start}^{(k)} = \arg\max_{j}\,(s_j = \mathcal{T}_{start}), \quad idx_{end}^{(k)} = \arg\max_{j}\,(s_j = \mathcal{T}_{end}) \tag{16}$$

The target temporal token indices $V^{*}$ are defined as the union of these enclosed sequence blocks:

$$V^{*} = \bigcup_{k \in \mathcal{K}_{gt}} \left\{ j \mid idx_{start}^{(k)} < j < idx_{end}^{(k)} \right\} \tag{17}$$

#### Uniform Temporal Slicing \(e\.g\., the LLaVA Paradigm\)\.

For token\-expansion models like LLaVA\-OneVision, which allocate a fixed, deterministic number of tokens for each video frame placeholder, temporal structural tagging is omitted in favor of exact mathematical array slicing\.

1\. Global Frame Token Extraction: By scanning the complete `input_ids` sequence, we locate the absolute positions of all visual tokens, forming an ordered array $\mathcal{P}_{all} = [p_0, p_1, \dots, p_{M-1}]$, where $M$ is the total token count representing the entire video.

2\. Deterministic Temporal Slicing: Given the uniform expansion property, the fixed token length allocated per frame is exactly $T_{frame} = \frac{M}{N}$. To strictly isolate the tokens representing the target temporal tubes, we project the target frame indices $k \in \mathcal{K}_{gt}$ onto the global array:

$$V^{*} = \bigcup_{k \in \mathcal{K}_{gt}} \left\{ \mathcal{P}_{all}[j] \;\middle|\; j \in \left[k \cdot T_{frame},\, (k+1) \cdot T_{frame} - 1\right] \right\} \tag{18}$$

#### Dynamic Cumulative Temporal Slicing \(e\.g\., the InternVL Paradigm\)\.

For advanced architectures like InternVL that enforce dynamic resolution preprocessing, individual frames may be partitioned into a variable number of patches depending on their motion blur or native aspect ratios. Therefore, the uniform assumption ($M/N$) is violated, necessitating a dynamic cumulative tracking mechanism.

1\. Dynamic Patch Allocation Tracking: During video preprocessing, each frame $F_i$ is adaptively partitioned into $c_i$ visual blocks. Assuming a fixed token length per block $T_{block}$, the total tokens allocated for frame $F_i$ is $P_i = c_i \times T_{block}$. We maintain an ordered tracking array of these dynamic lengths across the temporal axis: $\mathcal{V}_{counts} = [P_0, P_1, \dots, P_{N-1}]$.

2\. Cumulative Offset Alignment: To precisely isolate the tokens for a target temporal frame $k \in \mathcal{K}_{gt}$, we calculate the cumulative temporal offset of all preceding frames. The start offset index is formulated as $O_k = \sum_{i=0}^{k-1} P_i$ (where $O_0 = 0$). The exact target token indices $V^{*}$ are extracted via this dynamically computed sliding window:

$$V^{*} = \bigcup_{k \in \mathcal{K}_{gt}} \left\{ \mathcal{P}_{all}[j] \;\middle|\; j \in \left[O_k,\, O_k + P_k - 1\right] \right\} \tag{19}$$
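The cumulative offset alignment can be sketched in a few lines of Python. This is an illustrative sketch under assumptions: the function name `cumulative_offset_slice` is hypothetical, and the per-frame token counts $P_i$ are taken as given rather than derived from a real preprocessor.

```python
from itertools import accumulate
from typing import List, Sequence


def cumulative_offset_slice(
    visual_positions: Sequence[int],
    tokens_per_frame: Sequence[int],
    target_frames: Sequence[int],
) -> List[int]:
    """Select P_all[O_k : O_k + P_k] for every target frame k."""
    if sum(tokens_per_frame) != len(visual_positions):
        raise ValueError("per-frame counts do not add up to the visual token count")
    # offsets[k] = O_k = P_0 + ... + P_{k-1}, with O_0 = 0
    offsets = [0] + list(accumulate(tokens_per_frame))
    targets: List[int] = []
    for k in sorted(set(target_frames)):
        start = offsets[k]
        targets.extend(visual_positions[start:start + tokens_per_frame[k]])
    return targets


# Toy example: 12 visual tokens at sequence positions 10..21, frames of 2, 6 and 4 tokens
positions = list(range(10, 22))
middle_frame = cumulative_offset_slice(positions, [2, 6, 4], target_frames=[1])
```

With equal per-frame counts this reduces to the uniform slicing case, so the same routine could serve both paradigms.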

#### Unified Temporal Verification\.

By using these architecture-aware mapping protocols to extract the precise subset $V^{*}$, we dynamically slice the global attention probability matrix during inference. This ensures that our RAM formulation evaluates only the attention routed to the query-relevant spatio-temporal segments. These methodologies neutralize the temporal noise introduced by dozens of irrelevant background frames, thereby supporting the evaluation of the CoRe heads' temporal reasoning robustness.

## Appendix D CoRe Heads: Detection Mechanisms and Model Configuration

### D.1 Model Architectures and Hyperparameters

To ensure that our findings regarding CoRe heads reflect intrinsic structural properties rather than artifacts of specific designs, we evaluate three representative MLLM families encompassing diverse vision-language integration paradigms: adapter-based compression, deterministic token expansion, and dynamic high-resolution preprocessing. The architectural distinctions are summarized in Table [3](https://arxiv.org/html/2606.05843#A4.T3).

Table 3: Overview of Evaluated MLLMs and their Visual Processing Paradigms

### D.2 Unified Detection Framework for CoRe Heads

Building upon the architecture\-specific mapping protocols detailed in previous sections, we abstract our empirical probing methodology into a unified, computationally efficient algorithm\. The core objective is to quantitatively extract the Retrieval Attention Mass \(RAM\) allocated to the ground\-truth visual entities across any combination of MLLM architectures and heterogeneous datasets \(ranging from static bounding boxes to extreme long\-context video tubes\)\.

Algorithm [1](https://arxiv.org/html/2606.05843#alg1) presents the complete detection framework. To circumvent the prohibitive memory overhead of full-sequence generation and backpropagation, we implement a customized Dynamic Key-Value Cache mechanism. By running a single forward pass without computing language modeling logits, we hook directly into the intermediate transformer layers. Furthermore, the algorithm is designed to handle modern architectural variants, such as Grouped Query Attention (GQA), by explicitly repeating the Key states prior to the inner product. Ultimately, the algorithm computes a normalized attention map $\mathbf{M}_{RAM} \in \mathbb{R}^{L \times H}$, identifying the precise topological location of CoRe heads.

Algorithm 1: Unified Extraction Framework for Retrieval Attention Mass (RAM)

Input: Multimodal context $C$, Query $Q$, Ground-truth visual entity $E^{*}$, Model $M$

Output: Layer-head attention allocation map $\mathbf{M}_{RAM} \in \mathbb{R}^{L \times H}$

1: Phase 1: Architecture-Aware Token Mapping

2: $S \leftarrow \text{UnifiedTokenizer}(C \oplus Q)$

3: $idx_{Q} \leftarrow \text{FindTokenIndices}(S, Q)$ {Locate query tokens}

4: if $M$ uses Adapter-based Compression (e.g., Qwen-VL) then

5: if Context is dense sequence then

6: $V^{*} \leftarrow \text{BoundaryTagging}(S, E^{*}, \mathcal{T}_{start}, \mathcal{T}_{end})$

7: else

8: $V^{*} \leftarrow \text{SpatialProjection}(E^{*}, \text{downsample\_ratio} = c)$

9: end if

10: else if $M$ uses Deterministic Expansion (e.g., LLaVA) then

11: $V^{*} \leftarrow \text{UniformSlicing}(S, E^{*}, T_{page/frame})$

12: else if $M$ uses Dynamic High-Res Preprocessing (e.g., InternVL) then

13: $\mathcal{V}_{counts} \leftarrow \text{TrackPatchAllocations}(C)$

14: $V^{*} \leftarrow \text{CumulativeOffsetSlicing}(S, E^{*}, \mathcal{V}_{counts})$

15: end if

16: Phase 2: Custom Cache and Forward Pass

17: Initialize $\mathcal{C}_{kv} \leftarrow \text{DynamicCacheWithQuery}(idx_{Q})$

18: $\text{ModelForward}(M, \text{input} = S, \text{past\_key\_values} = \mathcal{C}_{kv}, \text{compute\_logits} = \text{False})$

19: Phase 3: RAM Computation across Layers and Heads

20: Initialize $\mathbf{M}_{RAM}$ as an $L \times H$ zero matrix

21: for each layer $l \in \{1, 2, \dots, L\}$ do

22: $Q^{(l)}, K^{(l)} \leftarrow \mathcal{C}_{kv}.\text{get\_layer}(l)$

23: $K^{(l)}_{GQA} \leftarrow \text{RepeatKV}(K^{(l)}, \text{num\_kv\_groups})$ {Align dimensions for GQA}

24: $\mathbf{A}^{(l)} \leftarrow \text{Softmax}\left(\frac{Q^{(l)} \left(K^{(l)}_{GQA}\right)^{T}}{\sqrt{d_{head}}}\right)$ {Global attention probability}

25: for each head $h \in \{1, 2, \dots, H\}$ do

26: $\mathbf{M}_{RAM}[l, h] \leftarrow \sum_{j \in V^{*}} \mathbf{A}^{(l,h)}_{:,j}$ {Aggregate mass over target tokens}

27: end for

28: end for

29: return $\mathbf{M}_{RAM}$
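Phase 3 of Algorithm 1 can be sketched for a single layer with NumPy. This is a minimal illustration, not the authors' code, and it rests on several assumptions: the names `repeat_kv` and `ram_for_layer` are hypothetical; the mass over $V^{*}$ is averaged over query rows (Algorithm 1 writes $\mathbf{A}_{:,j}$ without stating the reduction); and a causal mask is derived from a hypothetical `query_positions` argument.

```python
import numpy as np


def repeat_kv(k: np.ndarray, n_rep: int) -> np.ndarray:
    """(H_kv, S, d) -> (H_kv * n_rep, S, d); each KV head is repeated consecutively."""
    return np.repeat(k, n_rep, axis=0)


def ram_for_layer(q, k, query_positions, target_idx):
    """q: (H, Nq, d) query-token states; k: (H_kv, S, d) keys. Returns RAM per head, shape (H,)."""
    n_heads, _, d_head = q.shape
    k_full = repeat_kv(k, n_heads // k.shape[0])               # align heads for GQA
    scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)   # (H, Nq, S)
    # Assumed causal mask: a query at position p cannot see keys at positions > p
    key_pos = np.arange(k.shape[1])
    future = key_pos[None, :] > np.asarray(query_positions)[:, None]  # (Nq, S)
    scores = np.where(future[None, :, :], -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)       # numerically stable softmax
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    # Sum attention over target tokens, then average over query tokens (assumption)
    return probs[:, :, list(target_idx)].sum(axis=-1).mean(axis=-1)


# Toy call: 4 query heads sharing 2 KV heads, 8 keys, 2 query tokens at positions 6 and 7
ram = ram_for_layer(np.zeros((4, 2, 4)), np.ones((2, 8, 4)), [6, 7], [2, 3])
```

Stacking the per-layer vectors over all layers would give the $L \times H$ map; only the query-token rows are needed, which is what keeps the memory footprint small.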

## Appendix E Feature Analysis and Causal Verification of CoRe Heads

### E.1 Implementation Details for Cross-Domain Stability Analysis

This section provides the implementation details for the cross\-domain stability and correlation analysis of CoRe heads presented in the main text\. To quantitatively assess the functional consistency of attention heads across diverse multimodal distributions, we implemented a unified evaluation pipeline across four representative datasets: RefCOCOg, MMLongBench, VidSTG, and MMDocIR\.

#### Data Processing and Spearman Correlation\.

For each dataset $t \in \mathcal{T}$, we first extract the head-level Retrieval Attention Mass (RAM) scores, denoted as $\mathcal{M}_{\text{RAM}}^{(l,h)}$, for all layers $l$ and heads $h$. Missing values resulting from layer-head extraction are zero-padded to ensure uniform tensor dimensions across tasks. To measure cross-task consistency, we flatten the layer and head dimensions of each score map into a single vector and compute the pairwise Spearman rank correlation coefficient between the score distributions of each task pair. The resulting correlation matrix is visualized to highlight the strong positive correlations across all domains.
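The correlation step can be sketched in plain Python. The helper names (`flatten`, `average_ranks`, `spearman`) are hypothetical, and the tie handling (average ranks, which matters for zero-padded entries) is an assumption about a detail the text does not specify.

```python
from typing import List, Sequence


def flatten(score_map: Sequence[Sequence[float]]) -> List[float]:
    """Flatten an L x H score map into one vector of length L*H."""
    return [v for row in score_map for v in row]


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")  # undefined when one vector is constant
    return cov / (var_x * var_y) ** 0.5


# Two toy 2x2 RAM maps that rank the heads identically
rho = spearman(flatten([[0.1, 0.4], [0.3, 0.9]]), flatten([[1.0, 3.0], [2.0, 8.0]]))
```

Calling `spearman` on every pair of task vectors fills the symmetric correlation matrix described above.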

#### Top-$k$ Stability Masking and Spatial Distribution.

To isolate the universally critical heads, we compute a stability matrix $S \in \mathbb{R}^{L \times H}$ that aggregates the occurrence frequency of top-ranked heads across all tasks. Specifically, for each task $t$, we calculate a dynamic threshold $\tau_t$ corresponding to the 95th percentile (top 5%) of its RAM score distribution. We then generate a binary activation mask $\mathbb{I}_{l,h}^{(t)}$ for each head:

$$\mathbb{I}_{l,h}^{(t)} = \begin{cases} 1, & \text{if } \mathcal{M}_{\text{RAM},t}^{(l,h)} \geq \tau_t \\ 0, & \text{otherwise} \end{cases} \tag{20}$$

The cross-domain stability count for a given head is computed by summing these indicator variables over all evaluating tasks:

$$S_{l,h} = \sum_{t \in \mathcal{T}} \mathbb{I}_{l,h}^{(t)} \tag{21}$$

where $S_{l,h} \in \{0, 1, 2, 3, 4\}$.
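Eqs. (20) and (21) translate into a short NumPy sketch. The function name `stability_counts` is hypothetical, and the use of `np.percentile` with its default linear interpolation is an assumption, since the text does not say how the 95th percentile is computed.

```python
import numpy as np


def stability_counts(ram_maps, top_fraction=0.05):
    """Count, per head, in how many tasks the head lies at or above the top-5% threshold."""
    maps = [np.asarray(m, dtype=float) for m in ram_maps]
    counts = np.zeros(maps[0].shape, dtype=int)
    for m in maps:
        tau = np.percentile(m, 100.0 * (1.0 - top_fraction))  # dynamic threshold tau_t
        counts += (m >= tau).astype(int)                      # indicator of Eq. (20)
    return counts                                             # S_{l,h} of Eq. (21)


# Three toy tasks over a 4-layer x 5-head model
task_a = np.arange(20).reshape(4, 5)
task_b = task_a[::-1, ::-1]          # reversed ranking
counts = stability_counts([task_a, task_b, task_a])
```

A head whose count equals the number of tasks is in the top 5% everywhere, which is the sense in which it is "universally critical".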

#### Visualization Setup\.

The stability matrix is visualized using a custom discrete linear segmented colormap. To ensure the accurate visual representation of the discrete task counts, the colorbar visualization boundaries are shifted by $-0.5$ (spanning from $-0.5$ to $3.5$ for $N = 4$ task overlaps) so that the color ticks align with the integer centers. All visualizations adhere to the standard typographical guidelines, using Times New Roman and STIX fonts for consistency.

### E.2 Implementation Details for Attention Head Intervention

To rigorously assess the causal impact of CoRe heads (as discussed in Section [4.4](https://arxiv.org/html/2606.05843#S4.SS4)), we implemented a head-level attention intervention mechanism. The complete unified intervention procedure is formally outlined in Algorithm [2](https://arxiv.org/html/2606.05843#alg2).

Algorithm 2: Memory-Efficient Chunk-wise Attention Head Intervention

Input: Query $Q \in \mathbb{R}^{B \times H \times N \times d}$, Key $K \in \mathbb{R}^{B \times H \times M \times d}$, Value $V \in \mathbb{R}^{B \times H \times M \times d}$

Input: Current layer index $l$, Ablation set $\mathcal{B}$, Chunk size $C$ (e.g., 256), Scaling factor $\alpha$

Output: Intervened Attention Output $O \in \mathbb{R}^{B \times H \times N \times d}$

1: Initialize empty output tensor $O$ with shape of $Q$

2: Extract target heads for current layer: $\mathcal{H}_{l} \leftarrow \{h \mid (l, h) \in \mathcal{B}\}$

3: if $\mathcal{H}_{l} = \emptyset$ then

4: return $\text{FlashAttention}(Q, K, V)$

5: end if

6: for $i = 0$ to $N - 1$ step $C$ do

7: $j \leftarrow \min(i + C, N)$

8: $Q_{\text{chunk}} \leftarrow Q[:, :, i{:}j, :]$

9: $S_{\text{chunk}} \leftarrow (Q_{\text{chunk}} K^{\top}) \cdot \alpha$ {Compute pre-softmax logits}

10: Apply standard causal/padding mask to $S_{\text{chunk}}$ if necessary

11: for $h \in \mathcal{H}_{l}$ do

12: $S_{\text{chunk}}[:, h, :, :] \leftarrow 0.0$ {Ablate selective retrieval via logit neutralization}

13: end for

14: $A_{\text{chunk}} \leftarrow \text{Softmax}(S_{\text{chunk}}, \text{dim} = -1)$

15: $A_{\text{chunk}} \leftarrow \text{Dropout}(A_{\text{chunk}})$

16: $O_{\text{chunk}} \leftarrow A_{\text{chunk}} V$

17: $O[:, :, i{:}j, :] \leftarrow O_{\text{chunk}}$

18: Free memory: Delete $Q_{\text{chunk}}, S_{\text{chunk}}, A_{\text{chunk}}, O_{\text{chunk}}$

19: end for

20: $O \leftarrow \text{Transpose and format } O$

21: return $O, \text{None}$ {Avoid returning full attention weight matrices to save VRAM}
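A NumPy sketch of the chunk-wise loop in Algorithm 2 is given below for illustration. It is not the authors' implementation: the function name is hypothetical, $\alpha = 1/\sqrt{d}$ is assumed, dropout is omitted (inference), and the final transpose step is skipped. It keeps the step order of Algorithm 2, so the zeroing in step 12 also overwrites masked logits and an ablated head ends up averaging $V$ uniformly over all keys.

```python
import numpy as np


def chunked_attention_with_ablation(q, k, v, ablate_heads, chunk=256, causal=True):
    """q: (B, H, N, d); k, v: (B, H, M, d). Returns the attention output (B, H, N, d)."""
    _, _, n, d = q.shape
    m = k.shape[2]
    scale = 1.0 / np.sqrt(d)                      # assumed value of alpha
    heads = sorted(ablate_heads)                  # heads of the current layer to ablate
    out = np.empty_like(q)
    for i in range(0, n, chunk):
        j = min(i + chunk, n)
        s = (q[:, :, i:j] @ k.transpose(0, 1, 3, 2)) * scale      # (B, H, j-i, M)
        if causal:
            q_pos = np.arange(i, j)[:, None] + (m - n)            # absolute query positions
            s = np.where(np.arange(m)[None, :] > q_pos, -np.inf, s)
        if heads:
            s[:, heads] = 0.0                     # step 12: neutralize logits of ablated heads
        s = s - s.max(axis=-1, keepdims=True)     # numerically stable softmax
        a = np.exp(s)
        a = a / a.sum(axis=-1, keepdims=True)
        out[:, :, i:j] = a @ v                    # only one chunk of weights is alive at a time
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(1, 2, 4, 2)) for _ in range(3))
ablated = chunked_attention_with_ablation(q, k, v, ablate_heads={1}, chunk=2)
```

Processing `chunk` query rows at a time bounds the logit tensor at $B \times H \times C \times M$ instead of $B \times H \times N \times M$, which is the memory saving the algorithm targets.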

### E.3 Calculation of the Key Token Ratio

To quantitatively evaluate the extraction precision and efficiency of the identified CoRe heads, we define a granular, token-level metric termed the Key Token Ratio. This metric assesses whether a specific attention head successfully concentrates its highest attention mass on the exact ground-truth visual tokens required to answer the query, filtering out surrounding multimodal noise.

#### Mathematical Formulation\.

Given an input sequence $S$ of length $L$, let $Q \subset S$ denote the subset of tokens corresponding to the linguistic question, and let $V^{*} \subset S$ denote the subset of key target visual tokens (e.g., the specific image patches or figure tokens corresponding to the ground-truth answer).

During the forward pass, for a specific attention head $h$ in layer $l$, we extract the post-softmax attention probability matrix. We first aggregate the attention distribution directed from the query tokens to the entire sequence by averaging over the query dimension:

$$\bar{A}^{(h)} = \frac{1}{|Q|} \sum_{q \in Q} A_{q \rightarrow S}^{(h)} \tag{22}$$

where $A_{q \rightarrow S}^{(h)} \in \mathbb{R}^{L}$ represents the attention probability distribution from a single query token $q$ to all tokens in the sequence.

To evaluate the precision of head $h$, we define a stringent threshold by isolating only the top 5% of tokens that receive the highest attention scores. Let $k = \max(1, \lfloor L \times 0.05 \rfloor)$. We define the highly-attended token set $\mathcal{T}_{\text{top}}^{(h)}$ as the indices of the top $k$ values in the aggregated distribution $\bar{A}^{(h)}$.

The Key Token Ratio for head $h$ is then defined as the percentage of ground-truth target tokens successfully captured within this top 5% attended set:

$$\text{Key Token Ratio}^{(h)} = \frac{|\mathcal{T}_{\text{top}}^{(h)} \cap V^{*}|}{|V^{*}|} \times 100 \tag{23}$$

During evaluation, if a head's Key Token Ratio meets or exceeds a predefined threshold (e.g., 50%), it is recorded as a successful "hit". We compute the average hit rate across the selected Top-$K$ and Bottom-$K$ head populations to demonstrate the functional divergence between CoRe heads and standard vision heads.
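Eqs. (22) and (23) can be sketched in plain Python for one head. The function names are hypothetical, and ties at the top-$k$ boundary are broken by token order, which is an assumption the text does not address.

```python
import math
from typing import List, Sequence, Set


def mean_over_queries(rows: Sequence[Sequence[float]]) -> List[float]:
    """Eq. (22): average the per-query attention rows into one length-L vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def top_attended(mean_attention: Sequence[float], top_fraction: float = 0.05) -> Set[int]:
    """Indices of the k = max(1, floor(L * fraction)) most attended tokens."""
    length = len(mean_attention)
    k = max(1, math.floor(length * top_fraction))
    order = sorted(range(length), key=lambda j: mean_attention[j], reverse=True)
    return set(order[:k])


def key_token_ratio(mean_attention, target_idx, top_fraction=0.05) -> float:
    """Eq. (23): percentage of target tokens that fall inside the top-attended set."""
    targets = set(target_idx)
    return 100.0 * len(targets & top_attended(mean_attention, top_fraction)) / len(targets)


# Toy head over L = 40 tokens (so k = 2) with two query rows
rows = [[0.005] * 40, [0.005] * 40]
rows[0][7], rows[1][7] = 0.6, 0.4
rows[0][8], rows[1][8] = 0.2, 0.4
ratio = key_token_ratio(mean_over_queries(rows), target_idx=[7, 8, 9, 10])
is_hit = ratio >= 50.0  # the example 50% hit threshold from the text
```

Because the top set holds only 5% of the sequence, the ratio is capped whenever $|V^{*}|$ exceeds $k$, which is worth keeping in mind when comparing across inputs of different lengths.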

### E.4 Calculation of the Key Token Coverage

While the Key Token Ratio (detailed in Section [E.3](https://arxiv.org/html/2606.05843#A5.SS3)) evaluates the retrieval precision of individual attention heads, it does not account for the collaborative dynamics among them. To assess the collective retrieval capacity of a population of heads, we introduce the Key Token Coverage metric. This metric quantifies the proportion of the ground-truth visual tokens that are successfully captured by the aggregate attention of a specific subset of heads (e.g., the Top-$K$ CoRe heads versus the Bottom-$K$ heads).

#### Mathematical Formulation\.

Following the previous notations, let $S$ denote the input sequence, $Q$ denote the query tokens, and $V^{*}$ denote the set of critical ground-truth visual tokens. For a given attention head $h$, we compute the query-aggregated 1D attention distribution $\bar{A}^{(h)}$ and extract the indices of the top $p\%$ (e.g., $p = 5$) most attended tokens, denoted as $\mathcal{T}_{\text{top}}^{(h)}$.

Let $\mathcal{G}$ represent a designated group of attention heads evaluated as an ensemble (for instance, $\mathcal{G}_{\text{top}}$ representing the Top-250 heads based on RAM scores). The subset of ground-truth target tokens successfully hit by an individual head $h$ is given by the intersection $(\mathcal{T}_{\text{top}}^{(h)} \cap V^{*})$.

To determine the collective coverage of the entire head group $\mathcal{G}$, we compute the union of all successfully hit target tokens across all heads in the group. This ensures that redundant retrievals of the same visual token by multiple heads are systematically deduplicated:

$$\mathcal{U}_{\mathcal{G}} = \bigcup_{h \in \mathcal{G}} \left( \mathcal{T}_{\text{top}}^{(h)} \cap V^{*} \right) \tag{24}$$

The Key Token Coverage for the group $\mathcal{G}$ is then defined as the ratio of this union's cardinality to the total number of ground-truth visual tokens:

$$\text{Key Token Coverage}(\mathcal{G}) = \frac{|\mathcal{U}_{\mathcal{G}}|}{|V^{*}|} \times 100 \tag{25}$$
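Eqs. (24) and (25) reduce to a set union, as the following plain-Python sketch shows. The function name `key_token_coverage` is hypothetical, and each head's top-attended index set is assumed to have been computed beforehand.

```python
from typing import Iterable, Set


def key_token_coverage(top_sets: Iterable[Set[int]], target_idx: Iterable[int]) -> float:
    """Percentage of target tokens hit by at least one head in the group."""
    targets = set(target_idx)
    covered: Set[int] = set()
    for top in top_sets:
        covered |= targets & set(top)   # union of per-head hits, Eq. (24)
    return 100.0 * len(covered) / len(targets)   # Eq. (25)


# Three toy heads: token 2 is hit twice but counted once
group = [{1, 2, 9}, {2, 3}, {8}]
coverage = key_token_coverage(group, target_idx=[1, 2, 3, 4])
```

Using a set makes the deduplication automatic, so coverage can only grow or stay flat as heads are added to the group.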

#### Evaluation Protocol\.

During our evaluation on the MMDocIR dataset, we systematically compute this coverage metric for both the highest\-ranked CoRe heads and the lowest\-ranked baseline heads across both LLaVA\-OneVision and Qwen3\-VL architectures\.

By comparing $\text{Key Token Coverage}(\mathcal{G}_{\text{top}})$ against $\text{Key Token Coverage}(\mathcal{G}_{\text{bottom}})$, we quantitatively demonstrate the structural sparsity of multimodal retrieval. The results confirm that a highly compact subset of Top-$K$ heads collectively encompasses the vast majority of necessary visual evidence, whereas the cumulative receptive field of the Bottom-$K$ heads fails to align with the semantic targets. Consistent with the individual head evaluations, the target token extraction ($V^{*}$) and the memory-efficient sequential layer-wise caching are employed to prevent memory bottlenecks during long-context inference.

### E.5 Qualitative Visualization of Functional Dichotomy in Object Grounding

Figure [8](https://arxiv.org/html/2606.05843#A5.F8) provides compelling qualitative evidence of the functional dichotomy within the model's attention mechanism on the RefCOCOg dataset. As illustrated, high-scoring CoRe heads act as precise information extractors, consistently and accurately localizing the semantically critical entities specified by complex textual queries (e.g., specific clothing attributes or fine-grained spatial relationships). Conversely, the low-scoring heads exhibit severe semantic dispersion, scattering their attention mass across uninformative background regions or irrelevant objects. These stark visual contrasts directly corroborate our quantitative findings, confirming that fine-grained cross-modal grounding is exclusively governed by a highly specialized, sparse subset of attention heads rather than homogeneously distributed across the network.

![Refer to caption](https://arxiv.org/html/2606.05843v1/x11.png)

Figure 8: Qualitative comparison of attention allocation on the RefCOCOg dataset. Left (CoRe Heads): High-scoring heads demonstrate highly precise spatial grounding, accurately isolating the visual tokens corresponding to the query-relevant entities (highlighted in red). Right (Low-scoring Heads): The remaining heads fail to capture task-relevant visual cues, instead distributing their attention diffusely across background noise and uninformative regions. This visually confirms the role of CoRe heads as dedicated cross-modal extractors.

## Appendix F Extended Analysis of CoRe Attention Topologies and Model Scaling

To further substantiate the observations regarding model-wide attention shifts discussed in the main text, we provide comprehensive heatmaps of the $\mathcal{M}_{\text{RAM}}^{(h)}$ (Retrieval Attention Mass) scores for the CoRe attention heads. This extended visualization contrasts the attention topologies of the Qwen3-VL series across varying parameter scales (2B vs. 4B) and task modalities, ranging from static image understanding (COCO) to complex spatio-temporal reasoning (VidSTG) and long-context document processing (MMLongBench). The empirical results consistently demonstrate that as task complexity and model scale increase, the attention distribution transitions from a broadly dispersed state to a highly localized "bottleneck structure" within the intermediate-to-deep layers. This structural divergence suggests that larger models develop specialized functional regions to manage the increased informational entropy inherent in long-range and multi-modal reasoning, effectively distilling critical cues through a narrowed architectural bottleneck.

![Refer to caption](https://arxiv.org/html/2606.05843v1/x12.png)

Figure 9: Heatmaps of CoRe head activation distributions across heterogeneous datasets and model scales. Warmer colors indicate higher normalized activation scores. (a) Task-Specific Patterns: Contrasting the 4B model on COCO, VidSTG, and MMLongBench reveals that complex temporal and long-context tasks drive activations to cluster in specific deep layers (Layers 12–28), forming a distinct "reasoning bottleneck." (b) Scaling Effects: On the VidSTG dataset, the 4B model demonstrates a more structured, deep-layer attention concentration compared to the fragmented activations of the 2B model, indicating that increased capacity facilitates efficient feature localization.

## Appendix G System-Level Acceleration and Additional Benchmarks

### G.1 CoRe-Guided Hybrid Attention Mechanism

The quadratic computational complexity of the self\-attention mechanism poses a severe bottleneck for Multimodal Large Language Models \(MLLMs\), particularly during the prefill stage of processing long visual contexts \(e\.g\., high\-resolution images or multi\-page documents\)\. Based on our finding that cross\-modal semantic integration is highly localized within a sparse subset of CoRe heads, we introduce a system\-level acceleration paradigm:Head\-level Hybrid Attention\.

As illustrated in our architectural design, we decouple the attention computation during the prefill stage based on the intrinsic importance of each attention head\. The formulation is as follows:

1. Head Configuration via CoRe Ranking: All attention heads are ranked according to their expected semantic contribution to cross-modal integration (CoRe Score). We partition the heads into two sets: the top-$k$ critical CoRe heads ($\mathcal{H}_{\text{dense}}$) and the remaining non-essential heads ($\mathcal{H}_{\text{sparse}}$).
2. Top-$k$ Full Attention: For heads in $\mathcal{H}_{\text{dense}}$, we retain the standard global dense attention pattern, allowing these routing hubs to maintain unconstrained receptive fields for precise visual feature extraction.
3. Stream Sparse Attention: For the vast majority of heads in $\mathcal{H}_{\text{sparse}}$, global connections are functionally redundant. We restrict their computation to a local sliding window. For a query at position $i$, these heads strictly attend to keys within a localized window $[i-w, i+w]$ alongside a small set of initial attention sinks.

During the decoding stage (autoregressive generation), the sequence length of the query is $1$, rendering the computational overhead of global attention negligible. Therefore, the hybrid pattern is applied exclusively to the prefill stage, and the model transitions back to standard attention during decoding.
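The dense/sparse split described in this section can be made concrete with a small plain-Python sketch that lists which keys a prefill query position attends to and counts the resulting query-key pairs. Everything here is illustrative: the function names are hypothetical, the sketch assumes causal prefill (so the window is truncated to $[i-w, i]$), and the defaults of 4 sink tokens and a 32-token window are assumed token-level values, not confirmed kernel semantics.

```python
from typing import Set


def attended_keys(i: int, dense: bool, sinks: int = 4, window: int = 32) -> Set[int]:
    """Key positions visible to a causal query at position i for one head."""
    if dense:
        return set(range(i + 1))                     # full causal attention
    sink_keys = set(range(min(sinks, i + 1)))        # initial attention sinks
    local_keys = set(range(max(0, i - window), i + 1))  # local sliding window
    return sink_keys | local_keys


def prefill_pairs(n: int, dense: bool, sinks: int = 4, window: int = 32) -> int:
    """Total number of query-key pairs evaluated by one head over an n-token prefill."""
    return sum(len(attended_keys(i, dense, sinks, window)) for i in range(n))


# Compare one dense head with one streaming head on a 1000-token prefill
dense_cost = prefill_pairs(1000, dense=True)
sparse_cost = prefill_pairs(1000, dense=False)
```

The dense count grows quadratically with the prefill length while the streaming count grows linearly, so the gap widens as visual contexts get longer.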

### G.2 Algorithm and Implementation Details

The unified forward pass for the CoRe-guided acceleration is formalized in Algorithm [3](https://arxiv.org/html/2606.05843#alg3).

Algorithm 3: CoRe-Guided Head-level Hybrid Attention (Forward Pass)

Input: Hidden states $X$, position embeddings $(\cos, \sin)$

Input: Current layer index $l$, top-$k$ dense head set $\mathcal{H}_{\text{dense}}$

Parameters: $\texttt{FULL\_ATTN\_FLAG} = 0$, $\texttt{SPARSE\_ATTN\_FLAG} = -1$

1. $Q, K, V \leftarrow \text{ProjectAndApplyRoPE}(X, \cos, \sin)$
2. $B, H, N, D \leftarrow Q.\text{shape}$ {Batch size, Num heads, Seq length, Head dim}
3. $\text{is\_prefill} \leftarrow N > 1$
4. if not $\text{is\_prefill}$ then
5. {Decoding Stage: Use standard FlashAttention for efficiency}
6. $O, \text{weights} \leftarrow \text{StandardAttention}(Q, K, V)$
7. else
8. {Prefill Stage: Apply Head-level Hybrid Attention}
9. Initialize $\texttt{head\_mask} \in \mathbb{R}^{H}$ with $\texttt{SPARSE\_ATTN\_FLAG}$
10. Extract dense heads for current layer: $\mathcal{H}_{l} \leftarrow \{h \mid (l, h) \in \mathcal{H}_{\text{dense}}\}$
11. if $\mathcal{H}_{l} \neq \emptyset$ then
12. $\texttt{head\_mask}[\mathcal{H}_{l}] \leftarrow \texttt{FULL\_ATTN\_FLAG}$ {Elevate target heads to dense}
13. end if
14. Define $\texttt{streaming\_info} \leftarrow [4, 32] \times H$ {Set attention sinks and window size}
15. Flatten $Q, K, V$ to shape $(B \times N, H, D)$ for var-len kernel
16. $O \leftarrow \texttt{block\_streaming\_attn\_func}(Q, K, V, \texttt{head\_mask}, \texttt{streaming\_info}, \text{is\_causal} = \text{True})$
17. Reshape $O$ back to $(B, N, H, D)$
18. end if
19. $O \leftarrow \text{OutputProjection}(O)$
20. return $O, \text{None}$
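
The prefill branch of Algorithm 3 can be illustrated with an unfused, pure-Python reference. This is a sketch for checking the logic on tiny inputs, not the authors' Triton kernel: the function names `attend` and `hybrid_prefill` are hypothetical, tensors are plain nested lists indexed `[head][position][dim]`, and the sink count and window are assumed to be measured in tokens. Only the two flag values are taken from the algorithm itself.

```python
import math

FULL_ATTN_FLAG = 0      # flag values as listed in Algorithm 3
SPARSE_ATTN_FLAG = -1


def attend(q, k, v, keys):
    """Softmax attention of one query vector over the given key positions."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k[j])) for j in keys]
    top = max(logits)                       # subtract the max for stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [
        sum(w * v[j][d] for w, j in zip(weights, keys)) / total
        for d in range(len(v[0]))
    ]


def hybrid_prefill(Q, K, V, head_mask, num_sinks=4, window=32):
    """Per-head causal attention: dense for flagged heads, sinks + window otherwise."""
    out = []
    for h, flag in enumerate(head_mask):
        rows = []
        for i in range(len(Q[h])):
            if flag == FULL_ATTN_FLAG:
                keys = list(range(i + 1))                     # all past keys
            else:
                sinks = set(range(min(num_sinks, i + 1)))     # attention sinks
                local = set(range(max(0, i - window), i + 1)) # sliding window
                keys = sorted(sinks | local)
            rows.append(attend(Q[h][i], K[h], V[h], keys))
        out.append(rows)
    return out
```

A useful sanity check on any fused implementation is that a sparse head with a window at least as long as the sequence reproduces the dense result, since both then see the same keys.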

By utilizing this fused kernel strategy, we effectively convert the theoretical $O(N^{2})$ complexity of standard attention into $O(N \cdot w)$ for the vast majority of heads. As empirically demonstrated in our main experiments (Table [2](https://arxiv.org/html/2606.05843#S4.T2)), this system-level optimization yields substantial prefill speedups (up to $2.3\times$) while simultaneously preserving, and in some granular reasoning tasks even slightly enhancing, the multimodal comprehension capabilities of the baseline models.
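
The arithmetic behind the $O(N^{2})$ to $O(N \cdot w)$ reduction can be sketched by counting query-key pairs. The helper names below are hypothetical, the defaults of 4 sinks and a window of 32 mirror `streaming_info` in Algorithm 3 under the assumption that both are measured in tokens (a block-based kernel may count them differently), and the result is a ratio of attention-score entries, not a wall-clock prediction.

```python
def dense_pairs(n):
    """Causal dense attention: query i sees i + 1 keys."""
    return n * (n + 1) // 2


def sparse_pairs(n, num_sinks, window):
    """Causal sinks-plus-window attention, counted position by position."""
    total = 0
    for i in range(n):
        local = min(i, window) + 1                   # keys in [i - window, i]
        sinks = min(num_sinks, max(0, i - window))   # sinks outside the window
        total += local + sinks
    return total


def hybrid_fraction(n, dense_ratio, num_sinks=4, window=32):
    """Attention-score work of the hybrid layout relative to all-dense heads."""
    dense, sparse = dense_pairs(n), sparse_pairs(n, num_sinks, window)
    return dense_ratio + (1.0 - dense_ratio) * sparse / dense


if __name__ == "__main__":
    for n in (8_192, 32_768, 131_072):
        print(n, round(hybrid_fraction(n, dense_ratio=0.05), 4))
```

Under these assumptions, keeping 5% of heads dense at 8k tokens leaves just under 6% of the dense score-matrix work. The measured end-to-end speedup is much smaller than this ratio suggests because projections, MLP layers, and memory traffic are unaffected by the attention pattern.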

### G.3 Long-Video Inference Performance on the Video-MME Benchmark

To evaluate the acceleration strategy on long-context multimodal tasks, we benchmarked the models on the Video-MME dataset. This benchmark tests model performance across perception, recognition, problem-solving, and reasoning. As shown in Table [4](https://arxiv.org/html/2606.05843#A7.T4), our sparse attention configurations match or occasionally exceed the performance of fully dense baselines.

Overall performance remains stable across all three model families under high sparsity\. Instead of degrading, models frequently show slight improvements at specific sparsity levels\. For LLaVA\-OneVision, the dense baseline scores an average of 56\.3\. When retaining only 19\.1% and 25\.5% of the attention heads, this average increases to 56\.5 and 56\.6\. Similarly, Qwen3\-VL\-8B matches its dense baseline average of 65\.0 while using only 2\.6% of its attention heads\. This performance stabilization suggests an implicit regularization effect\. Dense attention mechanisms in long\-video contexts tend to aggregate irrelevant background noise\. Masking non\-essential heads forces the model to rely on specialized CoRe heads, filtering out visual distractions and focusing computation on relevant spatiotemporal information\.

The Qwen3\-VL family results demonstrate a scaling pattern for attention head redundancy\. The 32B model maintains stable performance under extreme sparsity\. Retaining 4\.9% of its heads yields a score of 69\.7, a 0\.2\-point decrease from the dense baseline of 69\.9\. At a 1\.2% retention rate, performance slightly decreases to 69\.4\. This scaling behavior indicates that visual information routing becomes more concentrated as model capacity increases\. Larger models exhibit higher functional redundancy in their attention layers, permitting aggressive pruning during prefill without affecting reasoning accuracy\.

Examining specific Video\-MME sub\-tasks shows how sparsity affects different cognitive dimensions\. Spatial perception and OCR metrics frequently improve under sparse conditions\. LLaVA’s spatial perception increases from 55\.6 in the dense setting to 59\.3 in the 19\.1% configuration, and Qwen\-32B’s OCR reaches 78\.4 in the 1\.2% configuration, compared to the 76\.3 baseline\. This indicates that isolating CoRe heads aids in localizing spatial details and text within noisy frames\. Higher\-order reasoning tasks, such as action, object, and spatio\-temporal reasoning, are largely unaffected by head masking\. Qwen\-8B maintains a score of 80\.4 in spatial reasoning across most sparsity levels, confirming that the retained heads capture the semantic logic necessary for video comprehension\.

Table 4: Overall evaluation results on the Video-MME benchmark. Best results per model family are bolded. The subscript indicates the absolute difference compared to the corresponding Dense baseline.

### G.4 Prefill Latency Benchmarking

We designed a controlled micro\-benchmarking protocol to evaluate the speedup achieved by our CoRe\-guided hybrid attention mechanism\. This protocol isolates the self\-attention bottleneck within the language modeling backbone, allowing us to precisely measure the prefill latency across varying context windows ranging from 8k to 128k tokens\.

In Multimodal Large Language Models (MLLMs) like Qwen3-VL, the $O(N^{2})$ computational complexity is primarily localized within the self-attention layers of the core language model, rather than the vision encoder. To isolate this scaling behavior and eliminate the constant-time overhead of visual feature extraction, our benchmarking script directly targets the text-conditional generation module. We simulate multimodal long-context inputs by generating randomized discrete token sequences and fully unmasked tensors of length $N$. This setup ensures that the measured latency strictly reflects the computational cost of the attention operations.

Prior to the forward pass, we leverage the empirical Retrieval Attention Mass (RAM) scores to construct a static `block_list`. This list dictates the specific layers and heads designated for dense full attention versus Stream Sparse Attention. We inject this configuration into our Triton-based attention kernel, ensuring that the hardware dispatch aligns perfectly with our theoretical attention distribution.
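
One way such a static configuration could be derived from per-head scores is sketched below. The actual layout of `block_list` is not specified in the text, so the per-layer list of dense head indices used here is an assumption, as are the names `build_block_list`, `head_mask_for_layer`, and `keep_ratio`. The flag values follow Algorithm 3.

```python
def build_block_list(ram_scores, keep_ratio):
    """ram_scores[l][h] is the RAM score of head h in layer l.

    Returns, for each layer, the sorted indices of heads kept dense, chosen
    as the globally top-ranked `keep_ratio` fraction of all heads.
    """
    ranked = sorted(
        ((score, l, h)
         for l, row in enumerate(ram_scores)
         for h, score in enumerate(row)),
        key=lambda item: item[0],
        reverse=True,
    )
    k = max(1, round(keep_ratio * len(ranked)))   # keep at least one dense head
    dense = [[] for _ in ram_scores]
    for _, l, h in ranked[:k]:
        dense[l].append(h)
    return [sorted(heads) for heads in dense]


def head_mask_for_layer(dense_heads, num_heads, full_flag=0, sparse_flag=-1):
    """Per-head flags for one layer: sparse by default, dense where listed."""
    mask = [sparse_flag] * num_heads
    for h in dense_heads:
        mask[h] = full_flag
    return mask
```

Because the ranking is global rather than per layer, some layers may end up with no dense heads at all, which corresponds to the empty-set check in step 11 of Algorithm 3.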

To guarantee precise timing, we employ CUDA event tracking, which circumvents CPU\-GPU asynchronous execution discrepancies and Python interpreter overhead\. Before recording the measurements, we execute two warmup forward passes to initialize the CUDA context, allocate KV\-cache buffers, and stabilize GPU clock speeds\. Subsequently, we perform five independent forward passes for each sequence length\. We enforce strict synchronization before and after each pass to capture the exact physical GPU execution time\. The final prefill latency is reported as the arithmetic mean of these five runs\.
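
The timing protocol described above (two warmup passes, five synchronized measurements, arithmetic mean) can be sketched as follows. The function name `mean_latency_ms` and its parameters are hypothetical. The CUDA branch uses `torch.cuda.Event` with `enable_timing=True`; the wall-clock fallback for machines without PyTorch or a GPU is an addition made only so the sketch runs anywhere and is not part of the described protocol.

```python
import time
from statistics import mean

try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:          # PyTorch is optional for this sketch
    torch = None
    HAS_CUDA = False


def mean_latency_ms(forward, warmup=2, repeats=5):
    """Average latency of `forward()` in milliseconds after warmup passes."""
    for _ in range(warmup):
        forward()            # initialize context, buffers, and clocks
    samples = []
    for _ in range(repeats):
        if HAS_CUDA:
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            torch.cuda.synchronize()      # drain pending work before timing
            start.record()
            forward()
            end.record()
            torch.cuda.synchronize()      # wait until the pass has finished
            samples.append(start.elapsed_time(end))   # milliseconds
        else:
            t0 = time.perf_counter()
            forward()
            samples.append((time.perf_counter() - t0) * 1e3)
    return mean(samples)
```

In practice `forward` would be a zero-argument closure over the model and a fixed-length input, for example a lambda wrapping a single prefill call inside a no-gradient context.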

Additionally, to evaluate memory scalability, the script is designed to catch `OutOfMemoryError` exceptions. If a specific context length exceeds the VRAM capacity, the process logs the failure, clears the CUDA cache, and safely proceeds to the next configuration. This mechanism allows us to identify the exact context-length boundaries where standard dense attention fails but our hybrid attention continues to operate successfully.
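
A minimal sketch of this failure-tolerant sweep is shown below. The function name `sweep_context_lengths` and its parameters are hypothetical. To keep the example runnable without a GPU, the exception type and the cleanup hook are passed in as arguments, with Python's built-in `MemoryError` as a stand-in default.

```python
def sweep_context_lengths(lengths, run_once, oom_errors=(MemoryError,), on_oom=None):
    """Run `run_once(n)` for each length; record None where memory runs out."""
    results = {}
    for n in lengths:
        try:
            results[n] = run_once(n)
        except oom_errors as err:
            # Log the failure and continue with the next configuration.
            print(f"N={n}: out of memory ({type(err).__name__}), skipping")
            results[n] = None
            if on_oom is not None:
                on_oom()     # e.g. release cached allocator blocks
    return results


if __name__ == "__main__":
    def fake_run(n):
        # Stand-in for a timed prefill pass that fails beyond 32k tokens.
        if n > 32_768:
            raise MemoryError("simulated OOM")
        return n / 1000.0

    print(sweep_context_lengths([8_192, 32_768, 131_072], fake_run))
```

With PyTorch on a CUDA device, one would presumably pass `oom_errors=(torch.cuda.OutOfMemoryError,)` and `on_oom=torch.cuda.empty_cache`. Comparing the first `None` entry for the dense and hybrid configurations gives the context-length boundary described above.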

## Appendix H Limitation

While this study provides robust mechanistic insights into the functional sparsity of Multimodal Large Language Models \(MLLMs\), several limitations warrant future investigation\. Primarily, the formulation of our core metric, Retrieval Attention Mass \(RAM\), relies heavily on explicit ground\-truth spatial or temporal annotations \(e\.g\., bounding boxes or video tubes\) to locate target visual entities, which complicates the identification of CoRe heads in entirely unannotated or purely abstract reasoning contexts\. Furthermore, although our evaluations span diverse, representative late\-fusion transformer architectures, it remains an open question whether this distinct functional dichotomy universally emerges in natively multimodal early\-fusion models or non\-transformer frameworks\. Finally, our proposed CoRe\-guided hybrid attention mechanism currently employs a deterministic, static head configuration during the prefill stage; while highly efficient on standard benchmarks, advancing towards a dynamic, input\-adaptive routing strategy could further enhance model robustness for complex, out\-of\-distribution queries\.
