From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

arXiv cs.AI 06/10/26, 04:00 AM Papers
Summary
This paper studies how audio and visual information flow inside Audio-Visual Large Language Models (AVLLMs), revealing that AVLLMs follow sequential or parallel routing depending on input configuration, and that some tokens can be discarded after information transfer for efficiency.
arXiv:2606.10147v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations, audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets, enabling more efficient inference. These findings hold across multiple models and scales, Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales, leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.
Cached at: 06/10/26, 06:13 AM
# From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
Source: [https://arxiv.org/html/2606.10147](https://arxiv.org/html/2606.10147)
Wish Suharitdamrong¹, Muhammad Awais¹,², Xiatian Zhu¹,², Sara Atito¹,²

¹ Surrey Institute for People-Centred AI (PAI), University of Surrey, UK

² Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK

###### Abstract

Multimodal Large Language Models \(MLLMs\) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real\-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood\. In this study, we examine audio\-visual information flow inside Audio\-Visual Large Language Models \(AVLLMs\), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations, audio\-visual video and multiple interleaved audio\-visual items\. We find that for audio\-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task’s reliance on each modality\. In settings with multiple interleaved audio\-visual items, this routing shifts to different parallel streams\. Furthermore, we demonstrate that audio\-visual and other token types can be discarded once their information is transferred to LLM, with minimal impact on the model’s prediction or even slight improvement, generalizing across multiple tasks and datasets, enabling more efficient inference\. These findings hold across multiple models and scales, Qwen2\.5\-Omni and Video\-SALMONN2 Plus at 3B and 7B scales, leading to hypotheses on why these flow structures emerge\. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio\-visual and broader MLLMs\.

## 1 Introduction

Multimodal Large Language Models (MLLMs) Team et al. ([2023](https://arxiv.org/html/2606.10147#bib.bib40)); Hurst et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib39)) have progressed rapidly, jointly processing auditory and visual information in models that can both listen and see, bringing machine perception closer to human perception. Earlier research developed each modality independently, leading to specialized Vision-Language models (VLMs) Liu et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib12)); Li et al. ([2024a](https://arxiv.org/html/2606.10147#bib.bib13)); An et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib14)); Bai et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib15)); Tong et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib16)); Zhang et al. ([2025a](https://arxiv.org/html/2606.10147#bib.bib38)); Wang et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib17)) and Audio-Language models (ALMs) Gong et al. ([2023](https://arxiv.org/html/2606.10147#bib.bib18)); Tang et al. ([2023](https://arxiv.org/html/2606.10147#bib.bib20)); Ghosh et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib19)); Goel et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib21)); Chu et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib22)), each operating effectively within its target modality. Recent Audio-Visual Large Language Models (AVLLMs) Xu et al. ([2025a](https://arxiv.org/html/2606.10147#bib.bib4)); Tang et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib5)); Xu et al. ([2025b](https://arxiv.org/html/2606.10147#bib.bib10)); Fu et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib37)); Cheng et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib23)); Team ([2026](https://arxiv.org/html/2606.10147#bib.bib36)) integrate visual and auditory inputs to enable unified audio-visual understanding. These models can answer questions about audio-visual scenes and transcribe visually grounded speech, tasks requiring cross-modal reasoning across the audio and visual modalities. These models span input formats from single images, videos, or audio clips to audio-visual videos and multiple interleaved audio-visual items, reaching diverse real-world scenarios. Around these models, an active research landscape has emerged, including benchmarks probing audio-visual understanding across rich and complex scenarios Yang et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib46)); Li et al. ([2025a](https://arxiv.org/html/2606.10147#bib.bib49)); Zhou et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib47)); Li et al. ([2024b](https://arxiv.org/html/2606.10147#bib.bib48)), parameter-efficient fine-tuning methods Wei et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib41)), adaptation-based token compression Gong et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib44)); Ding et al. ([2026](https://arxiv.org/html/2606.10147#bib.bib42)), and training-free token compression at inference Tao et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib43)); Li and Huang ([2026](https://arxiv.org/html/2606.10147#bib.bib45)).

In parallel, mechanistic interpretability has made significant progress in uncovering the internal mechanisms inside LLMs Nanda et al. ([2023](https://arxiv.org/html/2606.10147#bib.bib50)); Elhage et al. ([2021](https://arxiv.org/html/2606.10147#bib.bib52)); Rai et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib51)); Geva et al. ([2023](https://arxiv.org/html/2606.10147#bib.bib9)). Similar techniques have recently been extended to MLLMs Basu et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib24)); Nikankin et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib25)); Neo et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib26)); Zhang et al. ([2025b](https://arxiv.org/html/2606.10147#bib.bib7)); Kim et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib6)); Kaduri et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib27)); Selvakumar et al. ([2026](https://arxiv.org/html/2606.10147#bib.bib28)). In particular, attention knockout has been used to trace how cross-modal information flow emerges from image inputs in VLMs Zhang et al. ([2025b](https://arxiv.org/html/2606.10147#bib.bib7)) and how spatiotemporal information flow emerges from video inputs in VideoLLMs Kim et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib6)). Beyond these mechanistic studies, multi-image input handling has been actively studied in VLMs, surfacing failure modes and motivating mitigation strategies. Cross-image information leakage has been identified as a core failure mode, where visual content from different images entangles in the output Park et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib31)). Delimiter tokens have been examined and exploited as a mechanism to limit this entanglement Lee et al. ([2026](https://arxiv.org/html/2606.10147#bib.bib30)). Performance on multi-image tasks has also been shown to degrade as the number of input images grows Das et al. ([2026](https://arxiv.org/html/2606.10147#bib.bib29)).

While AVLLMs introduce a new dimension to machine perception through the integration of sound and sight, the internal mechanisms underlying audio-visual integration remain largely unstudied. Concurrent work on AVLLMs Selvakumar et al. ([2026](https://arxiv.org/html/2606.10147#bib.bib28)) examines audio-visual captioning and reports that cross-modal integration concentrates in deep layers. In contrast, information flow studies in VLMs Zhang et al. ([2025b](https://arxiv.org/html/2606.10147#bib.bib7)) and VideoLLMs Kim et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib6)) locate cross-modal integration at earlier-to-middle layers, where visual information flows to the prediction only through the language tokens. Whether the information flow in AVLLMs aligns with these VLM and VideoLLM findings or departs from them, and how AVLLMs distribute their reliance on audio versus visual inputs along this flow, remains an open question. In particular, no prior work has examined the role of audio along the information flow in MLLMs, and it is unclear whether audio behaves similarly to visual information or follows different pathways. Additionally, in the multi-input interleaved configuration, prior work in VLMs has characterized model behavior on multi-image inputs Park et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib31)); Lee et al. ([2026](https://arxiv.org/html/2606.10147#bib.bib30)); Das et al. ([2026](https://arxiv.org/html/2606.10147#bib.bib29)), but the underlying information flow has not been examined, neither for multi-image inputs nor for the broader case where audio items are interleaved alongside visual items. In this study, we trace how audio and visual information jointly flow through AVLLMs to form the prediction, mapping these pathways in both input configurations and characterizing how each modality contributes along the way. Our key findings are as follows:

- Audio-visual information does not reach the deep layers: Video attention in later layers is dominated by attention artifacts that disproportionately attract attention, making attention allocation an unreliable indicator of information flow.
- Task requirements steer the model’s audio-visual flow: The contribution of each modality to the prediction and the strength of interaction between audio and video vary with what the task requires, with the flow carrying more visual, auditory, or audio-visual alignment content depending on which is needed to answer the question.
- Multiple independent audio-visual inputs route through parallel paths: Independent audio and visual items interleaved with text route information to the prediction along multiple parallel paths, rather than through a single sequential path as in single audio-visual videos.
- Tokens can be discarded after their information is transferred: Once a token’s content has been passed on, it can be discarded with minimal impact on accuracy or even slight improvement. We show this across tasks and datasets, and across input configurations, with each token type discarded at the distinct layer where its information transfer completes.

## 2 Preliminary on audio-visual large language models (AVLLMs)

##### Multimodal tokenization and sequence construction:

AVLLMs process a video with its audio track and a text instruction through an autoregressive transformer over an interleaved token sequence. Let the video frames be $\mathcal{V}\in\mathbb{R}^{T\times H\times W\times 3}$, with $T$ frames at spatial resolution $H\times W$. The frames are passed through a vision encoder and projector to produce $N_V$ video tokens of dimension $d$, the audio track is processed by an audio encoder into $N_A$ audio tokens of the same dimension, and the text instruction is tokenized into $N_T$ text tokens. For a single audio-visual video input, AVLLMs preserve temporal alignment by interleaving audio and video tokens within fixed temporal windows. Let $C$ denote the number of windows, and let $\mathbf{V}_c$ and $\mathbf{A}_c$ be the visual and audio tokens within the $c$-th window. With system prompt, video, audio, and question segments, the full input sequence to the language model is

$$\mathcal{I} \;=\; \Big[\;\underbrace{s_1,\ldots,s_{N_S}}_{\text{system}}\;;\;\underbrace{\mathbf{V}_1,\mathbf{A}_1\;;\;\ldots\;;\;\mathbf{V}_C,\mathbf{A}_C}_{\text{single audio-visual video}}\;;\;\underbrace{q_1,\ldots,q_{N_Q}}_{\text{question}}\;\Big], \tag{1}$$

where $s_1,\ldots,s_{N_S}$ are system-prompt tokens, $q_1,\ldots,q_{N_Q}$ are question tokens, and the total sequence length is $N=N_S+N_V+N_A+N_Q$. Beyond this single audio-visual video setting, AVLLMs also process *multi-input* sequences with multiple independent audio and visual items interleaved with text, which we describe in Section [5](https://arxiv.org/html/2606.10147#S5).
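As a rough illustration of the layout in Equation 1, the following Python sketch builds a per-position segment label list for a single audio-visual video. The function name `build_sequence`, the label strings, and the token counts are hypothetical choices made for illustration; they are not taken from any model's actual tokenizer or processor.

```python
from typing import List


def build_sequence(
    n_system: int,
    video_per_window: List[int],
    audio_per_window: List[int],
    n_question: int,
) -> List[str]:
    """Return one segment label per position, following the order of Equation 1."""
    if len(video_per_window) != len(audio_per_window):
        raise ValueError("each temporal window needs both a video and an audio count")
    labels = ["system"] * n_system
    # Within each temporal window, video tokens come first, then audio tokens.
    for n_v, n_a in zip(video_per_window, audio_per_window):
        labels += ["video"] * n_v
        labels += ["audio"] * n_a
    labels += ["question"] * n_question
    return labels


# Toy sizes (illustrative only): C = 2 windows, 3 video and 2 audio tokens each.
labels = build_sequence(n_system=2, video_per_window=[3, 3],
                        audio_per_window=[2, 2], n_question=4)
# Total length N = N_S + N_V + N_A + N_Q = 2 + 6 + 4 + 4.
print(len(labels), labels)
```

Keeping an explicit label per position makes it straightforward to select source and target position sets for the interventions described later.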

##### Causal self\-attention:

At each transformer layer $\ell$, the hidden states $\mathbf{H}^{\ell}\in\mathbb{R}^{N\times d}$ are projected into query, key, and value matrices $\mathbf{Q}^{\ell}=\mathbf{H}^{\ell}\mathbf{W}_{Q}^{\ell}$, $\mathbf{K}^{\ell}=\mathbf{H}^{\ell}\mathbf{W}_{K}^{\ell}$, $\mathbf{V}^{\ell}=\mathbf{H}^{\ell}\mathbf{W}_{V}^{\ell}$, where $\mathbf{W}_{Q}^{\ell},\mathbf{W}_{K}^{\ell},\mathbf{W}_{V}^{\ell}\in\mathbb{R}^{d\times d_h}$ are learnable and $d_h$ is the per-head dimension. The attention output is

$$\mathrm{Attention}(\mathbf{Q}^{\ell},\mathbf{K}^{\ell},\mathbf{V}^{\ell}) \;=\; \mathrm{softmax}\!\left(\frac{\mathbf{Q}^{\ell}(\mathbf{K}^{\ell})^{\top}}{\sqrt{d_h}}+\mathbf{M}\right)\mathbf{V}^{\ell}, \tag{2}$$

where $\mathbf{M}\in\mathbb{R}^{N\times N}$ is a causal mask enforcing autoregressive decoding.
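To make Equation 2 concrete, here is a minimal single-head sketch in plain Python. It assumes the projections have already been applied, so `Q`, `K`, and `V` are lists of per-position vectors; the function name `causal_attention` is hypothetical, and real implementations are batched, multi-head, and run on tensors.

```python
import math


def causal_attention(Q, K, V):
    """Single-head attention with an additive causal mask (Equation 2)."""
    n, d_h = len(Q), len(Q[0])
    outputs = []
    for i in range(n):
        # Scaled dot-product scores; the causal mask M adds -inf for keys j > i.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_h) if j <= i else -math.inf
            for j in range(n)
        ]
        # Numerically stable softmax over the key axis.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return outputs


# Toy input: two positions with two-dimensional heads.
out = causal_attention([[1.0, 0.0], [0.0, 1.0]],
                       [[1.0, 0.0], [0.0, 1.0]],
                       [[1.0, 0.0], [0.0, 1.0]])
print(out)
```

Because the mask is additive, blocking an extra attention edge only requires setting one more entry to negative infinity, which is the property the knockout interventions in Section 4.1 rely on.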

## 3 What do attention patterns reveal about information flow?

To trace how audio-visual information reaches the prediction, a natural starting point is to examine where the model directs its attention. We do this on multiple-choice question-answering (MCQ) tasks, where the prediction is a single token (the answer letter), using Qwen2.5-Omni Xu et al. ([2025a](https://arxiv.org/html/2606.10147#bib.bib4)) at 3B scale as our subject model. We inspect the attention allocation of the last input token, the position at which the first generated token, and thus the prediction, is formed. Specifically, we track its allocation across layers and across token categories (system prompt, video, audio, user instruction). Figure [1](https://arxiv.org/html/2606.10147#S3.F1) (left) shows that throughout most of the network, the last token attends predominantly to language tokens (system prompt and user instruction), and attention to multimodal tokens fades through the layers. However, attention to video sharply spikes at layer 31 and remains elevated through the final layer.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x1.png)

Figure 1: Attention to video sharply spikes at layer 31. (Left) Attention allocation of the last token per layer and token category. (Middle, Right) Attention maps at layers 30 and 31 of Qwen2.5-Omni 3B, with the vision sinks at layer 31 marked by red arrows.

Table 1: Masking attention to video and audio tokens at later layers (31–35) leaves AV-SpeakerBench accuracy unchanged or slightly improved.

| Mask | Accuracy |
| --- | --- |
| Original causal mask | 42.24 |
| Mask video for last token | 42.24 |
| Mask video for all text | 42.31 |
| Mask video and audio for all text | 42.52 |

To understand why this spike emerges, we examine the attention maps at layers 30 and 31 (Figure [1](https://arxiv.org/html/2606.10147#S3.F1) middle and right). The spike is driven by a sparse set of visual tokens, generally at the first visual position of frames, that receive concentrated attention at layer 31 but not at layer 30. This behavior of visual tokens matches the visual attention sinks identified in Kang et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib8)); Luo et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib35)), and Figure [2](https://arxiv.org/html/2606.10147#S3.F2) confirms the sink-token characteristic, with these tokens generally exhibiting much larger $L_2$ norms than the rest of the sequence and activating the same hidden dimensions as the language sink tokens Xiao et al. ([2023](https://arxiv.org/html/2606.10147#bib.bib33)); Sun et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib32)); Gu et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib34)) in the system prompt. Therefore, the attention to these visual tokens is a mechanical artifact of their massive activation, not a sign of meaningful visual information. This motivates the central question: if video attention in the later layers is dominated by sinks, does any audio-visual information actually flow to the prediction through these layers? To answer this, we apply three masking conditions from layer 31 through the final layer and measure their impact on AV-SpeakerBench Nguyen et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib1)), an audio-visual video MCQ benchmark (Table [1](https://arxiv.org/html/2606.10147#S3.T1)). Across all three conditions, accuracy is unchanged or marginally improved. Despite the strong attention weights on the vision sinks, neither audio nor visual information flows through these later layers.
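The per-category attention allocation inspected in this section can be sketched as a sum of the last token's attention weights within each token category. This is a minimal illustration on toy numbers; `attention_allocation` and the category labels are hypothetical names, and any averaging over attention heads is assumed to have been done beforehand.

```python
from typing import Dict, List


def attention_allocation(weights: List[float], labels: List[str]) -> Dict[str, float]:
    """Sum the last token's attention weights per token category at one layer."""
    if len(weights) != len(labels):
        raise ValueError("need exactly one category label per attended position")
    totals: Dict[str, float] = {}
    for weight, label in zip(weights, labels):
        totals[label] = totals.get(label, 0.0) + weight
    return totals


# Toy attention row of the last token over five positions (illustrative values).
row = [0.4, 0.1, 0.1, 0.1, 0.3]
cats = ["system", "video", "video", "audio", "instruction"]
print(attention_allocation(row, cats))
```

Repeating this for every layer gives an allocation curve per category. As the masking results in Table 1 indicate, a high share for one category does not by itself show that information flows from it.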

![Refer to caption](https://arxiv.org/html/2606.10147v1/x2.png)

Figure 2: Vision sinks share the same hidden-state activation as the language sinks. (Left) Hidden state $L_2$ norm at layer 31, with vision sink tokens marked by red circles. (Right) Magnitude per hidden dimension for a system sink and a vision sink, with massive activation peaks at dimensions 318, 1874, and 1819 for both, on Qwen2.5-Omni 3B.

Finding 1: Attention allocation is not a reliable indicator of information flow in AVLLMs. Video attention in later layers is dominated by attention sinks, and audio-visual information does not flow through these deep layers.

## 4 How do audio and visual information flow in audio-visual videos?

In the previous section (Section [3](https://arxiv.org/html/2606.10147#S3)), we showed that attention allocation does not reliably reveal information flow. To trace the information flow, we instead use causal interventions, starting with the single audio-visual video input. We first describe the experimental setup in Section [4.1](https://arxiv.org/html/2606.10147#S4.SS1). Section [4.2](https://arxiv.org/html/2606.10147#S4.SS2) then examines within- and cross-modal interactions, and Section [4.3](https://arxiv.org/html/2606.10147#S4.SS3) traces the route taken to the prediction. We extend the analysis to the multi-input interleaved configuration in Section [5](https://arxiv.org/html/2606.10147#S5).

### 4.1 Experimental setup

We use *Attention Knockout* Geva et al. ([2023](https://arxiv.org/html/2606.10147#bib.bib9)), a causal intervention that selectively blocks specific attention edges and measures the relative change in prediction probability. We apply it on AV-SpeakerBench (Nguyen et al., [2025](https://arxiv.org/html/2606.10147#bib.bib1)), an audio-visual four-way MCQ benchmark. To ensure each knockout measures a degradation rather than a coincidental change, we run all interventions only on samples the model predicts correctly.

##### Attention knockout:

Given a source set of token positions $\mathcal{S}$ (the key side) and a target set $\mathcal{T}$ (the query side), we modify the causal mask $\mathbf{M}$ from Section [2](https://arxiv.org/html/2606.10147#S2) at a chosen subset of layers $\mathcal{L}$ such that $\mathbf{M}^{\ell}_{i,j}=-\infty$ for all $i\in\mathcal{T}$, $j\in\mathcal{S}$, $\ell\in\mathcal{L}$, blocking query positions in $\mathcal{T}$ from attending to key positions in $\mathcal{S}$ while leaving all other attention edges intact. We measure the effect via the relative change in the model’s probability of the predicted answer letter, $\Delta p=(p_{\text{knockout}}-p_{\text{base}})/p_{\text{base}}$, where $p_{\text{base}}$ is the probability under the original causal mask and $p_{\text{knockout}}$ is the probability after the mask modification. A large negative $\Delta p$ indicates the blocked pathway carries information critical to the prediction, while $\Delta p\approx 0$ indicates it is dispensable. Following Geva et al. ([2023](https://arxiv.org/html/2606.10147#bib.bib9)); Zhang et al. ([2025b](https://arxiv.org/html/2606.10147#bib.bib7)); Kim et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib6)), we localize where in the network a pathway operates by applying the knockout within a sliding window of $k$ consecutive layers centered at each layer $\ell$ and sweeping $\ell$ across the full depth, yielding a curve of $\Delta p$ versus center layer. We use $k=7$ unless otherwise noted. We denote a knockout pathway as $\mathcal{S}\not\to\mathcal{T}$; for example, Video $\not\to$ Question denotes blocking the question tokens from attending to the video tokens. We use $\mathcal{S}\not\leftrightarrow\mathcal{T}$ for bidirectional knockouts, where each set serves as both source and target of the other in the interleaved audio-video layout shown in Equation [1](https://arxiv.org/html/2606.10147#S2.E1).
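The knockout procedure above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the implementation used in the paper: the helper names are hypothetical, layers are assumed to be 0-indexed, and clipping the sliding window at the first and last layers is an assumed boundary rule that the text does not specify.

```python
import math
from typing import Iterable, List


def knockout_mask(n: int, sources: Iterable[int], targets: Iterable[int]) -> List[List[float]]:
    """Additive causal mask with M[i][j] = -inf for every query i in targets and key j in sources."""
    src, tgt = set(sources), set(targets)
    mask = [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]
    for i in tgt:
        for j in src:
            mask[i][j] = -math.inf
    return mask


def window_layers(center: int, k: int, num_layers: int) -> List[int]:
    """Layers in a window of k consecutive layers centered at `center`, clipped to the model depth."""
    half = k // 2
    return list(range(max(0, center - half), min(num_layers, center + half + 1)))


def relative_change(p_base: float, p_knockout: float) -> float:
    """Delta p = (p_knockout - p_base) / p_base."""
    return (p_knockout - p_base) / p_base


# Toy example: block query position 3 from attending to key position 1.
mask = knockout_mask(4, sources=[1], targets=[3])
print(mask[3])
print(window_layers(center=10, k=7, num_layers=36))  # 36 is an illustrative depth
print(relative_change(p_base=0.8, p_knockout=0.4))
```

In a real experiment the modified mask would be applied only at the layers returned by `window_layers`, and the probabilities would come from forward passes of the model with and without the intervention. A row whose entries are all negative infinity would make the softmax undefined, so a query should not be blocked from every key it can see.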

##### Dataset and tasks:

Table 2: Five representative task categories in AV-SpeakerBench, with example questions and two of four options per example (separated by ‘;’). Category color indicates the cross-modal direction, visual anchor → audio answer, audio anchor → visual answer, and mixed.

| Task | Example question |
| --- | --- |
| Speech Recognition | What does the man in the blue shirt say just before he puts on a red jacket? *Options:* “Are you sure?”; “No way” |
| Speech Attributes | Among the people who speak, who speaks the most quietly overall? *Options:* The man with glasses; the man with brown hair |
| Visual Recognition | When does the man say “I have to tell you”? *Options:* Just before he sits down; just after he takes off his glasses |
| Speaker Recognition | Who speaks right before the notebook is opened? *Options:* Woman with blonde hair saying “I can’t”; man in black jacket saying “it’s a tiger” |
| Speaker Detection | Does the man in the black shirt speak after the woman hands him the ring? *Options:* No, he only cries; yes, he tries to defend himself |

AV-SpeakerBench is a speaker-centric audio-visual benchmark covering audio perception and visual understanding capabilities. Each task follows an *anchor–target* design (Nguyen et al., [2025](https://arxiv.org/html/2606.10147#bib.bib1)) that requires cross-modal understanding, where the *anchor* is a cue in the question text pointing to an event in one modality (e.g., a visual action or a spoken phrase), and the *answer* must be read off the opposite modality at the moment the anchor identifies. We select a subset of the benchmark’s tasks and group them into five representative categories (Table [2](https://arxiv.org/html/2606.10147#S4.T2)). Full details of the task selection and grouping are in Appendix [D.1](https://arxiv.org/html/2606.10147#A4.SS1).

##### Models:

We use Qwen2.5-Omni Xu et al. ([2025a](https://arxiv.org/html/2606.10147#bib.bib4)) at 3B scale as the main subject model throughout our analysis. Results for Qwen2.5-Omni 7B and Video-SALMONN2 Plus Tang et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib5)) at 3B and 7B scales are reported in Appendices [G](https://arxiv.org/html/2606.10147#A7), [H](https://arxiv.org/html/2606.10147#A8), and [I](https://arxiv.org/html/2606.10147#A9).

### 4.2 Do the modalities interact within themselves or with each other, and where?

![Refer to caption](https://arxiv.org/html/2606.10147v1/x3.png)

Figure 3: Within- and cross-modal interactions concentrate at early-to-middle layers. Change in prediction probability when disconnecting within-modality (Cross-frame, Cross-audio) and direct cross-modal (Audio $\not\leftrightarrow$ Video) pathways, across layers and five AV-SpeakerBench tasks. Cross-frame attention contributes across all tasks, while cross-modal effects vary by task.

First, we investigate whether and where the two modalities interact within themselves and with each other. Following Kim et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib6)), we block cross-frame attention within the video stream, and we additionally block cross-chunk attention within the audio stream and bidirectional cross-modal interaction between both modalities (Audio $\not\leftrightarrow$ Video). Figure [3](https://arxiv.org/html/2606.10147#S4.F3) shows several patterns. First, all interactions are concentrated in the early-to-middle layers, with cross-modal exchange peaking at nearly the same depth as within-modality interaction. Second, cross-frame attention contributes substantially to almost every task. This is expected for visually-grounded tasks like Visual Recognition and Speaker Recognition, but it also holds for tasks with audio-related answers. For example, in Speech Attributes, the question asks “who speaks the most quietly?” with the answer being a person’s identity (e.g., “the man with glasses”). Although the answer concerns audio (relative loudness), the model must first identify *which speaker* corresponds to each visual descriptor, which is a visual task, before it can answer. Third, bidirectional cross-modal interaction varies by task. It carries substantial information for tasks that require fine-grained audio–visual alignment (Speech Recognition, Speaker Detection), where it operates at nearly the same depth as cross-frame attention, but contributes little for tasks that can be solved through visual information alone. Fourth, cross-audio interaction has minimal impact across all tasks, likely because audio tokens within each chunk have already temporally interacted in the audio encoder before reaching the LLM, while video tokens lack such pre-interaction and rely on cross-frame attention within the LLM for temporal context.
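The edge sets blocked in these experiments can be written down explicitly. The sketch below enumerates (query, key) pairs for a cross-frame knockout and for a bidirectional Audio $\not\leftrightarrow$ Video knockout under a causal mask. The helper names, the label strings, and the per-token `frame_ids` list are hypothetical; how tokens are assigned to frames in a given model is an assumption here.

```python
from typing import List, Optional, Tuple


def cross_frame_edges(labels: List[str], frame_ids: List[Optional[int]]) -> List[Tuple[int, int]]:
    """(query, key) pairs where both tokens are video tokens from different frames."""
    n = len(labels)
    return [
        (i, j)
        for i in range(n)
        for j in range(i)  # only earlier keys exist under the causal mask
        if labels[i] == "video" and labels[j] == "video" and frame_ids[i] != frame_ids[j]
    ]


def bidirectional_edges(labels: List[str], a: str, b: str) -> List[Tuple[int, int]]:
    """(query, key) pairs linking the two categories in either direction."""
    n = len(labels)
    return [(i, j) for i in range(n) for j in range(i) if {labels[i], labels[j]} == {a, b}]


# Toy interleaved layout: two video tokens of frame 0, audio, one video token of frame 1, audio.
labels = ["video", "video", "audio", "video", "audio"]
frame_ids = [0, 0, None, 1, None]
print(cross_frame_edges(labels, frame_ids))
print(bidirectional_edges(labels, "audio", "video"))
```

Because audio and video tokens alternate across windows, each modality has later tokens that can attend to earlier tokens of the other, which is why the bidirectional set contains edges with audio queries as well as edges with video queries.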

### 4.3 How and where does audio-visual information travel to the prediction?

![Refer to caption](https://arxiv.org/html/2606.10147v1/x4.png)

Figure 4: Overall audio-visual information flow in AVLLMs. Change in prediction probability across knockouts targeting the question and the last token. Source $\not\to$ Target indicates blocking attention edges from source tokens to target tokens. The flow follows a single sequential pathway with no direct flow from the modalities to the last token.

Next, we trace how audio and visual information reach the prediction. Figure [4](https://arxiv.org/html/2606.10147#S4.F4) reveals a clean indirect route, Modalities → Question → Last, aligning with prior findings on VLMs Zhang et al. ([2025b](https://arxiv.org/html/2606.10147#bib.bib7)) and VideoLLMs Kim et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib6)). At mid layers, video and audio transfer their information into the question tokens, picking up where the modality interactions of Section [4.2](https://arxiv.org/html/2606.10147#S4.SS2) left off. Since the question is positioned after the modalities in the sequence, it acts as the aggregation point for audio-visual content. The contribution of each modality along this route is shaped by the task requirements. Tasks that can be solved through visual contexts (Visual Recognition, Speech Attributes, Speaker Recognition) flow primarily through video, while tasks requiring fine-grained audio information (Speech Recognition, Speaker Detection) draw from both modalities simultaneously. At late layers, the question then carries this combined content to the last token, where the model prediction is formed. A deeper analysis of which question components (correct option, incorrect options, non-option question) carry the audio-visual content is provided in Appendix [F.1](https://arxiv.org/html/2606.10147#A6.SS1).

Finding 2: Audio-visual information follows a single sequential pathway. Within- and cross-modal interactions at early-to-mid layers transfer audio-visual content into the question as the aggregation point, with their relative contribution shaped by the task requirements. Subsequently, the question carries this information to the model’s prediction.

## 5 How does information flow across multiple interleaved inputs?

In Section [4](https://arxiv.org/html/2606.10147#S4), we traced information flow in the single audio-visual video input. However, real-world prompts often arrive as multiple independent images and audio clips interleaved with text instructions. We investigate this multi-input interleaved setting, where these independent audio-visual items are interleaved with the question text. We first describe the experimental setup in Section [5.1](https://arxiv.org/html/2606.10147#S5.SS1). Section [5.2](https://arxiv.org/html/2606.10147#S5.SS2) then traces how information from these independent sources flows to the prediction, and Section [5.3](https://arxiv.org/html/2606.10147#S5.SS3) examines how the aggregated information is translated into a specific answer option.

### 5.1 Experimental setup for multiple interleaved audio-visual inputs

##### Dataset and tasks:

Table 3: Example multi-input interleaved prompts in AV-Odyssey. The model matches a single reference in one modality against four candidates in the opposite modality, with the question text describing the task.

| Direction | Example prompt |
| --- | --- |
| A Ref → I Cand | Which instrument illustrated in images in [img1] [img2] [img3] [img4] do you think best matches audio [audio1]? *Options:* second image; fourth image; third image; first image |
| I Ref → A Cand | Which audio among [audio1] [audio2] [audio3] [audio4] best matches the scene shown in [img1]? *Options:* first audio; second audio; fourth audio; third audio |

AV-Odyssey Gong et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib2)) is a multi-input audio-visual benchmark where the inputs contain multiple independent images and audio clips interleaved with the question text. We focus on the matching subset where the model matches a single *reference* (Ref) item in one modality against four *candidate* (Cand) items in the opposite modality, across two task directions, audio reference to image candidates (A Ref → I Cand) and image reference to audio candidates (I Ref → A Cand), as shown in Table [3](https://arxiv.org/html/2606.10147#S5.T3). Within this subset, samples vary in the order of candidates, reference, and question text. We use the most common ordering in the dataset, *candidates, question, reference*, and report results averaged across the selected tasks. The knockout results for individual tasks, task selection details, and additional dataset information are in Appendices [F.3](https://arxiv.org/html/2606.10147#A6.SS3) and [D.3.1](https://arxiv.org/html/2606.10147#A4.SS3.SSS1).

##### Input structure:

Multi-input interleaved samples consist of several independent items (images and audio clips) interleaved with the question text without temporal alignment. We extend the notation from Section [2](https://arxiv.org/html/2606.10147#S2) as follows: $\mathbf{C}_c$ denotes the tokens of the $c$-th candidate, $\mathbf{Q}$ the tokens of the question text, $\mathbf{R}$ the tokens of the reference, and $\mathbf{O}_o$ the tokens of the $o$-th option letter. Unlike Equation [1](https://arxiv.org/html/2606.10147#S2.E1), where the question text and option letters form a single block, the interleaved structure here separates them, with the reference appearing between the question text and the option letters. The full input sequence, with candidates, question, and reference segments, is

$$
\mathcal{I} \;=\; \Big[\; \underbrace{s_1, \ldots, s_{N_S}}_{\text{system}} \;;\; \underbrace{\mathbf{C}_1, \mathbf{C}_2, \mathbf{C}_3, \mathbf{C}_4}_{\text{candidates}} \;;\; \underbrace{\mathbf{Q}}_{\text{question}} \;;\; \underbrace{\mathbf{R}}_{\text{reference}} \;;\; \underbrace{\mathbf{O}_1, \mathbf{O}_2, \mathbf{O}_3, \mathbf{O}_4}_{\text{options}} \;\Big]. \tag{3}
$$
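As a minimal sketch of the layout in Equation 3, the snippet below maps each named segment to its token positions. The helper `build_segments` and all token counts are hypothetical and chosen only for illustration; real segment lengths depend on the model's tokenizer and its audio and visual encoders.

```python
from typing import Dict, List, Tuple


def build_segments(lengths: List[Tuple[str, int]]) -> Dict[str, range]:
    """Map each named segment to its token positions, in sequence order."""
    segments: Dict[str, range] = {}
    start = 0
    for name, n_tokens in lengths:
        segments[name] = range(start, start + n_tokens)
        start += n_tokens
    return segments


# Hypothetical token counts (not taken from the paper).
layout = [
    ("system", 3),
    ("C1", 4), ("C2", 4), ("C3", 4), ("C4", 4),  # candidates
    ("Q", 6),                                    # question text
    ("R", 5),                                    # reference
    ("O1", 1), ("O2", 1), ("O3", 1), ("O4", 1),  # option letters
]
segments = build_segments(layout)

# The reference sits between the question text and the option letters.
print(segments["Q"], segments["R"], segments["O1"])
```

Keeping segments as index ranges makes it straightforward to name the source and target positions of an intervention without hard-coding offsets.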

### 5.2 How does the model route information across multiple interleaved inputs?

![Refer to caption](https://arxiv.org/html/2606.10147v1/x5.png)

Figure 5: Multi-input interleaved information aggregates at the late-positioned token. At mid layers, candidates interact among themselves (Cross-Candidate), and both candidates and question transfer their content to the reference. At late layers, only the reference reaches the last token.

To trace this flow, we apply attention knockout (Section [4.1](https://arxiv.org/html/2606.10147#S4.SS1.SSS0.Px1)) across candidates, question, and reference. Figure [5](https://arxiv.org/html/2606.10147#S5.F5) reveals the route Candidates + Question → Reference → Last. At mid layers, candidates exchange information among themselves (Cross-Candidate), then both the candidates and the question flow into the reference. The candidates and question reach the reference independently, with no flow between them. At late layers, only the reference flows to the model's prediction; blocking the candidates or question from the last token has negligible effect. The reference therefore plays the role that the question played in audio-visual videos (Section [4.3](https://arxiv.org/html/2606.10147#S4.SS3)), with the late-positioned token serving as the aggregation point in both settings. Since the prediction is one of four answer-choice letters, how does the reference's content translate into a specific option? We answer this in Section [5.3](https://arxiv.org/html/2606.10147#S5.SS3).
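The core operation of attention knockout can be sketched on a toy single-head attention map. This is an illustration under simplifying assumptions, not the authors' implementation: the helper names, the sequence length, and the reference positions are hypothetical, and an actual knockout would be applied inside the model over a window of layers and across attention heads.

```python
import numpy as np


def causal_mask(seq_len: int) -> np.ndarray:
    """Additive mask that hides future positions (upper triangle set to -inf)."""
    mask = np.zeros((seq_len, seq_len))
    mask[np.triu_indices(seq_len, k=1)] = -np.inf
    return mask


def knockout_mask(seq_len: int, targets, sources) -> np.ndarray:
    """Causal mask that additionally blocks each target from attending to each source."""
    mask = causal_mask(seq_len)
    for t in targets:
        for s in sources:
            mask[t, s] = -np.inf
    return mask


def attention_weights(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise softmax over masked attention scores."""
    z = scores + mask
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
seq_len = 8
reference, last = [4, 5], 7  # hypothetical positions
scores = rng.normal(size=(seq_len, seq_len))

baseline = attention_weights(scores, causal_mask(seq_len))
blocked = attention_weights(scores, knockout_mask(seq_len, [last], reference))

# With Reference -> Last knocked out, the last token places no weight on the
# reference positions, and its remaining weights are renormalized.
print(blocked[last].round(3))
```

Comparing the model's prediction probability with and without such a mask is what indicates whether a given source-to-target route carries information the prediction depends on.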

### 5.3 How is the final answer selected?

![Refer to caption](https://arxiv.org/html/2606.10147v1/x6.png)

Figure 6: The prediction flows through the option letters. (a-b) At mid layers, the correct option letter (CorrectOpt) aggregates from the correct and incorrect candidates (CorrectCand, IncorrectCand) and the reference. (c-d) At late layers, the last token reads from both correct and incorrect option letters, with the competition between them driving the prediction.

Next, we trace how the model selects the correct option. We knock out pathways into the correct option (CorrectOpt) and from the option letters into the last token. At mid layers (Figure [6](https://arxiv.org/html/2606.10147#S5.F6)a-b), CorrectOpt draws primarily from the correct candidate, with smaller contributions from the incorrect candidates and reference, and no flow from the question. At late layers (Figure [6](https://arxiv.org/html/2606.10147#S5.F6)c-d), both correct and incorrect (IncorrectOpt) option letters flow to the last token. Blocking CorrectOpt ↛ Last suppresses the correct prediction, while blocking IncorrectOpt ↛ Last *increases* it, indicating that the incorrect options actively compete with the correct one. The decision therefore flows through the option letters, not directly from the candidates. Together, Sections [5.2](https://arxiv.org/html/2606.10147#S5.SS2) and [5.3](https://arxiv.org/html/2606.10147#S5.SS3) reveal two parallel paths to the prediction: (1) Candidates + Question → Reference → Last and (2) Candidates → Option letters → Last. At mid layers, both paths aggregate candidate content, but only the reference receives question content, so the question reaches the prediction exclusively via the reference. At late layers, the last token integrates from both paths independently.

Finding 3: Unlike the single sequential path in audio-visual videos, multi-input interleaved information flows to the prediction through two parallel paths, each with its own late-positioned aggregation point, which integrate independently at the last token.
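The two routes summarized above can be written down as a small directed graph and enumerated. This is only a bookkeeping sketch: the node names and the `all_paths` helper are illustrative, the graph encodes just the two summary paths (the smaller Reference-to-CorrectOpt contribution seen in Figure 6 is left out), and it ignores the layer at which each transfer happens.

```python
from typing import Dict, List, Optional

# Edges taken from the two summary paths of Sections 5.2 and 5.3.
ROUTES: Dict[str, List[str]] = {
    "Candidates": ["Reference", "Options"],
    "Question": ["Reference"],
    "Reference": ["Last"],
    "Options": ["Last"],
}


def all_paths(graph: Dict[str, List[str]], start: str, goal: str,
              prefix: Optional[List[str]] = None) -> List[List[str]]:
    """Enumerate every simple path from start to goal by depth-first search."""
    path = (prefix or []) + [start]
    if start == goal:
        return [path]
    found: List[List[str]] = []
    for nxt in graph.get(start, []):
        if nxt not in path:  # avoid revisiting nodes
            found.extend(all_paths(graph, nxt, goal, path))
    return found


candidate_paths = all_paths(ROUTES, "Candidates", "Last")
question_paths = all_paths(ROUTES, "Question", "Last")

for p in candidate_paths + question_paths:
    print(" -> ".join(p))
```

In this simplified graph, candidate content has two routes to the last token while question content has one, which passes through the reference.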

## 6 Do we still need multimodal and text tokens after information transfer?

Sections [4](https://arxiv.org/html/2606.10147#S4) and [5](https://arxiv.org/html/2606.10147#S5) establish that audio, visual, and non-option question tokens transfer their content to the late-positioned aggregation token (with the non-option question analysis for video reported in Appendix [F.1.2](https://arxiv.org/html/2606.10147#A6.SS1.SSS2)). Combined with the attention sink result of Section [3](https://arxiv.org/html/2606.10147#S3), this implies that these tokens become dispensable once their content has been transferred and can be discarded from the sequence. Previously, Zhang et al. ([2025b](https://arxiv.org/html/2606.10147#bib.bib7)) demonstrated this token removal for image tokens in VLMs, and Kim et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib6)) validated effective pathways in VideoLLMs through attention masking. We extend these findings to AVLLMs by discarding audio, visual, and non-option question tokens, each at the distinct layer where its information transfer completes, in both the single audio-visual video and multi-input interleaved configurations. We evaluate discarding video, audio, and non-option question tokens (single video) or candidates, reference, and non-option question tokens (multi-input), individually or all together. Each setting covers the task analyzed in Sections [4](https://arxiv.org/html/2606.10147#S4) and [5](https://arxiv.org/html/2606.10147#S5). We further evaluate generalization in cross-task and cross-dataset settings using AV-SpeakerBench Nguyen et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib1)), AV-Odyssey Gong et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib2)), and WorldSense Hong et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib3)) (details in Appendix [D](https://arxiv.org/html/2606.10147#A4)). Table [4](https://arxiv.org/html/2606.10147#S6.T4) shows that discarding has minimal impact on accuracy, generalizes across tasks and datasets, and reduces average prefill latency.

Finding 4: Multimodal and text tokens can be discarded once their information has been transferred, each token type at the layer where its transfer completes, with minimal impact on accuracy and in some cases a slight improvement.
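A layer-wise discard schedule of this kind can be sketched as follows. The function `kept_positions`, the segment sizes, and the layer indexing convention are assumptions made for illustration; only the discard layers (candidates at $L=25$, question at $L=29$, reference at $L=31$) are the multi-input values reported with Table 4, where $L$ is the layer after which tokens are discarded.

```python
from typing import Dict, List

# Hypothetical segment positions (illustrative sizes only).
SEGMENTS: Dict[str, range] = {
    "system": range(0, 3),
    "candidates": range(3, 19),
    "question": range(19, 25),
    "reference": range(25, 30),
    "options": range(30, 34),
}

# Layer after which each segment is discarded (multi-input values from Table 4).
# Segments not listed here are never discarded.
DISCARD_AFTER: Dict[str, int] = {"candidates": 25, "question": 29, "reference": 31}


def kept_positions(segments: Dict[str, range],
                   discard_after: Dict[str, int],
                   layer: int) -> List[int]:
    """Return the token positions still present in the sequence at `layer`."""
    kept: List[int] = []
    for name, positions in segments.items():
        last_layer = discard_after.get(name)  # None means never discarded
        if last_layer is None or layer <= last_layer:
            kept.extend(positions)
    return sorted(kept)


for layer in (25, 26, 30, 32):
    print(layer, len(kept_positions(SEGMENTS, DISCARD_AFTER, layer)))
```

Because each segment is dropped only after its own transfer layer, the sequence shrinks in stages rather than all at once.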

Table 4: Effect of discarding multi-modal tokens on task accuracy and inference efficiency across analyzed task, cross-task, and cross-dataset settings. $L$ denotes the layer after which the tokens are discarded. Multi-input results are reported in both matching directions, reference → candidates (I → A, A → I). Numbers in parentheses show the change from baseline; entries without a parenthesized change are unchanged from baseline. Sp.: Speech; Vis.: Visual; Rec.: Recognition; Count.: Counting; Vid.: Video; Aud.: Audio; Transp.: Transportation; Ques: non-option Question tokens; All: all token types together.

Video (AV-SpeakerBench knockout; Video and Audio discarded at $L=26$, Ques at $L=29$; cross-dataset tasks from WorldSense). Task columns are grouped, from left to right, into Tasks in Knockout, Cross-task, and Cross-dataset.

| Config / Task | Sp. Rec. | Vis. Rec. | Vis. Count. | Sp. Count. | Vid. Emotion | Aud. Change | Avg. Prefill Latency (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Baseline* | 50.25 | 46.58 | 43.9 | 26.39 | 66.67 | 42.22 | 2288.65 |
| Discard Ques | 50.25 | 46.83 (+0.25) | 44.39 (+0.49) | 26.39 | 66.67 | 40.00 (-2.22) | 2279.97 |
| Discard Audio | 50.25 | 47.55 (+0.97) | 44.39 (+0.49) | 26.74 (+0.35) | 66.67 | 40.00 (-2.22) | 2232.45 |
| Discard Video | 50.75 (+0.50) | 46.10 (-0.48) | 43.9 | 26.74 (+0.35) | 66.67 | 40.00 (-2.22) | 2098.75 |
| Discard All | 49.75 (-0.50) | 46.59 (+0.01) | 42.93 (-0.97) | 26.04 (-0.35) | 66.67 | 42.22 | 2089.47 |

Multi-input (AV-Odyssey knockout; Cand discarded at $L=25$, Ref at $L=31$, Ques at $L=29$). Task columns are grouped, from left to right, into Tasks in Knockout and Cross-task.

| Config / Task | Animal Rec. A → I | Animal Rec. I → A | Bird Rec. A → I | Bird Rec. I → A | Transp. Rec. A → I | Transp. Rec. I → A | Avg. Prefill Latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Baseline* | 61.00 | 38.00 | 29.41 | 33.67 | 46.67 | 23.16 | 558.75 |
| Discard Ques | 62.00 (+1.00) | 38.00 | 30.39 (+0.98) | 34.69 (+1.02) | 44.76 (-1.91) | 24.21 (+1.05) | 550.41 |
| Discard Ref | 63.00 (+2.00) | 40.00 (+2.00) | 29.41 | 33.67 | 46.67 | 25.26 (+2.10) | 552.11 |
| Discard Cand | 63.00 (+2.00) | 39.00 (+1.00) | 32.35 (+2.94) | 31.63 (-2.04) | 47.62 (+0.95) | 25.26 (+2.10) | 533.07 |
| Discard All | 63.00 (+2.00) | 38.00 | 32.35 (+2.94) | 33.67 | 46.67 | 24.21 (+1.05) | 530.62 |
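To put the latency columns of Table 4 in relative terms, the short calculation below computes the percentage reduction in average prefill latency for the Discard All rows. The function name is illustrative; the four input values are the baseline and Discard All latencies reported in the table, and the ratio is independent of the latency unit.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Percentage reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline * 100.0


# Average prefill latency, Baseline vs. Discard All (values from Table 4).
video = relative_reduction(2288.65, 2089.47)
multi_input = relative_reduction(558.75, 530.62)

print(f"video: {video:.2f}%")              # video: 8.70%
print(f"multi-input: {multi_input:.2f}%")  # multi-input: 5.03%
```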
## 7 Discussion, future work, and limitations

Discussion: We present the first comprehensive analysis of information flow in AVLLMs across single audio-visual video and multi-input interleaved configurations. In both configurations, modality content reaches the prediction by being aggregated into a token positioned later in the sequence. A plausible explanation for this aggregator emergence is causal attention, which makes tokens positioned later in the sequence structurally available to absorb upstream content, paired with the finding from Kim et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib6)) that modality tokens progressively align with linguistic embeddings at mid layers, which could plausibly provide the semantic basis for cross-modal information transfer. Concurrent work Selvakumar et al. ([2026](https://arxiv.org/html/2606.10147#bib.bib28)) reports that cross-modal integration concentrates in deep layers in captioning. Our analysis offers an alternative view in the question-answering setting, where attention to video tokens at the late layers is dominated by sinks (Section [3](https://arxiv.org/html/2606.10147#S3)) and discarding different token segments at the distinct layers where their transfer completes has minimal impact on performance (Section [6](https://arxiv.org/html/2606.10147#S6)). The actual integration takes place much earlier, at mid layers where modality tokens transfer to the aggregator. What appears as deep-layer integration in captioning may therefore reflect attention to sinks rather than meaningful integration, though captioning and question-answering may engage distinct routing at late layers.

Future work: Our findings open several research directions. First, our finding that tokens can be discarded after their information is transferred (Section [6](https://arxiv.org/html/2606.10147#S6)) opens a new direction for AVLLM efficiency through token compression at the LLM's internal layers, complementing existing input-level methods that compress audio-visual tokens before they enter the LLM Gong et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib44)); Ding et al. ([2026](https://arxiv.org/html/2606.10147#bib.bib42)); Tao et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib43)); Li and Huang ([2026](https://arxiv.org/html/2606.10147#bib.bib45)). Second, the task-dependent modality contribution we observe (Section [4](https://arxiv.org/html/2606.10147#S4)) raises the question of whether steering AVLLMs to rebalance modality reliance could improve performance on tasks where one modality is underused. Third, visual bias in AVLLMs has been reported by Selvakumar et al. ([2026](https://arxiv.org/html/2606.10147#bib.bib28)) through counterfactual analysis where audio and video are intentionally mismatched. Extending our information flow analysis to these conditions is a natural next step for understanding where visual bias emerges along the routing pathway.

Limitations: Our analysis operates in the multiple-choice question (MCQ) setting, where the prediction is a single answer letter. Open-ended generation tasks such as captioning or free-form dialogue may engage distinct pathways that our setup does not capture.

## 8 Acknowledgments

We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum5 at BSC, Spain.

## References

- \[1\]X\. An, Y\. Xie, K\. Yang, W\. Zhang, X\. Zhao, Z\. Cheng, Y\. Wang, S\. Xu, C\. Chen, D\. Zhu,et al\.\(2025\)Llava\-onevision\-1\.5: fully open framework for democratized multimodal training\.arXiv preprint arXiv:2509\.23661\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[2\]S\. Bai, Y\. Cai, R\. Chen, K\. Chen, X\. Chen, Z\. Cheng, L\. Deng, W\. Ding, C\. Gao, C\. Ge,et al\.\(2025\)Qwen3\-vl technical report\.arXiv preprint arXiv:2511\.21631\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[3\]S\. Basu, M\. Grayson, C\. Morrison, B\. Nushi, S\. Feizi, and D\. Massiceti\(2024\)Understanding information storage and transfer in multi\-modal large language models\.Advances in Neural Information Processing Systems37,pp\. 7400–7426\.Cited by:[§A\.2](https://arxiv.org/html/2606.10147#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p2.1)\.
- \[4\]Z\. Cheng, S\. Leng, H\. Zhang, Y\. Xin, X\. Li, G\. Chen, Y\. Zhu, W\. Zhang, Z\. Luo, D\. Zhao,et al\.\(2024\)Videollama 2: advancing spatial\-temporal modeling and audio understanding in video\-llms\.arXiv preprint arXiv:2406\.07476\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[5\]Y\. Chu, J\. Xu, Q\. Yang, H\. Wei, X\. Wei, Z\. Guo, Y\. Leng, Y\. Lv, J\. He, J\. Lin,et al\.\(2024\)Qwen2\-audio technical report\.arXiv preprint arXiv:2407\.10759\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[6\]A\. Das, A\. Bulat, A\. Baldrati, I\. M\. Metaxas, B\. Schiele, G\. Tzimiropoulos, and B\. Martinez\(2026\)More images, more problems? a controlled analysis of vlm failure modes\.arXiv preprint arXiv:2601\.07812\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p2.1),[§1](https://arxiv.org/html/2606.10147#S1.p3.1)\.
- \[7\]Y\. Ding, Y\. Ji, J\. Li, X\. Liu, X\. Chen, J\. Wu, B\. Li, B\. Zeng, Y\. Shi, Y\. Guan,et al\.\(2026\)OmniSIFT: modality\-asymmetric token compression for efficient omni\-modal large language models\.arXiv preprint arXiv:2602\.04804\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p1.1),[§7](https://arxiv.org/html/2606.10147#S7.p2.1)\.
- \[8\]N\. Elhage, N\. Nanda, C\. Olsson, T\. Henighan, N\. Joseph, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly,et al\.\(2021\)A mathematical framework for transformer circuits\.Transformer Circuits Thread1\(1\),pp\. 12\.Cited by:[§A\.2](https://arxiv.org/html/2606.10147#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p2.1)\.
- \[9\]C\. Fu, H\. Lin, X\. Wang, Y\. Zhang, Y\. Shen, X\. Liu, H\. Cao, Z\. Long, H\. Gao, K\. Li,et al\.\(2025\)Vita\-1\.5: towards gpt\-4o level real\-time vision and speech interaction\.arXiv preprint arXiv:2501\.01957\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[10\]M\. Geva, J\. Bastings, K\. Filippova, and A\. Globerson\(2023\)Dissecting recall of factual associations in auto\-regressive language models\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 12216–12235\.Cited by:[§A\.2](https://arxiv.org/html/2606.10147#A1.SS2.p1.1),[Appendix E](https://arxiv.org/html/2606.10147#A5.p1.15),[§1](https://arxiv.org/html/2606.10147#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.10147#S4.SS1.SSS0.Px1.p1.23),[§4\.1](https://arxiv.org/html/2606.10147#S4.SS1.p1.1)\.
- \[11\]S\. Ghosh, S\. Kumar, A\. Seth, C\. K\. R\. Evuru, U\. Tyagi, S\. Sakshi, O\. Nieto, R\. Duraiswami, and D\. Manocha\(2024\)Gama: a large audio\-language model with advanced audio understanding and complex reasoning abilities\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 6288–6313\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[12\]A\. Goel, S\. Ghosh, J\. Kim, S\. Kumar, Z\. Kong, S\. Lee, C\. H\. Yang, R\. Duraiswami, D\. Manocha, R\. Valle,et al\.\(2025\)Audio flamingo 3: advancing audio intelligence with fully open large audio language models\.arXiv preprint arXiv:2507\.08128\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[13\]C\. Gong, D\. Wang, Z\. Wei, Y\. Guo, H\. Zhu, and J\. Chen\(2025\)EchoingPixels: cross\-modal adaptive token reduction for efficient audio\-visual llms\.arXiv preprint arXiv:2512\.10324\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p1.1),[§7](https://arxiv.org/html/2606.10147#S7.p2.1)\.
- \[14\]K\. Gong, K\. Feng, B\. Li, Y\. Wang, M\. Cheng, S\. Yang, J\. Han, B\. Wang, Y\. Bai, Z\. Yang,et al\.\(2024\)Av\-odyssey bench: can your multimodal llms really understand audio\-visual information?\.arXiv preprint arXiv:2412\.02611\.Cited by:[§D\.3](https://arxiv.org/html/2606.10147#A4.SS3.p1.1),[Appendix F](https://arxiv.org/html/2606.10147#A6.p1.1),[Appendix G](https://arxiv.org/html/2606.10147#A7.p1.1),[§5\.1](https://arxiv.org/html/2606.10147#S5.SS1.SSS0.Px1.p1.2),[§6](https://arxiv.org/html/2606.10147#S6.p1.1)\.
- \[15\]Y\. Gong, H\. Luo, A\. H\. Liu, L\. Karlinsky, and J\. Glass\(2023\)Listen, think, and understand\.arXiv preprint arXiv:2305\.10790\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[16\]X\. Gu, T\. Pang, C\. Du, Q\. Liu, F\. Zhang, C\. Du, Y\. Wang, and M\. Lin\(2024\)When attention sink emerges in language models: an empirical view\.arXiv preprint arXiv:2410\.10781\.Cited by:[§3](https://arxiv.org/html/2606.10147#S3.p2.1)\.
- \[17\]J\. Hong, S\. Yan, J\. Cai, X\. Jiang, Y\. Hu, and W\. Xie\(2025\)Worldsense: evaluating real\-world omnimodal understanding for multimodal llms\.arXiv preprint arXiv:2502\.04326\.Cited by:[§D\.2](https://arxiv.org/html/2606.10147#A4.SS2.p1.1),[Appendix F](https://arxiv.org/html/2606.10147#A6.p1.1),[Appendix G](https://arxiv.org/html/2606.10147#A7.p1.1),[Appendix H](https://arxiv.org/html/2606.10147#A8.p1.1),[Appendix I](https://arxiv.org/html/2606.10147#A9.p1.1),[§6](https://arxiv.org/html/2606.10147#S6.p1.1)\.
- \[18\]A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes, A\. Radford,et al\.\(2024\)Gpt\-4o system card\.arXiv preprint arXiv:2410\.21276\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[19\]O\. Kaduri, S\. Bagon, and T\. Dekel\(2025\)What’s in the image? a deep\-dive into the vision of vision language models\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 14549–14558\.Cited by:[§A\.2](https://arxiv.org/html/2606.10147#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p2.1)\.
- \[20\]S\. Kang, J\. Kim, J\. Kim, and S\. J\. Hwang\(2025\)See what you are told: visual attention sink in large multimodal models\.arXiv preprint arXiv:2503\.03321\.Cited by:[§3](https://arxiv.org/html/2606.10147#S3.p2.1)\.
- \[21\]M\. Kim, T\. Kim, and B\. Han\(2025\)Map the flow: revealing hidden pathways of information in videollms\.arXiv preprint arXiv:2510\.13251\.Cited by:[§A\.2](https://arxiv.org/html/2606.10147#A1.SS2.p1.1),[§F\.1\.2](https://arxiv.org/html/2606.10147#A6.SS1.SSS2.p1.3),[§1](https://arxiv.org/html/2606.10147#S1.p2.1),[§1](https://arxiv.org/html/2606.10147#S1.p3.1),[§4\.1](https://arxiv.org/html/2606.10147#S4.SS1.SSS0.Px1.p1.23),[§4\.2](https://arxiv.org/html/2606.10147#S4.SS2.p1.1),[§4\.3](https://arxiv.org/html/2606.10147#S4.SS3.p1.2),[§6](https://arxiv.org/html/2606.10147#S6.p1.1),[§7](https://arxiv.org/html/2606.10147#S7.p1.1)\.
- \[22\]M\. Lee, Y\. Park, D\. Hwang, Y\. Kim, S\. J\. Oh, and J\. Choe\(2026\)Enhancing multi\-image understanding through delimiter token scaling\.arXiv preprint arXiv:2602\.01984\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p2.1),[§1](https://arxiv.org/html/2606.10147#S1.p3.1)\.
- \[23\]B\. Li and T\. Huang\(2026\)DASH: dynamic audio\-driven semantic chunking for efficient omnimodal token compression\.arXiv preprint arXiv:2603\.15685\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p1.1),[§7](https://arxiv.org/html/2606.10147#S7.p2.1)\.
- \[24\]B\. Li, Y\. Zhang, D\. Guo, R\. Zhang, F\. Li, H\. Zhang, K\. Zhang, P\. Zhang, Y\. Li, Z\. Liu,et al\.\(2024\)Llava\-onevision: easy visual task transfer\.arXiv preprint arXiv:2408\.03326\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[25\]C\. Li, Y\. Chen, Y\. Ji, J\. Xu, Z\. Cui, S\. Li, Y\. Zhang, W\. Wang, Z\. Song, D\. Zhang,et al\.\(2025\)Omnivideobench: towards audio\-visual understanding evaluation for omni mllms\.arXiv preprint arXiv:2510\.10689\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[26\]Y\. Li, Y\. Ma, G\. Zhang, R\. Yuan, K\. Zhu, H\. Guo, Y\. Liang, J\. Liu, Z\. Wang, J\. Yang,et al\.\(2024\)Omnibench: towards the future of universal omni\-language models\.arXiv preprint arXiv:2409\.15272\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[27\]Y\. Li, X\. Chen, S\. Jiang, H\. Shi, Z\. Liu, X\. Zhang, N\. Deng, Z\. Xu, Y\. Ma, M\. Zhang,et al\.\(2025\)Uni\-moe\-2\.0\-omni: scaling language\-centric omnimodal large model with advanced moe, training and data\.arXiv preprint arXiv:2511\.12609\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1)\.
- \[28\]H\. Liu, C\. Li, Y\. Li, and Y\. J\. Lee\(2024\)Improved baselines with visual instruction tuning\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 26296–26306\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[29\]Z\. Liu, Y\. Dong, J\. Wang, Z\. Liu, W\. Hu, J\. Lu, and Y\. Rao\(2025\)Ola: pushing the frontiers of omni\-modal language model\.arXiv preprint arXiv:2502\.04328\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1)\.
- \[30\]J\. Luo, W\. Fan, L\. Wang, X\. He, T\. Rahman, P\. Abolmaesumi, and L\. Sigal\(2025\)To sink or not to sink: visual information pathways in large vision\-language models\.arXiv preprint arXiv:2510\.08510\.Cited by:[§3](https://arxiv.org/html/2606.10147#S3.p2.1)\.
- \[31\]N\. Nanda, L\. Chan, T\. Lieberum, J\. Smith, and J\. Steinhardt\(2023\)Progress measures for grokking via mechanistic interpretability\.arXiv preprint arXiv:2301\.05217\.Cited by:[§A\.2](https://arxiv.org/html/2606.10147#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p2.1)\.
- \[32\]C\. Neo, L\. Ong, P\. Torr, M\. Geva, D\. Krueger, and F\. Barez\(2024\)Towards interpreting visual information processing in vision\-language models\.arXiv preprint arXiv:2410\.07149\.Cited by:[§A\.2](https://arxiv.org/html/2606.10147#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p2.1)\.
- \[33\]L\. T\. P\. Nguyen, Z\. Yu, S\. L\. Y\. Hang, S\. An, J\. Lee, Y\. Ban, S\. Chung, T\. Nguyen, J\. Maeng, S\. Lee,et al\.\(2025\)See, hear, and understand: benchmarking audiovisual human speech understanding in multimodal large language models\.arXiv preprint arXiv:2512\.02231\.Cited by:[§D\.1](https://arxiv.org/html/2606.10147#A4.SS1.p1.1),[Appendix F](https://arxiv.org/html/2606.10147#A6.p1.1),[Appendix G](https://arxiv.org/html/2606.10147#A7.p1.1),[Appendix H](https://arxiv.org/html/2606.10147#A8.p1.1),[Appendix I](https://arxiv.org/html/2606.10147#A9.p1.1),[§3](https://arxiv.org/html/2606.10147#S3.p2.1),[§4\.1](https://arxiv.org/html/2606.10147#S4.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.10147#S4.SS1.p1.1),[§6](https://arxiv.org/html/2606.10147#S6.p1.1)\.
- \[34\]Y\. Nikankin, D\. Arad, Y\. Gandelsman, and Y\. Belinkov\(2025\)Same task, different circuits: disentangling modality\-specific mechanisms in vlms\.arXiv preprint arXiv:2506\.09047\.Cited by:[§A\.2](https://arxiv.org/html/2606.10147#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p2.1)\.
- \[35\]Y\. Park, M\. Lee, S\. Chun, and J\. Choe\(2025\)Mitigating cross\-image information leakage in lvlms for multi\-image tasks\.arXiv preprint arXiv:2508\.13744\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p2.1),[§1](https://arxiv.org/html/2606.10147#S1.p3.1)\.
- \[36\]D\. Rai, Y\. Zhou, S\. Feng, A\. Saparov, and Z\. Yao\(2024\)A practical review of mechanistic interpretability for transformer\-based language models\.arXiv preprint arXiv:2407\.02646\.Cited by:[§A\.2](https://arxiv.org/html/2606.10147#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p2.1)\.
- \[37\]R\. Selvakumar, K\. Jayakumar, S\. Sakshi, S\. Ghosh, R\. Gao, and D\. Manocha\(2026\)Do audio\-visual large language models really see and hear?\.arXiv preprint arXiv:2604\.02605\.Cited by:[§A\.2](https://arxiv.org/html/2606.10147#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p2.1),[§1](https://arxiv.org/html/2606.10147#S1.p3.1),[§7](https://arxiv.org/html/2606.10147#S7.p1.1),[§7](https://arxiv.org/html/2606.10147#S7.p2.1)\.
- \[38\]L\. Sharkey, B\. Chughtai, J\. Batson, J\. Lindsey, J\. Wu, L\. Bushnaq, N\. Goldowsky\-Dill, S\. Heimersheim, A\. Ortega, J\. Bloom,et al\.\(2025\)Open problems in mechanistic interpretability\.arXiv preprint arXiv:2501\.16496\.Cited by:[§A\.2](https://arxiv.org/html/2606.10147#A1.SS2.p1.1)\.
- \[39\]M\. Sun, X\. Chen, J\. Z\. Kolter, and Z\. Liu\(2024\)Massive activations in large language models\.arXiv preprint arXiv:2402\.17762\.Cited by:[§3](https://arxiv.org/html/2606.10147#S3.p2.1)\.
- \[40\]C\. Tang, Y\. Li, Y\. Yang, J\. Zhuang, G\. Sun, W\. Li, Z\. Ma, and C\. Zhang\(2025\)Video\-salmonn 2: caption\-enhanced audio\-visual large language models\.arXiv preprint arXiv:2506\.15220\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1),[Appendix C](https://arxiv.org/html/2606.10147#A3.SS0.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2606.10147#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.10147#S4.SS1.SSS0.Px3.p1.1)\.
- \[41\]C\. Tang, W\. Yu, G\. Sun, X\. Chen, T\. Tan, W\. Li, L\. Lu, Z\. Ma, and C\. Zhang\(2023\)Salmonn: towards generic hearing abilities for large language models\.arXiv preprint arXiv:2310\.13289\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[42\]K\. Tao, K\. Shao, B\. Yu, W\. Wang, H\. Wang,et al\.\(2025\)OmniZip: audio\-guided dynamic token compression for fast omnimodal large language models\.arXiv preprint arXiv:2511\.14582\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p1.1),[§7](https://arxiv.org/html/2606.10147#S7.p2.1)\.
- \[43\]G\. Team, R\. Anil, S\. Borgeaud, J\. Alayrac, J\. Yu, R\. Soricut, J\. Schalkwyk, A\. M\. Dai, A\. Hauth, K\. Millican,et al\.\(2023\)Gemini: a family of highly capable multimodal models\.arXiv preprint arXiv:2312\.11805\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[44\]Q\. Team\(2026\)Qwen3\.5\-omni technical report\.arXiv preprint arXiv:2604\.15804\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[45\]S\. Tong, E\. Brown, P\. Wu, S\. Woo, M\. Middepogu, S\. C\. Akula, J\. Yang, S\. Yang, A\. Iyer, X\. Pan,et al\.\(2024\)Cambrian\-1: a fully open, vision\-centric exploration of multimodal llms\.Advances in Neural Information Processing Systems37,pp\. 87310–87356\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[46\]W\. Wang, Z\. Gao, L\. Gu, H\. Pu, L\. Cui, X\. Wei, Z\. Liu, L\. Jing, S\. Ye, J\. Shao,et al\.\(2025\)Internvl3\.5: advancing open\-source multimodal models in versatility, reasoning, and efficiency\.arXiv preprint arXiv:2508\.18265\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[47\]Y\. Wei, Y\. Miao, D\. Zhou, and D\. Hu\(2025\)Moka: multimodal low\-rank adaptation for mllms\.arXiv preprint arXiv:2506\.05191\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[48\]G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis\(2023\)Efficient streaming language models with attention sinks\.arXiv preprint arXiv:2309\.17453\.Cited by:[Appendix B](https://arxiv.org/html/2606.10147#A2.p1.1),[§3](https://arxiv.org/html/2606.10147#S3.p2.1)\.
- \[49\]J\. Xu, Z\. Guo, J\. He, H\. Hu, T\. He, S\. Bai, K\. Chen, J\. Wang, Y\. Fan, K\. Dang,et al\.\(2025\)Qwen2\.5\-omni technical report\.arXiv preprint arXiv:2503\.20215\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1),[Appendix C](https://arxiv.org/html/2606.10147#A3.SS0.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2606.10147#S1.p1.1),[§3](https://arxiv.org/html/2606.10147#S3.p1.1),[§4\.1](https://arxiv.org/html/2606.10147#S4.SS1.SSS0.Px3.p1.1)\.
- \[50\]J\. Xu, Z\. Guo, H\. Hu, Y\. Chu, X\. Wang, J\. He, Y\. Wang, X\. Shi, T\. He, X\. Zhu,et al\.\(2025\)Qwen3\-omni technical report\.arXiv preprint arXiv:2509\.17765\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[51\]Y\. Yang, J\. Zhuang, G\. Sun, C\. Tang, Y\. Li, P\. Li, Y\. Jiang, W\. Li, Z\. Ma, and C\. Zhang\(2025\)Audio\-centric video understanding benchmark without text shortcut\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 6580–6598\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[52\]H\. Ye, C\. H\. Yang, A\. Goel, W\. Huang, L\. Zhu, Y\. Su, S\. Lin, A\. Cheng, Z\. Wan, J\. Tian,et al\.\(2025\)OmniVinci: enhancing architecture and data for omni\-modal understanding llm\.arXiv preprint arXiv:2510\.15870\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1)\.
- \[53\]Q\. Ye, Z\. Yu, R\. Shao, X\. Xie, P\. Torr, and X\. Cao\(2024\)Cat: enhancing multimodal large language model to answer questions in dynamic audio\-visual scenarios\.InEuropean Conference on Computer Vision,pp\. 146–164\.Cited by:[§A\.1](https://arxiv.org/html/2606.10147#A1.SS1.p1.1)\.
- \[54\]B\. Zhang, K\. Li, Z\. Cheng, Z\. Hu, Y\. Yuan, G\. Chen, S\. Leng, Y\. Jiang, H\. Zhang, X\. Li,et al\.\(2025\)Videollama 3: frontier multimodal foundation models for image and video understanding\.arXiv preprint arXiv:2501\.13106\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.
- \[55\]Z\. Zhang, S\. Yadav, F\. Han, and E\. Shutova\(2025\)Cross\-modal information flow in multimodal large language models\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 19781–19791\.Cited by:[§A\.2](https://arxiv.org/html/2606.10147#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.10147#S1.p2.1),[§1](https://arxiv.org/html/2606.10147#S1.p3.1),[§4\.1](https://arxiv.org/html/2606.10147#S4.SS1.SSS0.Px1.p1.23),[§4\.3](https://arxiv.org/html/2606.10147#S4.SS3.p1.2),[§6](https://arxiv.org/html/2606.10147#S6.p1.1)\.
- \[56\]Z\. Zhou, R\. Wang, Z\. Wu, and Y\. Jiang\(2025\)Daily\-omni: towards audio\-visual reasoning with temporal alignment across modalities\.arXiv preprint arXiv:2505\.17862\.Cited by:[§1](https://arxiv.org/html/2606.10147#S1.p1.1)\.

## Appendix A Related work

### A.1 Audio-visual large language models (AVLLMs)

Audio-visual large language models (AVLLMs) extend the multimodal LLM paradigm to jointly process audio and visual inputs. The first generation of AVLLMs Cheng et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib23)); Tang et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib5)); Ye et al. ([2024](https://arxiv.org/html/2606.10147#bib.bib53)); Liu et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib54)) couples dedicated audio and visual encoders with a language model to support text-based audio-visual question answering and dialogue. Building on this foundation, omni models Xu et al. ([2025a](https://arxiv.org/html/2606.10147#bib.bib4), [b](https://arxiv.org/html/2606.10147#bib.bib10)); Team ([2026](https://arxiv.org/html/2606.10147#bib.bib36)); Ye et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib11)); Li et al. ([2025b](https://arxiv.org/html/2606.10147#bib.bib55)); Fu et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib37)) introduce temporal alignment between audio and visual streams and unify perception with end-to-end speech generation, pushing AVLLMs toward real-time multimodal interaction. Alongside these architectural advances, a growing body of work targets the inference efficiency of AVLLMs by compressing audio and visual tokens before they reach the language model Gong et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib44)); Ding et al. ([2026](https://arxiv.org/html/2606.10147#bib.bib42)); Tao et al. ([2025](https://arxiv.org/html/2606.10147#bib.bib43)); Li and Huang ([2026](https://arxiv.org/html/2606.10147#bib.bib45)). Distinct from these prior works, we perform a mechanistic interpretability study of AVLLMs, tracing how audio and visual information flows through the model to form the prediction.

### A\.2 Mechanistic interpretability of LLMs and MLLMs

Mechanistic interpretability Sharkey et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib56)\); Rai et al\. \([2024](https://arxiv.org/html/2606.10147#bib.bib51)\); Nanda et al\. \([2023](https://arxiv.org/html/2606.10147#bib.bib50)\); Elhage et al\. \([2021](https://arxiv.org/html/2606.10147#bib.bib52)\); Geva et al\. \([2023](https://arxiv.org/html/2606.10147#bib.bib9)\) is the study of how internal computations in neural networks give rise to their behavior, and has become a popular research direction that has recently extended from LLMs to MLLMs\. In LLMs, prior studies have uncovered the algorithms underlying grokking Nanda et al\. \([2023](https://arxiv.org/html/2606.10147#bib.bib50)\) and the attention head circuits behind transformer computation Elhage et al\. \([2021](https://arxiv.org/html/2606.10147#bib.bib52)\)\. Among the methodological tools developed in this field, attention knockout Geva et al\. \([2023](https://arxiv.org/html/2606.10147#bib.bib9)\) causally intervenes on attention pathways to identify the routes information takes through the model\. Within MLLMs, recent work has examined information storage Basu et al\. \([2024](https://arxiv.org/html/2606.10147#bib.bib24)\), modality\-specific circuits Nikankin et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib25)\), and visual information processing Neo et al\. \([2024](https://arxiv.org/html/2606.10147#bib.bib26)\); Kaduri et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib27)\)\. Most relevant to our study, Zhang et al\. \([2025b](https://arxiv.org/html/2606.10147#bib.bib7)\) and Kim et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib6)\) apply attention knockout to trace information flow in image VLMs and VideoLLMs, respectively\. Concurrent work Selvakumar et al\. \([2026](https://arxiv.org/html/2606.10147#bib.bib28)\) investigates AVLLMs through counterfactual analysis on audio\-visual captioning\. Distinct from these prior works, this study traces how audio and visual information flows inside AVLLMs to form the prediction, in both audio\-visual video and multi\-input interleaved scenarios\.

## Appendix B Additional visualizations of attention sink tokens

![Refer to caption](https://arxiv.org/html/2606.10147v1/x7.png)

Figure 7: L2 norm distribution across token positions at four representative layers \(0, 15, 30, 31\)\. Tokens are colored by type \(system, video, audio, user instruction\)\. High\-norm sink tokens are highlighted with red circles\. The language sink emerges at layer 15 in the system prompt region, while vision sinks emerge later at layer 31 in the video sequence\.

To trace when each type of sink emerges across layers, we analyze the L2 norm distribution at four representative layers \(0, 15, 30, 31\), shown in Figure [7](https://arxiv.org/html/2606.10147#A2.F7)\. At layer 0, vision and audio tokens already have higher norms than language tokens, since they have already passed through the modality encoders before entering the LLM\. By layer 15, a single high\-norm token has emerged in the system prompt region, corresponding to the language sink described in Xiao et al\. \([2023](https://arxiv.org/html/2606.10147#bib.bib33)\)\. This language sink persists through layer 30, while modality tokens remain at low norms with no vision sink yet present\. At layer 31, vision sinks emerge sharply at specific positions in the video sequence, marking the layer at which the visual anchors identified in Section [3](https://arxiv.org/html/2606.10147#S3) first appear\. This progression confirms that vision sinks emerge later than the language sink, despite sharing the same mechanical signature\.
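As a rough illustration of the norm\-based analysis above, the following sketch flags tokens whose L2 norm lies far above the median norm at a single layer\. The function names, the toy hidden states, and the ratio threshold of 10 are illustrative assumptions and are not taken from the paper\.

```python
import math
from statistics import median


def token_l2_norms(hidden):
    """L2 norm of each token's hidden state (one row per token)."""
    return [math.sqrt(sum(x * x for x in row)) for row in hidden]


def find_sink_tokens(hidden, ratio=10.0):
    """Indices of tokens whose norm exceeds `ratio` times the median norm.

    `ratio` is a hypothetical outlier threshold chosen for this sketch.
    """
    norms = token_l2_norms(hidden)
    threshold = ratio * median(norms)
    return [i for i, n in enumerate(norms) if n > threshold]


# Toy layer output: 8 tokens of dimension 4, with token 2 inflated
# to mimic a high-norm sink token.
hidden = [[1.0, 1.0, 1.0, 1.0] for _ in range(8)]
hidden[2] = [100.0, 100.0, 100.0, 100.0]

sink_positions = find_sink_tokens(hidden)
```

Running the same check on the hidden states of each layer would give the layer at which each sink first appears\.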

## Appendix C Experimental details

##### Models and inputs:

We analyze Qwen2\.5\-Omni 3B Xu et al\. \([2025a](https://arxiv.org/html/2606.10147#bib.bib4)\) as the primary model, with additional results on its 7B variant and on Video\-SALMONN2 Plus 3B and 7B Tang et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib5)\) in Appendices [G](https://arxiv.org/html/2606.10147#A7), [H](https://arxiv.org/html/2606.10147#A8), and [I](https://arxiv.org/html/2606.10147#A9)\. All models are loaded from their official Hugging Face checkpoints\. Videos are sampled at 2 FPS with up to 128 frames per video, where each frame is represented by up to a 12×12 grid of visual tokens, and standalone images are represented by up to a 24×24 grid of visual tokens\.
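The sampling settings above imply an upper bound on the number of visual tokens per input\. The sketch below works through that arithmetic\. It assumes the frame count is the floor of duration times FPS and ignores any model\-specific token merging, so the figures are upper bounds under these assumptions rather than exact sequence lengths\. The function names are hypothetical\.

```python
def max_video_tokens(duration_s, fps=2, max_frames=128, grid=12):
    """Upper bound on (frames, visual tokens) for one video.

    Assumes floor(duration * fps) sampled frames, capped at max_frames,
    and a full grid x grid token map per frame.
    """
    frames = min(max_frames, int(duration_s * fps))
    return frames, frames * grid * grid


def max_image_tokens(grid=24):
    """Upper bound on visual tokens for one standalone image."""
    return grid * grid


one_minute = max_video_tokens(60)    # 120 frames, 120 * 144 tokens
long_video = max_video_tokens(600)   # capped at 128 frames
single_image = max_image_tokens()    # 24 * 24 tokens
```

Under these assumptions a one\-minute video contributes at most 17,280 visual tokens, the 128\-frame cap bounds any video at 18,432, and a standalone image contributes at most 576\.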

##### Inference setup:

All experiments use greedy decoding \(do\_sample=False\) for deterministic outputs and are conducted on a single NVIDIA H100 GPU\. To ensure each knockout measures a real degradation rather than a coincidental change, we run all interventions only on samples that the model predicts correctly under the no\-knockout baseline\. A typical knockout run takes around 30 minutes for the audio\-visual video setting and 15 minutes for the multi\-input interleaved setting\.
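A minimal sketch of this filtering step, combined with a relative\-change metric, is given below\. The dictionary keys, the function names, and the exact form of the metric \(percentage change in the correct\-answer probability\) are assumptions made for illustration\.

```python
def relative_change(p_base, p_knock):
    """Percentage change in the correct-answer probability under knockout."""
    return 100.0 * (p_knock - p_base) / p_base


def knockout_effect(samples):
    """Mean relative change over samples the baseline answers correctly.

    Each sample is a dict with hypothetical keys:
      'pred'    - baseline prediction without knockout
      'label'   - ground-truth answer
      'p_base'  - baseline probability of the correct answer
      'p_knock' - probability of the correct answer under knockout
    """
    kept = [s for s in samples if s["pred"] == s["label"]]
    if not kept:
        return None
    changes = [relative_change(s["p_base"], s["p_knock"]) for s in kept]
    return sum(changes) / len(changes)


samples = [
    {"pred": "A", "label": "A", "p_base": 0.8, "p_knock": 0.4},  # kept
    {"pred": "B", "label": "C", "p_base": 0.3, "p_knock": 0.1},  # dropped
    {"pred": "D", "label": "D", "p_base": 0.5, "p_knock": 0.5},  # kept
]
effect = knockout_effect(samples)  # mean of roughly -50% and 0%
```

Dropping the second sample keeps an already\-wrong prediction from being counted as a knockout\-induced degradation\.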

## Appendix D Dataset details

This section provides additional details on the datasets used in our analysis\. AV\-SpeakerBench \(Section [D\.1](https://arxiv.org/html/2606.10147#A4.SS1)\) is the audio\-visual video benchmark used in the main paper, AV\-Odyssey \(Section [D\.3](https://arxiv.org/html/2606.10147#A4.SS3)\) is the multi\-input audio\-visual benchmark also used in the main paper, and WorldSense \(Section [D\.2](https://arxiv.org/html/2606.10147#A4.SS2)\) is an additional audio\-visual video benchmark used in this appendix to provide further evidence on the generality of our findings\.

### D\.1 AV\-SpeakerBench \(audio\-visual video\)

AV\-SpeakerBench \(Nguyen et al\., [2025](https://arxiv.org/html/2606.10147#bib.bib1)\) is a benchmark for evaluating audio–visual reasoning in models that jointly process video and speech\. Each sample is a four\-way multiple\-choice question paired with an audio\-visual video, and the benchmark contains 12 tasks across audio\-centric, speaker\-centric, and visual\-centric domains\.

##### Cross\-modal anchor design:

A key feature of AV\-SpeakerBench is that each question is constructed around an explicit *anchor–target* structure that forces genuine audio–visual integration\. The anchor is a cue described in the question text that points to a specific moment in one modality, and the answer must be read off the opposite modality at that moment\. We summarize the three task types as follows:

- Audio\-centric tasks \(visual anchor → audio answer\): the question describes a visual cue \(e\.g\., *“after the man in the grey shirt wiggles his fingers”*\), and the model must first locate that visual moment, then listen to the audio within that window to extract the answer \(e\.g\., counting how often a phrase is spoken\)\.
- Visual\-centric tasks \(audio anchor → visual answer\): the question describes an audio cue, typically a spoken phrase \(e\.g\., *“after the woman says This is very datable”*\), and the model must locate that moment in the audio, then inspect the video at that timestamp to extract a visual answer \(e\.g\., counting visible people\)\.
- Speaker\-centric tasks \(mixed\): the question may use either a visual or audio anchor, and the answer choices differ in the opposite modality \(e\.g\., visually distinct speakers who utter different lines\)\. Solving these requires jointly tracking identity, timing, and modality, which prevents unimodal shortcuts\.

This anchor structure makes AV\-SpeakerBench particularly well\-suited for our analysis, since the cross\-modal pathway is built directly into the question design rather than emerging incidentally from the content\.

##### Task selection for attention knockout:

For our analysis, we focus on tasks that exhibit the cross\-modal anchor structure described above and exclude tasks where the question can be largely answered from a single modality without the anchor playing a meaningful role\. The 8 selected tasks, totaling 2,281 samples, are organized into 5 categories used throughout Section [4](https://arxiv.org/html/2606.10147#S4)\. Table [5](https://arxiv.org/html/2606.10147#A4.T5) \(left\) lists the selected tasks with their domain and category, and \(right\) reports the per\-category sample counts\. Categories with multiple tasks \(Visual Recognition, Speech Attributes\) merge tasks that share the same anchor direction and answer modality; the other three categories correspond to a single task each\.

Table 5: Left: The 8 tasks selected from AV\-SpeakerBench, with their domain, sample counts, and the category each task is grouped into for our analysis\. Right: The 5 categories with their aggregated sample counts\. Domain color indicates the cross\-modal direction: visual anchor → audio answer, audio anchor → visual answer, and mixed\.

| Domain | Task | \# Samples | Category |
| --- | --- | --- | --- |
| Audio\-centric | Speech Intensity | 206 | Speech Attributes |
| Audio\-centric | Speech Pitch | 206 | Speech Attributes |
| Audio\-centric | Speech Rate | 209 | Speech Attributes |
| Audio\-centric | Speech Recognition | 201 | Speech Recognition |
| Speaker\-centric | Speaker Detection | 427 | Speaker Detection |
| Speaker\-centric | Speaker Recognition | 422 | Speaker Recognition |
| Visual\-centric | Activity Recognition | 206 | Visual Recognition |
| Visual\-centric | Attribute Recognition | 204 | Visual Recognition |
| Total | | 2,281 | |

| Category | \# Samples |
| --- | --- |
| Speech Attributes | 621 |
| Speech Recognition | 201 |
| Visual Recognition | 410 |
| Speaker Recognition | 422 |
| Speaker Detection | 427 |
| Total | 2,281 |

##### Task selection for cross\-task evaluation:

In addition to the analysis tasks, for Section [6](https://arxiv.org/html/2606.10147#S6) we use Speech Counting and Visual Counting from AV\-SpeakerBench\. Table [6](https://arxiv.org/html/2606.10147#A4.T6) reports the domain and sample count for each\.

Table 6: Cross\-task evaluation tasks from AV\-SpeakerBench used in Section [6](https://arxiv.org/html/2606.10147#S6)\. Domain color indicates the cross\-modal direction: visual anchor → audio answer, audio anchor → visual answer\.

| Domain | Task | \# Samples |
| --- | --- | --- |
| Audio\-centric | Speech Counting | 288 |
| Visual\-centric | Visual Counting | 205 |
| Total | | 493 |

### D\.2 WorldSense \(audio\-visual video\)

WorldSense \(Hong et al\., [2025](https://arxiv.org/html/2606.10147#bib.bib3)\) is a benchmark of audio\-visual videos paired with multiple\-choice questions, designed to evaluate joint reasoning over visual, audio, and temporal information\. The benchmark covers 26 task types organized into three task domains: Recognition \(identifying entities or events\), Reasoning \(inferring causal or relational structure\), and Understanding \(comprehending temporal or contextual states\)\.

##### Task selection for attention knockout:

WorldSense contains videos ranging from a few seconds to over eight minutes\. To keep the audio\-visual stream short enough for tractable knockout sweeps and to ensure that all selected samples fit within a comparable input length, we filter to videos under one minute long\. From this filtered set, we select the 10 task types listed in Table [7](https://arxiv.org/html/2606.10147#A4.T7), totaling 418 samples\. The selected videos have a mean duration of 47\.4 seconds \(median 49 seconds, range 17–60 seconds\)\. Compared to AV\-SpeakerBench, WorldSense provides substantially fewer samples per task \(typically 25–64 per task type, against 200–400 in AV\-SpeakerBench\)\. As a result, the per\-task knockout curves on WorldSense are visibly noisier than those on AV\-SpeakerBench\.

Table 7: The 10 tasks selected from WorldSense for attention knockout, restricted to videos under one minute\. The *Domain* column indicates the WorldSense category each task belongs to\.

| Domain | Task | \# Samples |
| --- | --- | --- |
| Recognition | Attribute Recognition | 52 |
| Recognition | Audio Counting | 38 |
| Recognition | Audio Source Localization | 64 |
| Recognition | Event Recognition | 30 |
| Recognition | Scene Recognition | 27 |
| Reasoning | Emotion Change | 37 |
| Reasoning | Object State Change | 25 |
| Understanding | Event Sorting | 46 |
| Understanding | Spatial Relation | 58 |
| Understanding | Text and Diagram Understanding | 41 |
| Total | | 418 |

##### Task selection for cross\-dataset evaluation:

In addition to the tasks used for attention knockout, for Section [6](https://arxiv.org/html/2606.10147#S6) we use Video Emotions and Audio Change from WorldSense as the cross\-dataset evaluation\. Table [8](https://arxiv.org/html/2606.10147#A4.T8) reports the domain and sample count for each\.

Table 8: Cross\-dataset evaluation tasks from WorldSense used in Section [6](https://arxiv.org/html/2606.10147#S6)\.

| Domain | Task | \# Samples |
| --- | --- | --- |
| Reasoning | Audio Change | 45 |
| Understanding | Video Emotions | 27 |

### D\.3 AV\-Odyssey \(multi\-input audio\-visual interleaved\)

AV\-Odyssey \(Gong et al\., [2024](https://arxiv.org/html/2606.10147#bib.bib2)\) is a benchmark for audio\-visual understanding that contains 26 tasks spanning multiple reasoning skills, where each sample interleaves multiple independent images and audio clips with text\. Among these, we focus on the matching subset where the model selects an answer by comparing a single reference item in one modality against multiple candidate items in the opposite modality, which aligns with the multi\-input setting analyzed in Section [5](https://arxiv.org/html/2606.10147#S5)\. This appendix provides additional details on how we process AV\-Odyssey for our analysis: Section [D\.3\.1](https://arxiv.org/html/2606.10147#A4.SS3.SSS1) describes the procedure used to assign each sample an input structure label, and Section [D\.3\.2](https://arxiv.org/html/2606.10147#A4.SS3.SSS2) lists the specific tasks we select and reports the distribution of input structures across them\.

#### D\.3\.1 Input structure assignment for AV\-Odyssey

AV\-Odyssey samples vary in how the candidate media, reference media, and question text are arranged within the prompt\. To support per\-ordering analysis, we automatically classify each sample’s input structure through the following procedure:

1. Parse media placeholders\. We scan the prompt to locate each media placeholder \(e\.g\., \[img1\], \[audio1\]\) and identify the single reference media along with the candidate media in the opposite modality\.
2. Split into ordered segments\. Using the located media placeholders as boundaries, we split the prompt into an ordered sequence of segments alternating between media spans \(candidates and reference\) and text spans\.
3. Identify the question text\. Among the text segments, we assign the role of *question text* to the segment that describes the actual matching task\. If the reference media is the final media in the prompt, we prefer the text segment immediately preceding it \(which typically contains the matching prompt, e\.g\., *“which best matches”*\); otherwise, we select the longest text segment by word count\. All remaining text segments are treated as padding and excluded from the analysis\.
4. Construct the structure label\. The final segment ordering, for example *candidates, question, reference*, is recorded as the sample’s structure label\.

We use the resulting structure labels to bucket samples for both the main\-paper analysis \(which uses the most common ordering, *candidates, question, reference*\) and the per\-ordering breakdowns reported in this appendix\.
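The four\-step labelling procedure can be sketched as follows\. This is an illustrative reconstruction under stated assumptions, not the authors' code: the placeholder syntax follows the \[img1\] and \[audio1\] examples, the reference is taken to be the modality that occurs exactly once, adjacent candidates are collapsed into a single *candidates* span, and the function name `structure_label` is hypothetical\.

```python
import re

# Assumed placeholder syntax, following the [img1] / [audio1] examples.
PLACEHOLDER = re.compile(r"\[(img|audio)\d+\]")


def structure_label(prompt):
    """Return the ordering of candidates, question, and reference."""
    media = list(PLACEHOLDER.finditer(prompt))
    kinds = [m.group(1) for m in media]
    # Step 1: the reference is the modality that occurs exactly once.
    ref_kind = next(k for k in ("img", "audio") if kinds.count(k) == 1)

    # Step 2: split the prompt into ordered (role, content) segments.
    segments, cursor = [], 0
    for m in media:
        text = prompt[cursor:m.start()].strip()
        if text:
            segments.append(("text", text))
        role = "reference" if m.group(1) == ref_kind else "candidates"
        segments.append((role, m.group(0)))
        cursor = m.end()
    tail = prompt[cursor:].strip()
    if tail:
        segments.append(("text", tail))

    # Step 3: choose the question text among the text segments.
    text_pos = [i for i, (r, _) in enumerate(segments) if r == "text"]
    ref_pos = next(i for i, (r, _) in enumerate(segments) if r == "reference")
    ref_is_last_media = kinds[-1] == ref_kind
    if ref_is_last_media and ref_pos > 0 and segments[ref_pos - 1][0] == "text":
        q_pos = ref_pos - 1  # text immediately preceding the reference
    else:
        # Otherwise fall back to the longest text segment by word count.
        q_pos = max(text_pos, key=lambda i: len(segments[i][1].split()),
                    default=None)

    # Step 4: drop padding text and collapse repeated roles into a label.
    label = []
    for i, (role, _) in enumerate(segments):
        if role == "text":
            if i != q_pos:
                continue
            role = "question"
        if not label or label[-1] != role:
            label.append(role)
    return ", ".join(label)


label_a = structure_label(
    "A. [img1] B. [img2] C. [img3] Which image best matches the sound? [audio1]")
label_b = structure_label(
    "Which sound matches this picture? [img1] Options: [audio1] [audio2] [audio3]")
```

In the first prompt the short option letters are treated as padding and the text before the final audio placeholder becomes the question\. In the second prompt the reference is not the final media, so the longest text segment is selected instead\.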

#### D\.3\.2 Selected tasks from AV\-Odyssey

##### Task selection for attention knockout:

For our knockout analysis, we select 7 tasks from AV\-Odyssey that use either a single reference media against multiple candidates of the opposite modality or vice versa\. Table [9](https://arxiv.org/html/2606.10147#A4.T9) lists the 7 selected tasks, totaling 1,304 samples\.

Table 9: The 7 tasks selected from AV\-Odyssey for attention knockout\. A Ref → I Cand denotes audio reference to image candidates, and I Ref → A Cand denotes image reference to audio candidates\.

| Task | \# Samples | A Ref → I Cand | I Ref → A Cand |
| --- | --- | --- | --- |
| Instrument Recognition | 200 | 49% | 51% |
| Animal Recognition | 200 | 50% | 50% |
| Material Recognition | 200 | 100% | – |
| Hazard Recognition | 108 | 100% | – |
| Action Recognition | 196 | 100% | – |
| Music Genre Classification | 200 | 52% | 48% |
| Film and Music Matching | 200 | 100% | – |
| Total | 1,304 | 77% | 23% |

Within each task, the prompts can also vary in how the candidates, question text, and reference are ordered\. Table [10](https://arxiv.org/html/2606.10147#A4.T10) shows the aggregate distribution across all selected samples, and Table [11](https://arxiv.org/html/2606.10147#A4.T11) provides the per\-task breakdown\. The dominant ordering is *candidates, question, reference*, which we use for the main paper analysis; the remaining orderings are analyzed individually in this appendix\.

Table 10: Aggregate distribution of prompt structures across all 1,304 selected samples\.

| Structure | \# Samples | Proportion |
| --- | --- | --- |
| candidates, question, reference | 1,001 | 77% |
| question, reference, candidates | 161 | 12% |
| reference, candidates, question | 124 | 10% |
| reference, question, candidates | 18 | 1% |

Table 11: Per\-task distribution of prompt structures across the 7 selected tasks\. *Cand* = candidates, *Q* = question text, *Ref* = reference\. Each row reports the sample counts for each structure observed in that task\. Cells marked – indicate that the structure does not appear in that task\.

| Task | Cand, Q, Ref | Q, Ref, Cand | Ref, Cand, Q | Ref, Q, Cand |
| --- | --- | --- | --- | --- |
| Instrument Recognition | 46 | 154 | – | – |
| Animal Recognition | 200 | – | – | – |
| Material Recognition | 151 | 7 | 35 | 7 |
| Hazard Recognition | 97 | – | – | 11 |
| Action Recognition | 196 | – | – | – |
| Music Genre Classification | 164 | – | 36 | – |
| Film and Music Matching | 147 | – | 53 | – |
| Total | 1,001 | 161 | 124 | 18 |

##### Task selection for cross\-task evaluation:

In addition to the tasks used for attention knockout, for Section [6](https://arxiv.org/html/2606.10147#S6) we use Bird Recognition and Transportation Recognition from AV\-Odyssey\. Table [12](https://arxiv.org/html/2606.10147#A4.T12) reports the sample count and input layout distribution for each\.

Table 12: Cross\-task evaluation tasks from AV\-Odyssey used in Section [6](https://arxiv.org/html/2606.10147#S6)\. A Ref → I Cand denotes audio reference to image candidates, and I Ref → A Cand denotes image reference to audio candidates\.

| Task | \# Samples | A Ref → I Cand | I Ref → A Cand |
| --- | --- | --- | --- |
| Bird Recognition | 200 | 51% | 39% |
| Transportation Recognition | 200 | 52% | 48% |

## Appendix E Ablation on window size for attention knockout

![Refer to caption](https://arxiv.org/html/2606.10147v1/x8.png)

Figure 8: Window size ablation on the Speaker Recognition task\. Each panel shows the relative change in prediction probability for three pathways \(Video ↛ Question, Audio ↛ Question, Question ↛ Last\) under a different window size k\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

Following Geva et al\. \([2023](https://arxiv.org/html/2606.10147#bib.bib9)\), attention knockout is applied within a sliding window of k layers around each target layer\. We ablate k ∈ \{1, 3, 5, 7, 9, 11\} on the Speaker Recognition task to select the most informative window size\. Figure [8](https://arxiv.org/html/2606.10147#A5.F8) shows that small windows \(k = 1, 3\) produce shallow, noisy drops, as the narrow block is easily bypassed by remaining attention edges\. Large windows \(k = 11\) instead blur the layer localization, with the Question → Last drop spanning a much wider range and losing the recovery point where the pathway concludes\. Both k = 7 and k = 9 give clean and well\-localized drops, but we adopt k = 7 as it preserves the recovery point of the Question → Last pathway, whereas k = 9 tends to merge the recovery into the broader drop, similarly to k = 11\. The choice of k = 7 over k = 9 also accounts for model scale\. The 7B variants have only 28 layers compared to 36 in the 3B model, so a window of k = 9 would cover a relatively larger portion of the network in 7B and obscure the precise layer where each pathway operates\. Adopting k = 7 across both scales preserves localization accuracy in 7B while remaining valid for the 3B model\.
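To make the windowed knockout concrete, the sketch below lists the layers covered by a window of k layers and builds an additive attention mask that blocks target tokens from reading source tokens\. Centering the window on the target layer, clipping it at the model boundaries, and the function names are assumptions made for illustration\. In an actual run the mask would be added to the attention logits of every layer in the window\.

```python
def knockout_window(target_layer, k, num_layers):
    """Layers in a window of k layers centred on target_layer.

    The window is clipped so that it stays inside [0, num_layers).
    """
    half = k // 2
    start = max(0, target_layer - half)
    stop = min(num_layers, target_layer + half + 1)
    return list(range(start, stop))


def knockout_mask(seq_len, sources, targets):
    """Additive mask with -inf where a target (query) reads a source (key).

    Rows index query positions and columns index key positions. Only
    causal edges (key position <= query position) are blocked, since
    the others do not exist in a decoder-only model.
    """
    neg_inf = float("-inf")
    mask = [[0.0] * seq_len for _ in range(seq_len)]
    for q in targets:
        for s in sources:
            if s <= q:
                mask[q][s] = neg_inf
    return mask


window_7b = knockout_window(14, 7, 28)  # layers 11..17 of a 28-layer model
window_edge = knockout_window(0, 7, 36)  # clipped at the first layer
mask = knockout_mask(4, sources=[0, 1], targets=[3])

coverage_3b_k7 = 7 / 36  # about 19% of the layers
coverage_7b_k7 = 7 / 28  # 25% of the layers
coverage_7b_k9 = 9 / 28  # about 32% of the layers
```

The coverage ratios illustrate the scale argument above: a full window of 9 layers spans roughly a third of a 28\-layer model, whereas 7 layers span a quarter\.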

## Appendix F Additional results on Qwen2\.5\-Omni 3B

This appendix provides additional knockout analyses for Qwen2\.5\-Omni 3B, extending the AV\-SpeakerBench Nguyen et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib1)\) results from Section [4](https://arxiv.org/html/2606.10147#S4) and reporting attention knockout on WorldSense Hong et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib3)\) and AV\-Odyssey Gong et al\. \([2024](https://arxiv.org/html/2606.10147#bib.bib2)\)\. While AV\-Odyssey is also analyzed in the main paper, the results there are averaged across tasks\. Here we instead report per\-task knockouts to show that the same pattern holds for each individual task\.

### F\.1 Extended knockout analyses on AV\-SpeakerBench \(audio\-visual video\)

This subsection extends the AV\-SpeakerBench analysis of Section [4](https://arxiv.org/html/2606.10147#S4) with additional knockouts on cross\-modal directions, joint audio\-visual routing, and the question’s internal components\.

#### F\.1\.1 Further analysis of cross\-modal directions and joint audio\-visual routing

![Refer to caption](https://arxiv.org/html/2606.10147v1/x9.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x10.png)

Figure 9: Additional knockout analysis on Qwen2\.5\-Omni 3B\. Top: direction\-specific cross\-modal knockouts \(Video ↛ Audio and Audio ↛ Video\)\. Bottom: joint knockouts where audio and video serve as sources \(V\+A ↛ Question and V\+A ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

We extend the analysis of Section [4](https://arxiv.org/html/2606.10147#S4) with two follow\-up analyses\. The top panel of Figure [9](https://arxiv.org/html/2606.10147#A6.F9) decomposes the bidirectional cross\-modal knockout of Section [4\.2](https://arxiv.org/html/2606.10147#S4.SS2) into its two unidirectional components, Video ↛ Audio and Audio ↛ Video\. The Video → Audio direction is the dominant one on tasks that require fine\-grained audio\-visual alignment \(Speech Recognition, Speaker Detection\), accounting for most of the bidirectional cross\-modal effect observed in Section [4\.2](https://arxiv.org/html/2606.10147#S4.SS2), while Audio → Video plays a smaller role\. The remaining tasks remain near zero in both directions, consistent with Section [4\.2](https://arxiv.org/html/2606.10147#S4.SS2)\. We hypothesize that this asymmetry stems from the input ordering of the time\-aligned sequence, where each video frame is followed by its corresponding audio segment\. Audio tokens can therefore attend back to the time\-aligned video frame, while video tokens can only attend to audio segments from earlier time steps and not to the time\-aligned audio segment, weakening the Audio → Video direction\. The bottom panel groups audio and video into a single source \(V\+A ↛ Question and V\+A ↛ Last\)\. V\+A ↛ Question produces large mid\-layer drops across all tasks, while V\+A ↛ Last produces near\-zero changes, confirming that audio\-visual information reaches the prediction through the question rather than directly, in line with the Modalities → Question → Last route established in Section [4\.3](https://arxiv.org/html/2606.10147#S4.SS3)\.
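The ordering argument behind this asymmetry can be checked with a toy causal mask\. The sketch below simplifies each time step to one video token followed by one audio token, which is an assumption for illustration; real frames and audio segments span many tokens, and the function names are hypothetical\.

```python
def interleaved_order(num_steps):
    """Token order [V0, A0, V1, A1, ...] for a time-aligned sequence."""
    order = []
    for t in range(num_steps):
        order.append(("video", t))  # the frame comes first
        order.append(("audio", t))  # followed by its audio segment
    return order


def can_attend(order, query, key):
    """Under a causal mask, a query sees keys at earlier or equal positions."""
    return order.index(key) <= order.index(query)


order = interleaved_order(3)

# Audio at step 1 can read the time-aligned video frame (Video -> Audio).
audio_sees_aligned_video = can_attend(order, ("audio", 1), ("video", 1))
# Video at step 1 cannot read the time-aligned audio segment.
video_sees_aligned_audio = can_attend(order, ("video", 1), ("audio", 1))
# Video at step 1 can still read audio from the earlier step 0.
video_sees_earlier_audio = can_attend(order, ("video", 1), ("audio", 0))
```

Under this ordering, the time\-aligned edge exists only in the Video → Audio direction, which matches the direction found to dominate in the top panel\.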

#### F\.1\.2 Further analysis of how audio\-visual information reaches the prediction via question components

![Refer to caption](https://arxiv.org/html/2606.10147v1/x11.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x12.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x13.png)

Figure 10: Component\-level question\-internal knockouts on Qwen2\.5\-Omni 3B\. Top: Source ↛ NonOptQ\. Middle: Source ↛ TrueOpt\. Bottom: Source ↛ Last\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

Section [4\.3](https://arxiv.org/html/2606.10147#S4.SS3) treats the question as a single aggregator\. Here we decompose it into three components, the correct\-option letter \(TrueOpt\), the incorrect\-option letters \(FalseOpt\), and the non\-option question text \(NonOptQ\), and trace how audio\-visual content reaches the prediction within the question\. Figure [10](https://arxiv.org/html/2606.10147#A6.F10) reveals two routes converging at the correct\-option letter\. The direct route, Modalities → TrueOpt, sends modality content straight to the option letter\. The indirect route, Modalities → NonOptQ → TrueOpt, first passes through the non\-option question text before reaching the option letter\. Both routes are active on Visual Recognition, Speech Recognition, Speaker Recognition, and Speaker Detection, while Speech Attributes relies on the direct route only, with negligible flow through the non\-option question text\. The correct\-option letter therefore acts as the local aggregator, absorbing audio\-visual content from these routes at mid layers before being read by the last token at late layers\. The incorrect\-option letters and the non\-option question text never flow to the prediction directly, only through the correct\-option letter\. This component\-level view is consistent with previous work in VideoLLMs Kim et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib6)\), which also reports the option token as the decisive integration point with both direct and indirect routes from video to the option\. Our analysis extends this observation to the audio\-visual setting, where the same dual\-route structure governs how both modalities reach the option letter\.

### F\.2 WorldSense \(audio\-visual video\)

Figures [11](https://arxiv.org/html/2606.10147#A6.F11) and [12](https://arxiv.org/html/2606.10147#A6.F12) report the WorldSense knockout results\. Figure [11](https://arxiv.org/html/2606.10147#A6.F11) covers within\- and cross\-modal pathways together with modality and question pathways into the last token, while Figure [12](https://arxiv.org/html/2606.10147#A6.F12) covers the question\-internal pathways and the pathways into the correct option letter and the non\-option question text\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x14.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x15.png)

Figure 11: Qwen2\.5\-Omni 3B on WorldSense\. Knockout of within\- and cross\-modal pathways \(Cross\-frame, Cross\-audio segment, Audio ↔ Video\) and of modality and question pathways into the last token \(Video ↛ Question, Audio ↛ Question, Question ↛ Last, Video ↛ Last, Audio ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x16.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x17.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x18.png)

Figure 12: Qwen2\.5\-Omni 3B on WorldSense\. Modality and question pathways into the correct option letter \(Video ↛ TrueOpt, Audio ↛ TrueOpt, NonOptQ ↛ TrueOpt, V\+A ↛ TrueOpt\); modality pathways into the non\-option question text \(Video ↛ NonOptQ, Audio ↛ NonOptQ, V\+A ↛ NonOptQ\); and question\-internal pathways into the last token \(TrueOpt ↛ Last, FalseOpt ↛ Last, NonOptQ ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

### F\.3 AV\-Odyssey \(multi\-input audio\-visual interleaved\)

This subsection reports per\-task multi\-input interleaved knockouts on AV\-Odyssey\. The main paper reports results averaged across tasks, while Figures [13](https://arxiv.org/html/2606.10147#A6.F13) and [14](https://arxiv.org/html/2606.10147#A6.F14) break results down by individual task and input ordering \(I Ref → A Cand and A Ref → I Cand\) to show consistency across the benchmark\. Figure [13](https://arxiv.org/html/2606.10147#A6.F13) covers pathways into the Reference, the Question, and the last token, while Figure [14](https://arxiv.org/html/2606.10147#A6.F14) covers pathways into the correct option letter and pathways from the correct and incorrect candidates and option letters into the last token\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x19.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x20.png)

Figure 13: Qwen2\.5\-Omni 3B per\-task multi\-input knockout \(AV\-Odyssey\)\. Pathways into the Reference and Question \(Cross\-Candidate, Candidates ↛ Reference, Question ↛ Reference, Candidates ↛ Question\) and pathways into the last token \(Reference ↛ Last, Candidates ↛ Last, Question ↛ Last\)\. Each panel shows one task under one input ordering \(I Ref → A Cand or A Ref → I Cand\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x21.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x22.png)

Figure 14: Qwen2\.5\-Omni 3B per\-task multi\-input knockout \(AV\-Odyssey\)\. Pathways into the correct option letter \(CorrectCand ↛ CorrectOpt, IncorrectCand ↛ CorrectOpt, Reference ↛ CorrectOpt, Question ↛ CorrectOpt\) and finer\-grained pathways into the last token \(CorrectCand ↛ Last, IncorrectCand ↛ Last, CorrectOpt ↛ Last, IncorrectOpt ↛ Last\)\. Each panel shows one task under one input ordering \(I Ref → A Cand or A Ref → I Cand\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

## Appendix G Generalization to Qwen2\.5\-Omni 7B

This appendix reports knockout analyses for Qwen2\.5\-Omni 7B across all three datasets used in the paper, AV\-SpeakerBench Nguyen et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib1)\), WorldSense Hong et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib3)\), and AV\-Odyssey Gong et al\. \([2024](https://arxiv.org/html/2606.10147#bib.bib2)\)\. The setup mirrors the analyses on Qwen2\.5\-Omni 3B in the main paper and Appendix [F](https://arxiv.org/html/2606.10147#A6), allowing a direct comparison across model scales\. For AV\-Odyssey, we report both the averaged knockouts \(matching the main paper’s reporting style\) and the per\-task knockouts for completeness\.

### G\.1 AV\-SpeakerBench \(audio\-visual video\)

Figures [15](https://arxiv.org/html/2606.10147#A7.F15) and [16](https://arxiv.org/html/2606.10147#A7.F16) report the AV\-SpeakerBench knockout results\. Figure [15](https://arxiv.org/html/2606.10147#A7.F15) covers within\- and cross\-modal pathways together with modality and question pathways into the last token, while Figure [16](https://arxiv.org/html/2606.10147#A7.F16) covers the question\-internal pathways and the pathways into the correct option letter and the non\-option question text\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x23.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x24.png)

Figure 15: Qwen2\.5\-Omni 7B on AV\-SpeakerBench\. Knockout of within\- and cross\-modal pathways \(Cross\-frame, Cross\-audio segment, Audio ↔ Video\) and of modality and question pathways into the last token \(Video ↛ Question, Audio ↛ Question, Question ↛ Last, Video ↛ Last, Audio ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x25.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x26.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x27.png)

Figure 16: Qwen2\.5\-Omni 7B on AV\-SpeakerBench\. Modality and question pathways into the correct option letter \(Video ↛ TrueOpt, Audio ↛ TrueOpt, NonOptQ ↛ TrueOpt, V\+A ↛ TrueOpt\); modality pathways into the non\-option question text \(Video ↛ NonOptQ, Audio ↛ NonOptQ, V\+A ↛ NonOptQ\); and question\-internal pathways into the last token \(TrueOpt ↛ Last, FalseOpt ↛ Last, NonOptQ ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

### G\.2 WorldSense \(audio\-visual video\)

Figures [17](https://arxiv.org/html/2606.10147#A7.F17) and [18](https://arxiv.org/html/2606.10147#A7.F18) report the WorldSense knockout results\. Figure [17](https://arxiv.org/html/2606.10147#A7.F17) covers within\- and cross\-modal pathways together with modality and question pathways into the last token, while Figure [18](https://arxiv.org/html/2606.10147#A7.F18) covers the question\-internal pathways and the pathways into the correct option letter and the non\-option question text\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x28.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x29.png)

Figure 17: Qwen2\.5\-Omni 7B on WorldSense\. Knockout of within\- and cross\-modal pathways \(Cross\-frame, Cross\-audio segment, Audio ↔ Video\) and of modality and question pathways into the last token \(Video ↛ Question, Audio ↛ Question, Question ↛ Last, Video ↛ Last, Audio ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x30.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x31.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x32.png)

Figure 18: Qwen2\.5\-Omni 7B on WorldSense\. Modality and question pathways into the correct option letter \(Video ↛ TrueOpt, Audio ↛ TrueOpt, NonOptQ ↛ TrueOpt, V\+A ↛ TrueOpt\); modality pathways into the non\-option question text \(Video ↛ NonOptQ, Audio ↛ NonOptQ, V\+A ↛ NonOptQ\); and question\-internal pathways into the last token \(TrueOpt ↛ Last, FalseOpt ↛ Last, NonOptQ ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

### G\.3 AV\-Odyssey \(multi\-input audio\-visual interleaved\)

This subsection reports multi\-input knockouts on AV\-Odyssey for Qwen2\.5\-Omni 7B in two views: averaged across tasks \(Appendix [G\.3\.1](https://arxiv.org/html/2606.10147#A7.SS3.SSS1)\) and broken down per task \(Appendix [G\.3\.2](https://arxiv.org/html/2606.10147#A7.SS3.SSS2)\)\.

#### G\.3\.1 Averaged Across Tasks

Figure [19](https://arxiv.org/html/2606.10147#A7.F19) reports the multi\-input knockout averaged across tasks, matching the reporting style of the main paper for Qwen2\.5\-Omni 3B\. The same set of pathways into the Reference, Question, correct option letter, and last token is shown\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x33.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x34.png)

Figure 19: Qwen2\.5\-Omni 7B multi\-input knockout \(AV\-Odyssey\), averaged across tasks\. Pathways into the Reference, Question, correct option letter, and last token, mirroring the analyses in the main paper for Qwen2\.5\-Omni 3B\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

#### G\.3\.2 Per\-task breakdown

Figures [20](https://arxiv.org/html/2606.10147#A7.F20) and [21](https://arxiv.org/html/2606.10147#A7.F21) break the same knockouts down by individual task and input ordering \(I Ref → A Cand and A Ref → I Cand\) for completeness\. Figure [20](https://arxiv.org/html/2606.10147#A7.F20) covers pathways into the Reference and Question and pathways into the last token, while Figure [21](https://arxiv.org/html/2606.10147#A7.F21) covers pathways into the correct option letter and the finer\-grained pathways into the last token\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x35.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x36.png)

Figure 20: Qwen2\.5\-Omni 7B per\-task multi\-input knockout \(AV\-Odyssey\)\. Pathways into the Reference and Question \(Cross\-Candidate, Candidates ↛ Reference, Question ↛ Reference, Candidates ↛ Question\) and pathways into the last token \(Reference ↛ Last, Candidates ↛ Last, Question ↛ Last\)\. Each panel shows one task under one input ordering \(I Ref → A Cand or A Ref → I Cand\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x37.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x38.png)

Figure 21: Qwen2\.5\-Omni 7B per\-task multi\-input knockout \(AV\-Odyssey\)\. Pathways into the correct option letter \(CorrectCand ↛ CorrectOpt, IncorrectCand ↛ CorrectOpt, Reference ↛ CorrectOpt, Question ↛ CorrectOpt\) and finer\-grained pathways into the last token \(CorrectCand ↛ Last, IncorrectCand ↛ Last, CorrectOpt ↛ Last, IncorrectOpt ↛ Last\)\. Each panel shows one task under one input ordering \(I Ref → A Cand or A Ref → I Cand\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

## Appendix H Generalization to Video\-SALMONN2 3B Plus

This appendix reports knockout analyses for Video\-SALMONN2 3B Plus on AV\-SpeakerBench Nguyen et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib1)\) and WorldSense Hong et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib3)\), mirroring the analyses on Qwen2\.5\-Omni 3B in the main paper and Appendix [F](https://arxiv.org/html/2606.10147#A6)\.

### H\.1 AV\-SpeakerBench \(audio\-visual video\)

Figures [22](https://arxiv.org/html/2606.10147#A8.F22) and [23](https://arxiv.org/html/2606.10147#A8.F23) report the AV\-SpeakerBench knockout results\. Figure [22](https://arxiv.org/html/2606.10147#A8.F22) covers within\- and cross\-modal pathways together with modality and question pathways into the last token, while Figure [23](https://arxiv.org/html/2606.10147#A8.F23) covers the question\-internal pathways and the pathways into the correct option letter and the non\-option question text\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x39.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x40.png)

Figure 22: Video\-SALMONN2 3B Plus on AV\-SpeakerBench\. Knockout of within\- and cross\-modal pathways \(Cross\-frame, Cross\-audio segment, Audio ↔ Video\) and of modality and question pathways into the last token \(Video ↛ Question, Audio ↛ Question, Question ↛ Last, Video ↛ Last, Audio ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x41.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x42.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x43.png)

Figure 23: Video\-SALMONN2 3B Plus on AV\-SpeakerBench\. Modality and question pathways into the correct option letter \(Video ↛ TrueOpt, Audio ↛ TrueOpt, NonOptQ ↛ TrueOpt, V\+A ↛ TrueOpt\); modality pathways into the non\-option question text \(Video ↛ NonOptQ, Audio ↛ NonOptQ, V\+A ↛ NonOptQ\); and question\-internal pathways into the last token \(TrueOpt ↛ Last, FalseOpt ↛ Last, NonOptQ ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

### H\.2 WorldSense \(audio\-visual video\)

Figures [24](https://arxiv.org/html/2606.10147#A8.F24) and [25](https://arxiv.org/html/2606.10147#A8.F25) report the WorldSense knockout results\. Figure [24](https://arxiv.org/html/2606.10147#A8.F24) covers within\- and cross\-modal pathways together with modality and question pathways into the last token, while Figure [25](https://arxiv.org/html/2606.10147#A8.F25) covers the question\-internal pathways and the pathways into the correct option letter and the non\-option question text\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x44.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x45.png)

Figure 24: Video\-SALMONN2 3B Plus on WorldSense\. Knockout of within\- and cross\-modal pathways \(Cross\-frame, Cross\-audio segment, Audio ↔ Video\) and of modality and question pathways into the last token \(Video ↛ Question, Audio ↛ Question, Question ↛ Last, Video ↛ Last, Audio ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x46.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x47.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x48.png)

Figure 25: Video\-SALMONN2 3B Plus on WorldSense\. Modality and question pathways into the correct option letter \(Video ↛ TrueOpt, Audio ↛ TrueOpt, NonOptQ ↛ TrueOpt, V\+A ↛ TrueOpt\); modality pathways into the non\-option question text \(Video ↛ NonOptQ, Audio ↛ NonOptQ, V\+A ↛ NonOptQ\); and question\-internal pathways into the last token \(TrueOpt ↛ Last, FalseOpt ↛ Last, NonOptQ ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

## Appendix I Generalization to Video\-SALMONN2 7B Plus

This appendix reports knockout analyses for Video\-SALMONN2 7B Plus on AV\-SpeakerBench Nguyen et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib1)\) and WorldSense Hong et al\. \([2025](https://arxiv.org/html/2606.10147#bib.bib3)\), mirroring the analyses on Qwen2\.5\-Omni 3B in the main paper and Appendix [F](https://arxiv.org/html/2606.10147#A6)\.

### I\.1 AV\-SpeakerBench \(audio\-visual video\)

Figures [26](https://arxiv.org/html/2606.10147#A9.F26) and [27](https://arxiv.org/html/2606.10147#A9.F27) report the AV\-SpeakerBench knockout results\. Figure [26](https://arxiv.org/html/2606.10147#A9.F26) covers within\- and cross\-modal pathways together with modality and question pathways into the last token, while Figure [27](https://arxiv.org/html/2606.10147#A9.F27) covers the question\-internal pathways and the pathways into the correct option letter and the non\-option question text\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x49.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x50.png)

Figure 26: Video\-SALMONN2 7B Plus on AV\-SpeakerBench\. Knockout of within\- and cross\-modal pathways \(Cross\-frame, Cross\-audio segment, Audio ↔ Video\) and of modality and question pathways into the last token \(Video ↛ Question, Audio ↛ Question, Question ↛ Last, Video ↛ Last, Audio ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x51.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x52.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x53.png)

Figure 27: Video\-SALMONN2 7B Plus on AV\-SpeakerBench\. Modality and question pathways into the correct option letter \(Video ↛ TrueOpt, Audio ↛ TrueOpt, NonOptQ ↛ TrueOpt, V\+A ↛ TrueOpt\); modality pathways into the non\-option question text \(Video ↛ NonOptQ, Audio ↛ NonOptQ, V\+A ↛ NonOptQ\); and question\-internal pathways into the last token \(TrueOpt ↛ Last, FalseOpt ↛ Last, NonOptQ ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

### I\.2 WorldSense \(audio\-visual video\)

Figures [28](https://arxiv.org/html/2606.10147#A9.F28) and [29](https://arxiv.org/html/2606.10147#A9.F29) report the WorldSense knockout results\. Figure [28](https://arxiv.org/html/2606.10147#A9.F28) covers within\- and cross\-modal pathways together with modality and question pathways into the last token, while Figure [29](https://arxiv.org/html/2606.10147#A9.F29) covers the question\-internal pathways and the pathways into the correct option letter and the non\-option question text\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x54.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x55.png)

Figure 28: Video\-SALMONN2 7B Plus on WorldSense\. Knockout of within\- and cross\-modal pathways \(Cross\-frame, Cross\-audio segment, Audio ↔ Video\) and of modality and question pathways into the last token \(Video ↛ Question, Audio ↛ Question, Question ↛ Last, Video ↛ Last, Audio ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.

![Refer to caption](https://arxiv.org/html/2606.10147v1/x56.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x57.png)

![Refer to caption](https://arxiv.org/html/2606.10147v1/x58.png)

Figure 29: Video\-SALMONN2 7B Plus on WorldSense\. Modality and question pathways into the correct option letter \(Video ↛ TrueOpt, Audio ↛ TrueOpt, NonOptQ ↛ TrueOpt, V\+A ↛ TrueOpt\); modality pathways into the non\-option question text \(Video ↛ NonOptQ, Audio ↛ NonOptQ, V\+A ↛ NonOptQ\); and question\-internal pathways into the last token \(TrueOpt ↛ Last, FalseOpt ↛ Last, NonOptQ ↛ Last\)\. Source ↛ Target indicates blocking attention edges from source tokens to target tokens\.