AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression

arXiv cs.CL 06/24/26, 04:00 AM Papers
audio-video token-compression multimodal long-context retrieval large-language-models
Summary
AVOC introduces a retrieval-inspired token compression method for omni-modal LLMs that effectively handles hour-long audio-video inputs by selecting informative tokens based on relevance, importance, and diversity. The framework achieves state-of-the-art results on long-form audio-video understanding benchmarks, surpassing prior methods by significant margins.
arXiv:2606.24286v1 Announce Type: new Abstract: Multimodal Large Language Models have achieved remarkable progress in short-form audio-video understanding, yet long-form audio-video comprehension remains challenged by limited context windows and severe information redundancy. To address these bottlenecks, we propose AVOC, a framework for long-form audio-video understanding in Omni-modal Large Language Models. AVOC introduces a learnable token compression module between the modality encoders and the LLM backbone. We reframe multimodal token compression as a top-$K$ retrieval problem: given a fixed context budget, the module must retrieve a compact subset of tokens that best supports answering the user query. We draw inspiration from three classical Information Retrieval criteria for selecting informative units from a large candidate pool: relevance, importance, and diversity. AVOC instantiates each criterion as a tailored mechanism for audio-video understanding, and integrates them into a unified retrieval-style compression pipeline. Experiments show that AVOC achieves state-of-the-art performance on long-form audio-video benchmarks, surpassing the second-best model by 4.9 and 5.5 points in average accuracy on OmniVideoBench and LVOmniBench, respectively. Moreover, AVOC maintains robust performance on Audio-Video Needle-in-a-Haystack task at durations up to one hour.
Cached at: 06/24/26, 07:46 AM
# AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression
Source: [https://arxiv.org/html/2606.24286](https://arxiv.org/html/2606.24286)
Yijing Chen¹, Wenhui Tan¹, Xiaoyi Yu¹, Yuyue Wang¹, Xin Cheng¹, Kaisi Guan¹, Hao Jiang², Xiangyang Li², Guojie Zhu², Ruihua Song¹

¹Gaoling School of Artificial Intelligence, Renmin University of China; ²Huawei Inc.

###### Abstract

Multimodal Large Language Models have achieved remarkable progress in short-form audio-video understanding, yet long-form audio-video comprehension remains challenged by limited context windows and severe information redundancy. To address these bottlenecks, we propose AVOC, a framework for long-form audio-video understanding in Omni-modal Large Language Models. AVOC introduces a learnable token compression module between the modality encoders and the LLM backbone. We reframe multimodal token compression as a top-$K$ retrieval problem: given a fixed context budget, the module must retrieve a compact subset of tokens that best supports answering the user query. We draw inspiration from three classical Information Retrieval criteria for selecting informative units from a large candidate pool: *relevance*, *importance*, and *diversity*. AVOC instantiates each criterion as a tailored mechanism for audio-video understanding, and integrates them into a unified retrieval-style compression pipeline. Experiments show that AVOC achieves state-of-the-art performance on long-form audio-video benchmarks, surpassing the second-best model by 4.9 and 5.5 points in average accuracy on OmniVideoBench and LVOmniBench, respectively. Moreover, AVOC maintains robust performance on the Audio-Video Needle-in-a-Haystack task at durations up to one hour.

## 1 Introduction

Multimodal Large Language Models (MLLMs) Xu et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib22)); Team ([2025](https://arxiv.org/html/2606.24286#bib.bib34)); Cheng et al. ([2024](https://arxiv.org/html/2606.24286#bib.bib37)); Tang et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib38)); Cui et al. ([2026](https://arxiv.org/html/2606.24286#bib.bib26)) have made remarkable progress in bridging vision, audio, and natural language. By integrating visual and audio encoders with large language models, existing methods perform well on short-form audio-video tasks such as audio-video question answering, video and audio captioning, and multimodal dialogue Chao et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib39)); Zhou et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib40)); Li et al. ([2024b](https://arxiv.org/html/2606.24286#bib.bib41), [2022](https://arxiv.org/html/2606.24286#bib.bib42)). However, real-world multimodal information (e.g., movies, meeting recordings, and tutorials) typically spans extremely long durations. This requires models not only to comprehend short-form events, but also to reason over and localize key information within hour-level audio-video contexts.

Despite the strong demand, endowing models with hour-level audio-video understanding capabilities still faces severe challenges. On the one hand, the limited context window of MLLMs cannot directly accommodate the massive token sequences produced by extremely long audio-video streams. On the other hand, raw audio-video streams exhibit substantial information redundancy, which not only wastes the precious context budget but also dilutes critical cues, degrading the model’s understanding quality over long sequences. As illustrated in Figure [1](https://arxiv.org/html/2606.24286#S1.F1), existing context-reduction strategies fall short on extremely long audio-video content. Content-agnostic sampling faces a fundamental trade-off: sparse sampling misses critical short-lived events, while dense sampling rapidly exhausts the context window, leading to severe sequence truncation. Recent omni-modal compression methods Tao et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib19)); Ding et al. ([2026](https://arxiv.org/html/2606.24286#bib.bib21)) address this gap, but they usually adopt rigid asymmetric designs, where one modality drives compression of the other. As a result, important events may be discarded when the guiding modality provides weak or sparse signals.

To address the above problems, we propose a new framework called AVOC (Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression). Our starting point is to reframe multimodal token compression as a *top-$K$ retrieval problem*: given a fixed context budget and a large pool of candidate tokens, the model must retrieve a compact subset that best supports answering the user query. This reformulation allows us to leverage classical Information Retrieval (IR) principles for selecting informative units under limited capacity budgets. Among the criteria that IR has long developed for ranking and selecting informative units, three are particularly relevant to our setting: query-conditioned *relevance*, which prioritizes units pertinent to the user query Robertson and Zaragoza ([2009](https://arxiv.org/html/2606.24286#bib.bib54)); Karpukhin et al. ([2020](https://arxiv.org/html/2606.24286#bib.bib55)); query-agnostic *importance*, which captures intrinsic informativeness independent of any specific query Page et al. ([1999](https://arxiv.org/html/2606.24286#bib.bib53)); and result *diversity*, which penalizes redundancy among the selected units Carbonell and Goldstein ([1998](https://arxiv.org/html/2606.24286#bib.bib46)); Clarke et al. ([2008](https://arxiv.org/html/2606.24286#bib.bib50)).
AVOC adapts these IR principles to long audio-video understanding via a learnable compression module that realizes each criterion with a tailored mechanism. *Relevance* is computed via text-guided cross-attention that conditions per-token scores on the user query. *Importance* is computed via bidirectional video-audio cross-attention within each temporal block, providing a query-agnostic signal that complements relevance when the textual query is sparse. *Diversity* is enforced through Temporal-Aware Maximal Marginal Relevance, which penalizes similarities within a local temporal window, suppressing redundant adjacent tokens while preserving recurring events that are temporally distant. Together, these three mechanisms yield a compact, informative token sequence under a tight context budget.

![Refer to caption](https://arxiv.org/html/2606.24286v1/x1.png)
Figure 1: Comparison of context-reduction strategies for long-form audio-video understanding.

The main contributions of this paper can be summarized as follows:

- Viewing multimodal token compression as a top-$K$ retrieval problem over multimodal tokens, we design a learnable compression module that instantiates three classical IR criteria with tailored mechanisms: text-guided cross-attention for query-conditioned *relevance*, bidirectional video-audio cross-attention within each temporal block for query-agnostic *importance*, and Temporal-Aware Maximal Marginal Relevance selection for local *diversity*.
- Built upon this compression module, we develop AVOC, an omni-modal large language model capable of processing hour-level audio-video streams, achieving both holistic comprehension and fine-grained retrieval over ultra-long multimodal content under a tight context budget.
- Extensive experiments demonstrate that AVOC achieves state-of-the-art performance on multiple long-form audio-video understanding benchmarks, surpassing the second-best method by 4.9 and 5.5 points in average accuracy on OmniVideoBench and LVOmniBench, respectively, and maintains robust accuracy on the Audio-Video Needle-in-a-Haystack task at durations up to one hour.

## 2 Related Work

#### Long\-Form Video Understanding in Vision Large Language Models\.

Recent years have witnessed significant progress in extending Vision-Language Models (VLMs) to long-form video understanding Song et al. ([2024](https://arxiv.org/html/2606.24286#bib.bib12)); Zhang et al. ([2025a](https://arxiv.org/html/2606.24286#bib.bib2)); Chen et al. ([2025c](https://arxiv.org/html/2606.24286#bib.bib4)); Tan et al. ([2026](https://arxiv.org/html/2606.24286#bib.bib43)); Shu et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib18)); Li et al. ([2025b](https://arxiv.org/html/2606.24286#bib.bib13)). One line of research focuses on context window extension to ingest full token sequences Liu et al. ([2025a](https://arxiv.org/html/2606.24286#bib.bib1)); Zhang et al. ([2025a](https://arxiv.org/html/2606.24286#bib.bib2)); Chen et al. ([2025c](https://arxiv.org/html/2606.24286#bib.bib4), [b](https://arxiv.org/html/2606.24286#bib.bib5)); Wei and Chen ([2025](https://arxiv.org/html/2606.24286#bib.bib3)), though this kind of approach is computationally prohibitive at long sequence lengths and fails to address the heavy information redundancy in video data. To reduce computational cost and redundancy, numerous compression-based methods have emerged.
These methods generally fall into four categories by underlying mechanism Shao et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib6)): transformation-based approaches that employ spatial or temporal pooling Maaz et al. ([2024](https://arxiv.org/html/2606.24286#bib.bib8)); Weng et al. ([2024](https://arxiv.org/html/2606.24286#bib.bib9)); similarity-based techniques that group and merge redundant tokens across consecutive frames Jin et al. ([2024](https://arxiv.org/html/2606.24286#bib.bib10)); Li et al. ([2025b](https://arxiv.org/html/2606.24286#bib.bib13)); Shen et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib11)); Song et al. ([2024](https://arxiv.org/html/2606.24286#bib.bib12)); attention-based methods that prune tokens based on attention sparsity Chen et al. ([2024](https://arxiv.org/html/2606.24286#bib.bib14)); Yang et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib16)); Zhang et al. ([2026](https://arxiv.org/html/2606.24286#bib.bib15), [2025b](https://arxiv.org/html/2606.24286#bib.bib17)); and query-based strategies that utilize token distillation via dynamic memory banks or cross-modal token selection Song et al. ([2024](https://arxiv.org/html/2606.24286#bib.bib12)); Shu et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib18)); Shen et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib11)); Li et al. ([2024a](https://arxiv.org/html/2606.24286#bib.bib7)). Despite these advancements, current methodologies largely ignore the accompanying audio stream. In real-world multimodal content, such as movies, tutorials, and meetings, auditory signals carry irreplaceable semantic context. By relying strictly on compressed visual cues, existing long-video VLMs inevitably suffer from incomplete semantic comprehension, overlooking critical auditory information such as speech, environmental sounds, and music that is essential for holistic understanding.

#### Unified Audio\-Video Understanding in Omni\-Modal Large Language Models\.

To overcome the visual-centric limitations of VLMs, some recent research has shifted toward the development of Omni-Modal Large Language Models (OLLMs) capable of unified audio-video understanding. To compress the massive information generated by high-resolution video and continuous high-sampling-rate audio into a limited context window, initial OLLMs predominantly relied on content-agnostic operations such as sparse temporal subsampling, basic average pooling, or naive sequence truncation Xu et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib22)); Liu et al. ([2025b](https://arxiv.org/html/2606.24286#bib.bib24)); Ye et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib23)). Lacking content awareness and offering limited compression ratios, these methods fail to enable models to comprehend extremely long audio-video content. To address these bottlenecks, recent studies have introduced dynamic token compression strategies to optimize context window utilization. OmniZip Tao et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib19)) utilizes salient audio tokens to capture information density and guide the pruning rate of corresponding video tokens. Conversely, OmniSIFT Ding et al. ([2026](https://arxiv.org/html/2606.24286#bib.bib21)) argues that human perception is visually anchored; it first prunes spatio-temporal video redundancy and then utilizes the resulting visual anchors to select informative audio tokens. Both OmniZip Tao et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib19)) and OmniSIFT Ding et al. ([2026](https://arxiv.org/html/2606.24286#bib.bib21)) rely on a unidirectional dependency in which either video or audio serves as the dominant modality that drives the compression of the other. This risks discarding key information when the dominant modality is sparse.
These gaps highlight the need for a symmetric and adaptive compression architecture that better models the relations across modalities and maximizes information density within the context window without restrictive asymmetric biases.

## 3 Methodology

![Refer to caption](https://arxiv.org/html/2606.24286v1/x2.png)
Figure 2: Overview of AVOC. The compression module condenses the interleaved video-audio token sequence into a compact subset before passing it to the LLM, guided by three retrieval-inspired criteria: relevance, importance, and diversity.

To enable hour-level audio-video understanding in OLLMs, we introduce a dynamic compression module that jointly condenses continuous visual and auditory streams into a compact sequence of highly informative representations. As illustrated in Figure [2](https://arxiv.org/html/2606.24286#S3.F2), this module is strategically positioned between the modality encoding stage and the large language model backbone.

### 3.1 Problem Formulation and a New Perspective

Following standard practice in OLLMs Xu et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib22)); Team ([2025](https://arxiv.org/html/2606.24286#bib.bib34)); Cui et al. ([2026](https://arxiv.org/html/2606.24286#bib.bib26)); Ye et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib23)), the video and audio streams are first encoded separately and grouped into temporal blocks of equal duration, where each block concatenates the video and audio tokens from the same time window. The blocks are then arranged sequentially to form a unified multimodal token sequence. Let $X = \{x_1, x_2, \dots, x_N\}$ denote the full sequence of interleaved multimodal tokens, where each token $x_i$ carries a temporal block index $\tau_i$ and a modality label $m_i \in \{V, A\}$. Given a text query $T$ and a fixed token budget $K < N$, the goal of our compression module is to select a compact subset $S \subset X$ of size $K$ that best preserves the information needed for downstream reasoning.

This problem can be cast as a top-$K$ retrieval problem over multimodal tokens: each token plays the role of a candidate unit in a retrieval corpus, the text query $T$ serves as the search query, and the budget constraint $|S| = K$ corresponds to retrieving the top-$K$ units that best support answering the query. This IR perspective lets us inherit several design principles that information retrieval has long developed for selecting informative units from a large candidate pool under capacity constraints. In particular, we adopt three well-established criteria from IR: query-conditioned *relevance* that prioritizes units pertinent to the user query, forming the basis of ranking from classical lexical matching to learned neural rankers Robertson and Zaragoza ([2009](https://arxiv.org/html/2606.24286#bib.bib54)); Karpukhin et al. ([2020](https://arxiv.org/html/2606.24286#bib.bib55)); query-agnostic *importance* that captures intrinsic informativeness independent of any specific query, as exemplified by graph-centrality-style scores Page et al. ([1999](https://arxiv.org/html/2606.24286#bib.bib53)); and result *diversity* that penalizes redundancy among the selected units, as exemplified by Maximal Marginal Relevance and related re-ranking schemes Carbonell and Goldstein ([1998](https://arxiv.org/html/2606.24286#bib.bib46)); Clarke et al. ([2008](https://arxiv.org/html/2606.24286#bib.bib50)).

Adapting these IR principles to the long audio\-video setting, we identify three criteria that an informative retrieved token subset should satisfy: \(i\)*relevance*—retrieved tokens should carry information pertinent to the user query; \(ii\)*importance*—retrieved tokens should additionally reflect query\-agnostic informativeness, complementing relevance when the textual query is sparse relative to the rich audio\-video content; and \(iii\)*diversity*—retrieved tokens should contribute minimally overlapping information, so that each occupies the limited context budget with a distinct contribution\.

Guided by these three criteria, we design a learnable, retrieval-style compression module that instantiates each axis with a tailored mechanism, detailed in the following subsections: text-guided cross-attention scoring for relevance (Section [3.2](https://arxiv.org/html/2606.24286#S3.SS2)), bidirectional video-audio cross-attention scoring for importance (Section [3.3](https://arxiv.org/html/2606.24286#S3.SS3)), and Temporal-Aware Maximal Marginal Relevance selection for diversity (Section [3.4](https://arxiv.org/html/2606.24286#S3.SS4)). The first two stages produce per-token scores that play the role of a learned retrieval scorer, and the third stage performs a temporally aware diversity re-ranking over these scores, mirroring the common scoring-and-reranking pattern in IR.

### 3.2 Relevance: Text-Guided Cross-Attention Scoring

In the spirit of query\-document scoring in IR, the text\-guided cross\-attention module treats the text query as the search query and the multimodal tokens as the candidate corpus, and computes a per\-token relevance score that conditions token selection on the user query\.

Let $E_{\text{va}} \in \mathbb{R}^{N \times d}$ denote the embedding matrix of all multimodal tokens in $X$, and let $E_{\text{text}} \in \mathbb{R}^{N_{\text{text}} \times d}$ denote the embeddings of the text query $T$. We project them into query and key spaces using two learnable projection matrices $W^{\text{rel}}_{q}, W^{\text{rel}}_{k} \in \mathbb{R}^{d \times d}$:

$$Q_{\text{text}} = E_{\text{text}} W^{\text{rel}}_{q}, \quad K_{\text{va}} = E_{\text{va}} W^{\text{rel}}_{k}. \tag{1}$$

We compute the cross-attention scores between the text queries and the multimodal tokens using scaled dot-product attention:

$$A^{\text{rel}} = \frac{Q_{\text{text}} K_{\text{va}}^{\top}}{\sqrt{d}}, \tag{2}$$

where $d$ is the hidden dimension size. To determine the overall relevance of each multimodal token $x_i$, we average the attention logits received from all textual tokens $j$:

$$\mathrm{score}_{\text{rel}}(x_i) = \frac{1}{N_{\text{text}}} \sum_{j} A^{\text{rel}}_{j,i}. \tag{3}$$
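The relevance score of Eqs. (1) to (3) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `relevance_scores`, the toy sizes, and the random matrices standing in for learned embeddings and projections are hypothetical and are not taken from the AVOC implementation.

```python
import numpy as np

def relevance_scores(E_text, E_va, W_q, W_k):
    """Mean text-to-token attention logit for every multimodal token."""
    d = E_va.shape[1]
    Q_text = E_text @ W_q                   # (N_text, d), Eq. (1)
    K_va = E_va @ W_k                       # (N, d),      Eq. (1)
    A_rel = Q_text @ K_va.T / np.sqrt(d)    # (N_text, N), Eq. (2)
    return A_rel.mean(axis=0)               # (N,),        Eq. (3)

# Toy sizes; random matrices stand in for learned embeddings and projections.
rng = np.random.default_rng(0)
N_text, N, d = 4, 10, 8
E_text = rng.normal(size=(N_text, d))
E_va = rng.normal(size=(N, d))
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))

scores = relevance_scores(E_text, E_va, W_q, W_k)

# Equivalent form: pool the projected query first, then one dot product per token.
pooled = (E_text @ W_q).mean(axis=0)
scores_pooled = (E_va @ W_k) @ pooled / np.sqrt(d)
```

Because Eq. (3) averages raw logits rather than softmax weights, the score is linear in the query, so mean-pooling the projected text tokens first gives the same values without materializing the full $N_{\text{text}} \times N$ matrix.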

### 3.3 Importance: Video-Audio Cross-Attention Scoring

While text-guided scoring captures the relation between multimodal tokens and the user query, the textual context is often sparse relative to the rich audio-video content of long-form videos: complex reasoning frequently depends on multimodal cues that are not explicitly mentioned in the query. To complement query-conditioned relevance, we additionally compute a query-agnostic *importance* score that estimates intrinsic informativeness, in the spirit of query-independent document priors in IR such as centrality- or popularity-based scores Page et al. ([1999](https://arxiv.org/html/2606.24286#bib.bib53)). Concretely, we use bidirectional cross-modal attention within each temporal block as a learnable proxy for this informativeness: tokens that interact more strongly with the opposing modality receive higher importance scores.

For each temporal block, let $E_m \in \mathbb{R}^{N_m \times d}$ denote the embeddings of modality $m \in \{V, A\}$ within that block, and let $\bar{m}$ denote its opposing modality (i.e., $\bar{V} = A$, $\bar{A} = V$). We compute bidirectional cross-attention between the two modalities to capture cross-modal interaction.

Specifically, we first project the embeddings into query and key spaces using learnable projection matrices $W^{\text{imp}}_{q}, W^{\text{imp}}_{k} \in \mathbb{R}^{d \times d}$:

$$Q_m = E_m W^{\text{imp}}_{q}, \quad K_m = E_m W^{\text{imp}}_{k}, \quad m \in \{V, A\}. \tag{4}$$

The bidirectional cross-attention matrix from modality $\bar{m}$ to $m$ is then computed as:

$$A_{\bar{m}m} = \frac{Q_{\bar{m}} K_m^{\top}}{\sqrt{d}} \in \mathbb{R}^{N_{\bar{m}} \times N_m}. \tag{5}$$

We obtain a token-level importance score by averaging, for each token $x_i$, the attention logits it receives from all tokens $x_j$ of the opposing modality within the same temporal block:

$$\mathrm{score}_{\text{imp}}(x_i) = \frac{1}{N_{\bar{m}_i}} \sum_{j} \left(A_{\bar{m}_i m_i}\right)_{j,i}. \tag{6}$$

With per-token relevance and importance scores in hand, we fuse them into a combined score that drives the subsequent selection. Since $\mathrm{score}_{\text{rel}}(x_i)$ and $\mathrm{score}_{\text{imp}}(x_i)$ arise from different attention mechanisms and modalities, their raw magnitudes are not directly comparable. To place them on a common scale, we apply Z-score normalization within each scoring method and modality, yielding $\mathrm{score}'_{\text{rel}}(x_i)$ and $\mathrm{score}'_{\text{imp}}(x_i)$, and define the combined per-token score as their average:

$$\mathrm{score}(x_i) = \frac{1}{2}\left(\mathrm{score}'_{\text{rel}}(x_i) + \mathrm{score}'_{\text{imp}}(x_i)\right). \tag{7}$$
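A minimal NumPy sketch of Eqs. (4) to (7) for a single temporal block follows. The helper names (`importance_scores`, `zscore`, `combined_score`), the `eps` stabilizer, and the random toy inputs are assumptions made for illustration. The text does not state over which token set the Z-score statistics are computed; this sketch simply normalizes over the tokens passed in.

```python
import numpy as np

def importance_scores(E_v, E_a, W_q, W_k):
    """Cross-modal importance for one temporal block: (video_scores, audio_scores)."""
    d = E_v.shape[1]
    Q_v, K_v = E_v @ W_q, E_v @ W_k         # Eq. (4), projections shared by both modalities
    Q_a, K_a = E_a @ W_q, E_a @ W_k
    A_av = Q_a @ K_v.T / np.sqrt(d)         # (N_A, N_V): audio queries, video keys, Eq. (5)
    A_va = Q_v @ K_a.T / np.sqrt(d)         # (N_V, N_A): video queries, audio keys, Eq. (5)
    # Eq. (6): each token averages the logits it receives from the other modality.
    return A_av.mean(axis=0), A_va.mean(axis=0)

def zscore(x, eps=1e-8):
    # eps is an assumed numerical stabilizer, not part of the paper's description.
    return (x - x.mean()) / (x.std() + eps)

def combined_score(rel, imp):
    """Eq. (7): average of the normalized relevance and importance scores."""
    return 0.5 * (zscore(rel) + zscore(imp))

rng = np.random.default_rng(0)
N_V, N_A, d = 6, 3, 8
E_v, E_a = rng.normal(size=(N_V, d)), rng.normal(size=(N_A, d))
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))

imp_v, imp_a = importance_scores(E_v, E_a, W_q, W_k)
rel_v = rng.normal(size=N_V)   # placeholder relevance scores for the video tokens
score_v = combined_score(rel_v, imp_v)
```

Normalizing each score separately before averaging means neither signal dominates merely because its logits happen to have a larger scale.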

### 3.4 Diversity: Temporal-Aware Maximal Marginal Relevance Selection

Retrieving tokens purely by $\mathrm{score}(x_i)$ often yields severe redundancy, since the temporal continuity of natural audio-video streams causes high-scoring tokens to cluster within adjacent, highly similar segments, an issue analogous to the redundancy problem in top-$K$ retrieval results. Inspired by *result diversification* in IR, we therefore add a diversity-aware re-ranking stage on top of the per-token scores to ensure that the limited token budget is not wasted on redundant tokens.

A natural choice is to greedily select tokens via the Maximal Marginal Relevance (MMR) objective Carbonell and Goldstein ([1998](https://arxiv.org/html/2606.24286#bib.bib46)), a classical diversification method in information retrieval:

$$\text{MMR}(x_i) = (1 - \lambda) \cdot \mathrm{score}(x_i) - \lambda \cdot \max_{x_{i'} \in S_{\text{select}}} \mathrm{sim}(x_i, x_{i'}), \tag{8}$$

where $S_{\text{select}}$ denotes the set of tokens already selected in previous iterations, and $\lambda \in [0, 1]$ balances $\mathrm{score}(x_i)$ against redundancy with $S_{\text{select}}$.

However, conventional MMR is designed in a time-agnostic manner. Applying MMR to long-form audio-video introduces significant bias; for instance, it would incorrectly suppress a semantically similar but temporally distinct event occurring at the end of the video simply because a similar event happened at the beginning. Therefore, we propose Temporal-Aware MMR (TA-MMR). In contrast to MMR, TA-MMR constrains the novelty calculation to a local temporal window. For a candidate token $x_i$ at temporal index $\tau_i$, the objective function is formulated as:

$$\text{TA-MMR}(x_i) = (1 - \lambda) \cdot \mathrm{score}(x_i) - \lambda \cdot \max_{x_{i'} \in S_{\text{select}} \cap \text{Window}(\tau_i)} \mathrm{sim}(x_i, x_{i'}), \tag{9}$$

where $\text{Window}(\tau_i) = [\tau_i - W, \tau_i + W]$ defines the local temporal scope centered at $\tau_i$ with radius $W$, and $\mathrm{sim}(\cdot, \cdot)$ denotes the mean-centered cosine similarity between token representations. The similarity is restricted to tokens of the same modality, as features from different modalities reside in heterogeneous spaces, where directly computing similarity would yield unreliable redundancy estimates. By only penalizing similarities within the local context, TA-MMR suppresses informationally repetitive adjacent tokens while preserving similar but temporally distinct events across the hour-level duration.

Given a total budget $K$, we further split it into a modality-aware budget $(K_{\text{video}}, K_{\text{audio}})$ with $K_{\text{video}} + K_{\text{audio}} = K$, and perform TA-MMR selection independently within each modality. This decoupled allocation prevents either modality from dominating the context budget and ensures a balanced cross-modal representation. For each modality $m \in \{V, A\}$ with budget $K_m$, we iteratively select the token maximizing the TA-MMR objective and add it to $S_{\text{select}}$, until $K_m$ tokens of that modality have been chosen. The selected tokens from both modalities together form the final set $S$, which is re-ordered by temporal index $\tau$ before being passed to the LLM.
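The greedy procedure around Eq. (9) can be sketched for the tokens of one modality as follows. Several details here are assumptions rather than facts from the paper: the function name `ta_mmr_select`, treating the penalty as zero when no selected token lies in the window, and implementing "mean-centered" by subtracting the mean feature vector across tokens. The dense similarity matrix is also only practical for toy sizes.

```python
import numpy as np

def ta_mmr_select(scores, feats, taus, budget, lam=0.15, window=3):
    """Greedy TA-MMR over ONE modality; returns chosen indices in temporal order."""
    scores = np.asarray(scores, dtype=float)
    taus = np.asarray(taus)
    feats = np.asarray(feats, dtype=float)
    feats = feats - feats.mean(axis=0, keepdims=True)        # assumed mean-centering
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = feats @ feats.T                                    # cosine similarity
    near = np.abs(taus[:, None] - taus[None, :]) <= window   # Window(tau_i) membership
    n = len(scores)
    max_sim = np.full(n, -np.inf)   # max similarity to selected tokens inside the window
    chosen = np.zeros(n, dtype=bool)
    for _ in range(min(budget, n)):
        # Assumption: no selected neighbour in the window means no penalty.
        penalty = np.where(np.isfinite(max_sim), max_sim, 0.0)
        objective = (1 - lam) * scores - lam * penalty
        objective[chosen] = -np.inf
        i = int(np.argmax(objective))
        chosen[i] = True
        # Only candidates whose window contains token i are affected by it.
        max_sim = np.where(near[i], np.maximum(max_sim, sim[i]), max_sim)
    idx = np.flatnonzero(chosen)
    return idx[np.argsort(taus[idx], kind="stable")]

# Tokens 0, 1, 2 share one feature; token 2 recurs much later in time.
scores = [1.0, 0.9, 0.8, -1.0]
feats = [[1, 0], [1, 0], [1, 0], [0, 1]]
taus = [0, 1, 10, 1]
local = ta_mmr_select(scores, feats, taus, budget=2, lam=0.5, window=3)
global_ = ta_mmr_select(scores, feats, taus, budget=2, lam=0.5, window=100)
```

In this toy case the local window keeps tokens 0 and 2, the early event and its distant recurrence. Widening the window until it covers the whole sequence, which behaves like plain MMR, instead drops token 2 in favor of the low-scoring token 3. In the setting described above, the function would be called once per modality with budgets $K_{\text{video}}$ and $K_{\text{audio}}$, and the two index sets merged and sorted by $\tau$.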

## 4 Experiments

### 4.1 Implementation Details

#### Model Configuration and Training\.

We build our model upon MiniCPM-o 4.5 Cui et al. ([2026](https://arxiv.org/html/2606.24286#bib.bib26)), initializing the architecture with pre-trained MiniCPM-o 4.5 checkpoints, with the compression module initialized randomly. The model is trained on 40k samples drawn from datasets including AVSD AlAmri et al. ([2019](https://arxiv.org/html/2606.24286#bib.bib56)), How2 Sanabria et al. ([2018](https://arxiv.org/html/2606.24286#bib.bib28)), FineVideo Farré et al. ([2024](https://arxiv.org/html/2606.24286#bib.bib30)), ChronusAV Chen et al. ([2025a](https://arxiv.org/html/2606.24286#bib.bib29)), and LongVILA_sft Chen et al. ([2025c](https://arxiv.org/html/2606.24286#bib.bib4)). This diverse collection encompasses a wide range of tasks, including audio-video speech recognition, video and audio captioning, and audio-video question answering. For video preprocessing, we sample 1 frame per second (FPS) for videos up to 320 seconds, and uniformly sample 320 frames for longer videos.
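The frame-sampling rule can be written as a small helper. This is a sketch under assumptions: the name `sample_frame_times` is hypothetical, and the exact placement of the uniformly sampled frames (here starting at 0 s with a constant stride) is not specified in the text.

```python
def sample_frame_times(duration_s, max_frames=320):
    """Frame timestamps in seconds: 1 FPS up to max_frames seconds, else uniform."""
    if duration_s <= max_frames:
        return [float(t) for t in range(int(duration_s))]
    step = duration_s / max_frames          # constant stride for long videos
    return [i * step for i in range(max_frames)]

short = sample_frame_times(120)     # 2-minute clip, one frame per second
long_ = sample_frame_times(3600)    # 1-hour video, capped at 320 frames
```

For a one-hour video the cap works out to one frame every 11.25 seconds, so the visual stream entering the compression module is already sparse in time.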

The training process is divided into two stages. In Stage 1, we disable the compression module and fine-tune only the LLM on 20k samples for one epoch. In Stage 2, we activate the randomly initialized compression module and jointly train it with the LLM using the remaining 20k samples for one epoch. The vision encoder, audio encoder, and adapter remain frozen throughout. This two-stage training approach aligns the LLM with the training data distribution in Stage 1, ensuring it provides a stable, high-quality gradient signal to the compression module in Stage 2. For the compression module, we use a learning rate of $5 \times 10^{-5}$, whereas the LLM is fine-tuned with a more conservative rate of $5 \times 10^{-6}$. To enable the model to adapt to different compression ratios, the token retention ratio is randomly sampled from a range of 0.1 to 1.0 during each Stage 2 training iteration.

#### Differentiable Top-$K$ via Gumbel-Softmax.

A key challenge in training the compression module is the non-differentiability of the top-$K$ selection process. To enable end-to-end gradient propagation from the next-token prediction loss back to the projection layers of the compression module, we implement a differentiable top-$K$ selection strategy based on Gumbel-Softmax Jang et al. ([2017](https://arxiv.org/html/2606.24286#bib.bib33)). We use a Straight-Through Estimator: in the forward pass, a hard $K$-hot mask is generated to select the discrete set of informative tokens for the subsequent LLM processing. In the backward pass, gradients are propagated through the continuous Gumbel-Softmax relaxation, bypassing the non-differentiable selection. We set the Gumbel-Softmax temperature to $1.0$ to maintain a balance between sampling exploration and selection accuracy. Note that TA-MMR is disabled during training and only activated at inference, as its greedy iterative selection is incompatible with parallel differentiable top-$K$.
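The forward side of this scheme can be illustrated in NumPy. The paper does not spell out how the relaxation is constructed for top-$K$, so the single softmax over Gumbel-perturbed scores used here is an assumption, as is the function name `gumbel_topk_mask`. NumPy has no automatic differentiation, so the sketch only computes the two quantities a straight-through estimator would combine.

```python
import numpy as np

def gumbel_topk_mask(scores, k, tau=1.0, rng=None):
    """Return (hard K-hot mask, soft relaxation) from Gumbel-perturbed scores."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(low=1e-12, high=1.0, size=scores.shape)
    g = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    z = (scores + g) / tau                  # perturbed, temperature-scaled logits
    soft = np.exp(z - z.max())
    soft /= soft.sum()                      # continuous weights (gradient path)
    hard = np.zeros_like(scores)
    hard[np.argsort(-z)[:k]] = 1.0          # discrete K-hot mask (forward path)
    return hard, soft

scores = np.array([2.0, 0.5, 1.5, -1.0, 0.0, 3.0])
hard, soft = gumbel_topk_mask(scores, k=3, tau=1.0, rng=np.random.default_rng(0))
```

In an autograd framework, the straight-through output is commonly written as `hard + soft - stop_gradient(soft)`: its value equals the hard mask, while its gradient is that of the soft weights.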

### 4.2 General Long-Form Audio-Video Understanding Evaluation

Table 1: Performance comparison on long-form audio-video understanding benchmarks. All compared models use LLM backbones at the 7–8B scale. All results are reported as accuracy (%). The best and second-best results are marked in bold and underlined, respectively. The last row reports the absolute improvement of AVOC over the second-best result in each column.

#### Evaluation Settings.

We evaluate AVOC on three long-form audio-video benchmarks: WorldSense Hong et al. ([2025](https://arxiv.org/html/2606.24286#bib.bib49)), OmniVideoBench Li et al. ([2025a](https://arxiv.org/html/2606.24286#bib.bib47)), and LVOmniBench Tao et al. ([2026](https://arxiv.org/html/2606.24286#bib.bib48)). WorldSense (up to 10 min) covers diverse real-world scenarios, and we report its average accuracy. OmniVideoBench (up to 30 min) emphasizes audio-video reasoning with strong modality complementarity; we report results on its ultra-long "(10, 30] min" subset as well as the overall average. LVOmniBench (10–90 min) stratifies questions into Low, Medium, and High difficulty tiers according to factors such as average video duration and information granularity; we report accuracy on the Medium and High subsets together with the average across all tiers.

When evaluating AVOC, we follow the same video preprocessing pipeline used during training: videos shorter than 320 seconds are sampled at 1 FPS, longer videos are uniformly sampled to 320 frames, and the accompanying audio stream is input in full. We activate the compression module and adopt a fixed global token budget of $K=10240$, with a modality token budget allocation ratio of $K_{\text{video}}:K_{\text{audio}}=2{:}1$, matching the video-heavy information density in the target benchmarks. We set the TA-MMR diversity weight to $\lambda=0.15$ and the local temporal window radius to $W=3$. For the baselines, we use the official configurations for each model and evaluate using the maximum permissible number of frames and audio length.
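The preprocessing rule and the budget split can be sketched as follows. The 320-second threshold, the 320-frame cap, $K=10240$, and the 2:1 ratio come from the text; the function names, the choice of interval midpoints for uniform sampling, and the integer rounding of the split are assumptions.

```python
def sample_frame_times(duration_s, threshold_s=320, max_frames=320):
    """Timestamps (in seconds) of the frames fed to the model."""
    if duration_s < threshold_s:
        # Short video: one frame per second.
        n = max(1, int(duration_s))
        return [float(i) for i in range(n)]
    # Long video: max_frames evenly spaced timestamps (midpoints are an assumption).
    step = duration_s / max_frames
    return [(i + 0.5) * step for i in range(max_frames)]


def split_budget(total, video_parts=2, audio_parts=1):
    """Split a global token budget between video and audio (floor rounding assumed)."""
    k_video = total * video_parts // (video_parts + audio_parts)
    return k_video, total - k_video


k_video, k_audio = split_budget(10240)
```

Under these assumptions a 2:1 split of 10240 tokens gives 6826 video tokens and 3414 audio tokens; a different rounding rule would shift this by at most one token.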

#### Performance\.

As shown in Table [1](https://arxiv.org/html/2606.24286#S4.T1), AVOC consistently achieves state-of-the-art performance across all three benchmarks, with absolute gains of 1.7–7.2 points over the second-best results. Two observations are worth highlighting. First, AVOC's advantage grows with video duration: compared to the relatively shorter WorldSense (+1.7), AVOC achieves larger gains on the much longer OmniVideoBench (+4.9 on Avg., +5.2 on the (10, 30] min subset) and LVOmniBench (+5.5 on Avg., up to +7.2 on the Medium-difficulty subset). This indicates that our compression module is particularly effective for ultra-long audio-video understanding, where context-window pressure and information redundancy are most severe. Second, AVOC delivers consistent improvements over OmniZip, the most directly comparable token-compression baseline, with gains of +6.3 on WorldSense, +6.2 on OmniVideoBench Avg., and +7.8 on LVOmniBench Avg. This supports the effectiveness of our compression design.

### 4.3 Audio-Video Needle-in-a-Haystack

#### Evaluation Settings\.

To assess the fine-grained retrieval capability of AVOC over long audio-video streams, we construct an Audio-Video Needle-in-a-Haystack (AV-NIAH) evaluation. For each audio-video haystack, we inject a "needle" carrying a secret keyword, a randomly generated 6-digit numeric string, at a controlled temporal position. The needle is instantiated in two modalities: (i) a *vision needle*, rendered as a caption "The secret word is <needle\>" overlaid on a single video frame; and (ii) an *audio needle*, synthesized via text-to-speech reading "The secret word is <needle\>" and spliced into the audio stream. During evaluation, videos are sampled at 1 FPS and fed into the model together with the full accompanying audio stream. We evaluate the visual and auditory needles separately, prompting the model with the query "What is the secret number?" and requiring it to localize and extract the digit string from the target modality. We sweep over needle depths (the relative temporal position of the needle) and audio-video lengths (up to 3600 seconds), and report accuracy as the exact-match rate between the predicted and ground-truth digit strings. A more detailed description of the evaluation setup is given in Appendix [A.1](https://arxiv.org/html/2606.24286#A1.SS1).
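The needle and the exact-match metric described above can be sketched in a few lines. The helper names are hypothetical, and extracting the first standalone 6-digit run from a free-form answer is an assumption about how predictions are parsed; the paper only states that accuracy is the exact-match rate.

```python
import random
import re


def make_needle(rng, n_digits=6):
    """Random numeric keyword, e.g. a 6-digit string (leading zeros allowed)."""
    return "".join(rng.choice("0123456789") for _ in range(n_digits))


def extract_digits(answer, n_digits=6):
    """First standalone run of exactly n_digits digits in the answer, or None."""
    match = re.search(rf"(?<!\d)\d{{{n_digits}}}(?!\d)", answer)
    return match.group(0) if match else None


def exact_match_accuracy(predictions, needles):
    """Fraction of answers whose extracted digit string equals the needle."""
    hits = sum(extract_digits(p, len(n)) == n for p, n in zip(predictions, needles))
    return hits / len(needles)


needle = make_needle(random.Random(0))
score = exact_match_accuracy(
    ["The secret number is 483920.", "I could not find it."],
    ["483920", "123456"],
)
```

Exact match is a strict criterion: a single wrong digit counts as a miss, which makes the metric sensitive to whether the token carrying the needle survived compression.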

![Refer to caption](https://arxiv.org/html/2606.24286v1/x3.png) Figure 3: Audio-Video Needle-in-a-Haystack results. Each cell reports retrieval accuracy at a given audio-video duration (x-axis) and relative needle depth (y-axis).

#### Performance.

As shown in Figure [3](https://arxiv.org/html/2606.24286#S4.F3), OmniZip exhibits a clear duration-induced collapse: its accuracy degrades substantially beyond 2000s on the vision needle (Figure [3](https://arxiv.org/html/2606.24286#S4.F3)a) and beyond 3000s on the audio needle (Figure [3](https://arxiv.org/html/2606.24286#S4.F3)b). In contrast, AVOC maintains high retrieval accuracy across the entire duration-depth grid for both modalities, with only minor degradation appearing at isolated cells beyond 3000s (Figure [3](https://arxiv.org/html/2606.24286#S4.F3)c, d). These results demonstrate AVOC's capability in ultra-long audio-video context modeling. Additional AV-NIAH results on more baselines are provided in Appendix [A.2](https://arxiv.org/html/2606.24286#A1.SS2).

### 4.4 Ablation Studies

#### Effect of Compression Components\.

To validate the effectiveness of each component in our compression module, we conduct a series of ablation studies on OmniVideoBench and LVOmniBench. All ablated variants in Table [2](https://arxiv.org/html/2606.24286#S4.T2) adopt the same default hyperparameter configuration as in Section [4.2](https://arxiv.org/html/2606.24286#S4.SS2.SSS0.Px1), isolating the effect of compression components. As shown in Table [2](https://arxiv.org/html/2606.24286#S4.T2), using random selection yields a substantial performance drop compared to our full model. This indicates that, under tight token budgets, which tokens are retained matters far more than how many, confirming the necessity of a content-aware compression mechanism for ultra-long audio-video understanding. Beyond this, removing any single component consistently degrades performance, demonstrating that relevance, importance, and diversity contribute complementary rather than redundant signals.

Table 2: Ablation study on the compression components of AVOC. TGS: Text-Guided cross-attention Scoring; VAS: Video-Audio cross-attention Scoring; TA-MMR: Temporal-Aware Maximal Marginal Relevance. "Random" replaces the scoring-and-selection procedure with uniform random sampling under the identical token budget and modality allocation.

![Refer to caption](https://arxiv.org/html/2606.24286v1/x4.png) Figure 4: Ablation on the diversity coefficient $\lambda$ (left) and the local window radius $W$ (right) of TA-MMR.

![Refer to caption](https://arxiv.org/html/2606.24286v1/x5.png) Figure 5: Ablation on modality token budget ratio.

#### Effect of TA\-MMR Hyperparameters\.

We examine the two key hyperparameters of TA-MMR: the diversity coefficient $\lambda$ and the local window radius $W$. As shown in Figure [4](https://arxiv.org/html/2606.24286#S4.F4), both hyperparameters exhibit a unimodal trend on OmniVideoBench and LVOmniBench, peaking at $\lambda=0.15$ and $W=3$. Setting $\lambda$ or $W$ too small leaves adjacent duplicated tokens unpenalized, whereas setting $\lambda$ or $W$ too large biases selection toward tokens that are merely dissimilar. Notably, when $W\to\infty$, TA-MMR degenerates into standard MMR, and the observed performance drop empirically validates the necessity of our temporal-window design.
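To illustrate how $\lambda$ and $W$ interact, the following is a minimal greedy sketch of a temporally windowed MMR. The exact TA-MMR scoring rule is defined in the method section and is not reproduced here; this sketch assumes a selection value of the form score minus $\lambda$ times the maximum cosine similarity to already-selected tokens within $W$ time steps, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def windowed_mmr_select(scores, embeddings, times, k, lam=0.15, window=3):
    """Greedily pick k tokens, penalizing redundancy only inside the time window."""
    selected = []
    remaining = set(range(len(scores)))
    while remaining and len(selected) < k:
        best, best_val = None, -math.inf
        for i in sorted(remaining):
            # Redundancy w.r.t. selected tokens that are temporally close to token i.
            penalty = max(
                (cosine(embeddings[i], embeddings[j])
                 for j in selected if abs(times[i] - times[j]) <= window),
                default=0.0,
            )
            value = scores[i] - lam * penalty  # assumed form of the MMR trade-off
            if value > best_val:
                best, best_val = i, value
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


scores = [1.0, 0.95, 0.9]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # tokens 0 and 1 are duplicates
near = windowed_mmr_select(scores, embeddings, times=[0, 1, 2], k=2)
far = windowed_mmr_select(scores, embeddings, times=[0, 10, 2], k=2)
no_penalty = windowed_mmr_select(scores, embeddings, times=[0, 1, 2], k=2, lam=0.0)
```

In this toy case the duplicate is skipped when it sits next to the first pick, but kept when it lies outside the window or when $\lambda=0$. Passing `window=math.inf` removes the temporal restriction, which corresponds to the standard-MMR limit mentioned in the text.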

#### Effect of Modality Token Budget Ratio\.

We investigate the impact of the modality token budget allocation $K_{\text{video}}:K_{\text{audio}}$. As shown in Figure [5](https://arxiv.org/html/2606.24286#S4.F5), performance on both OmniVideoBench and LVOmniBench peaks at $2{:}1$. A balanced or audio-leaning allocation (e.g., $1{:}1$ or $1{:}2$) under-represents the dense visual cues, whereas an overly video-skewed allocation ($3{:}1$) starves the audio stream of speech, environmental sounds, and music that carry irreplaceable semantic information. These results indicate that a moderately video-leaning budget best matches the information density of real-world long-form audio-video content.

### 4.5 Efficiency Analyses

We compare the latency of AVOC against its backbone MiniCPM-o 4.5, which shares the identical architecture except for the compression module. For a fair comparison, both models are fed with the same 10-minute video uniformly sampled to 128 frames, together with the full accompanying audio stream. We evaluate AVOC under three token retention ratios ($\rho\in\{1.0, 0.5, 0.1\}$) to examine how latency varies with compression aggressiveness. All measurements are performed on a single NVIDIA A800 GPU with BF16 precision and FlashAttention-2, and we report the average time over multiple runs to mitigate measurement noise.

As summarized in Table [3](https://arxiv.org/html/2606.24286#S4.T3), the compression module introduces only modest overhead. Even at $\rho=1.0$, where no token is dropped, the compression module adds merely 1.834 s, yielding a slight Time-to-First-Token increase over the backbone. This confirms that the compression module is computationally lightweight relative to the LLM forward pass. More importantly, reducing the retention ratio yields substantial prefilling speedups: prefilling latency drops from 4.453 s (backbone) to 2.088 s at $\rho=0.5$ and 0.497 s at $\rho=0.1$, a nearly $9\times$ reduction. Meanwhile, the compression module's own cost scales nearly proportionally with $\rho$ (1.834 s, 0.929 s, and 0.260 s for $\rho=1.0$, $0.5$, and $0.1$), so more aggressive compression incurs smaller compression-time overhead, making AVOC particularly suitable for hour-level audio-video understanding scenarios that demand high compression ratios.

Table 3: Latency comparison between AVOC and its backbone MiniCPM-o 4.5. $\rho$ denotes the token retention ratio.
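The ratios quoted in the efficiency discussion can be rechecked with a few lines of arithmetic. The latencies are the ones stated in the text; the variable names are illustrative, and no other entries of Table 3 are assumed.

```python
# Latencies in seconds, as quoted in the text.
backbone_prefill_s = 4.453
avoc_prefill_s = {0.5: 2.088, 0.1: 0.497}
compression_s = {1.0: 1.834, 0.5: 0.929, 0.1: 0.260}

# Prefill speedup of AVOC over the backbone at each retention ratio.
prefill_speedup = {rho: backbone_prefill_s / t for rho, t in avoc_prefill_s.items()}

# Compression-module cost relative to its cost at rho = 1.0.
relative_compression_cost = {
    rho: t / compression_s[1.0] for rho, t in compression_s.items()
}

for rho, speedup in prefill_speedup.items():
    print(f"rho={rho}: prefill speedup {speedup:.2f}x")
for rho, rel in relative_compression_cost.items():
    print(f"rho={rho}: compression cost {rel:.2f} of the rho=1.0 cost")
```

The prefill speedups come out at about 2.13x for $\rho=0.5$ and 8.96x for $\rho=0.1$, matching the "nearly $9\times$" statement, while the compression cost falls to roughly 0.51 and 0.14 of its $\rho=1.0$ value, which is close to but not exactly proportional to $\rho$.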

## 5 Conclusion

We presented AVOC, a framework that enhances hour-level audio-video understanding in Omni-Modal LLMs through a learnable token compression module. Drawing on classical principles from information retrieval, the module instantiates three complementary criteria: text-guided relevance, bidirectional video-audio importance, and Temporal-Aware Maximal Marginal Relevance for local diversity. These criteria jointly guide the selection of a compact, informative token subset under a tight context budget. Experiments on multiple long-form audio-video benchmarks show that AVOC achieves state-of-the-art performance, surpassing the second-best method by up to 5.5 points in average accuracy, and maintains robust retrieval on the Audio-Video Needle-in-a-Haystack task at durations up to one hour. We hope AVOC offers a step toward Omni-Modal LLMs capable of reasoning over the rich, hour-long multimodal content that pervades real-world applications.

## References

- \[1\]H\. AlAmri, V\. Cartillier, A\. Das, J\. Wang, A\. Cherian, I\. Essa, D\. Batra, T\. K\. Marks, C\. Hori, P\. Anderson, S\. Lee, and D\. Parikh\(2019\)Audio visual scene\-aware dialog\.InIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16\-20, 2019,pp\. 7558–7567\.External Links:[Link](http://openaccess.thecvf.com/content%5C_CVPR%5C_2019/html/Alamri%5C_Audio%5C_Visual%5C_Scene-Aware%5C_Dialog%5C_CVPR%5C_2019%5C_paper.html)Cited by:[§4\.1](https://arxiv.org/html/2606.24286#S4.SS1.SSS0.Px1.p1.1)\.
- \[2\]J\. Carbonell and J\. Goldstein\(1998\)The use of mmr, diversity\-based reranking for reordering documents and producing summaries\.InProceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval,pp\. 335–336\.Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.24286#S3.SS1.p2.4),[§3\.4](https://arxiv.org/html/2606.24286#S3.SS4.p2.5)\.
- \[3\]J\. Chao, J\. Gao, W\. Tan, Y\. Sun, R\. Song, and L\. Ru\(2025\)JointAVBench: a benchmark for joint audio\-visual reasoning evaluation\.arXiv preprint arXiv:2512\.12772\.Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p1.1)\.
- \[4\]L\. Chen, H\. Zhao, T\. Liu, S\. Bai, J\. Lin, C\. Zhou, and B\. Chang\(2024\)An image is worth 1/2 tokens after layer 2: plug\-and\-play inference acceleration for large vision\-language models\.InComputer Vision \- ECCV 2024 \- 18th European Conference, Milan, Italy, September 29\-October 4, 2024, Proceedings, Part LXXXI,A\. Leonardis, E\. Ricci, S\. Roth, O\. Russakovsky, T\. Sattler, and G\. Varol \(Eds\.\),Lecture Notes in Computer Science,pp\. 19–35\.External Links:[Link](https://doi.org/10.1007/978-3-031-73004-7%5C_2),[Document](https://dx.doi.org/10.1007/978-3-031-73004-7%5F2)Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[5\]Y\. Chen, Y\. Wu, K\. Guan, Y\. Ren, Y\. Wang, R\. Song, and L\. Ru\(2025\)ChronusOmni: improving time awareness of omni large language models\.arXiv preprint arXiv:2512\.09841\.Cited by:[§4\.1](https://arxiv.org/html/2606.24286#S4.SS1.SSS0.Px1.p1.1)\.
- \[6\]Y\. Chen, W\. Huang, B\. Shi, Q\. Hu, H\. Ye, L\. Zhu, Z\. Liu, P\. Molchanov, J\. Kautz, X\. Qi, S\. Liu, H\. Yin, Y\. Lu, and S\. Han\(2025\)Scaling rl to long videos\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[7\]Y\. Chen, F\. Xue, D\. Li, Q\. Hu, L\. Zhu, X\. Li, Y\. Fang, H\. Tang, S\. Yang, Z\. Liu, Y\. He, H\. Yin, P\. Molchanov, J\. Kautz, L\. Fan, Y\. Zhu, Y\. Lu, and S\. Han\(2025\)LongVILA: scaling long\-context visual language models for long videos\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=wCXAlfvCy6)Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.24286#S4.SS1.SSS0.Px1.p1.1)\.
- \[8\]Z\. Cheng, S\. Leng, H\. Zhang, Y\. Xin, X\. Li, G\. Chen, Y\. Zhu, W\. Zhang, Z\. Luo, D\. Zhao,et al\.\(2024\)Videollama 2: advancing spatial\-temporal modeling and audio understanding in video\-llms\.arXiv preprint arXiv:2406\.07476\.Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p1.1),[Table 1](https://arxiv.org/html/2606.24286#S4.T1.1.1.4.3.1)\.
- \[9\]C\. L\. A\. Clarke, M\. Kolla, G\. V\. Cormack, O\. Vechtomova, A\. Ashkan, S\. Büttcher, and I\. MacKinnon\(2008\)Novelty and diversity in information retrieval evaluation\.InProceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Singapore, July 20\-24, 2008,S\. Myaeng, D\. W\. Oard, F\. Sebastiani, T\. Chua, and M\. Leong \(Eds\.\),pp\. 659–666\.External Links:[Link](https://doi.org/10.1145/1390334.1390446),[Document](https://dx.doi.org/10.1145/1390334.1390446)Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.24286#S3.SS1.p2.4)\.
- \[10\]J\. Cui, B\. Xu, C\. Wang, T\. Yu, W\. Sun, Y\. Xu, T\. Wang, Z\. He, W\. Ma, T\. Cai,et al\.\(2026\)MiniCPM\-o 4\.5: towards real\-time full\-duplex omni\-modal interaction\.arXiv preprint arXiv:2604\.27393\.Cited by:[§A\.2](https://arxiv.org/html/2606.24286#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.24286#S1.p1.1),[§3\.1](https://arxiv.org/html/2606.24286#S3.SS1.p1.8),[§4\.1](https://arxiv.org/html/2606.24286#S4.SS1.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.24286#S4.T1.1.1.10.9.1),[Table 1](https://arxiv.org/html/2606.24286#S4.T1.1.1.9.8.1)\.
- \[11\]Y\. Ding, Y\. Ji, J\. Li, X\. Liu, X\. Chen, J\. Wu, B\. Li, B\. Zeng, Y\. Shi, Y\. Guan, Y\. Zhang, J\. Liu, Q\. Liu, P\. Wan, and L\. Wang\(2026\)OmniSIFT: modality\-asymmetric token compression for efficient omni\-modal large language models\.CoRRabs/2602\.04804\.External Links:[Link](https://doi.org/10.48550/arXiv.2602.04804),[Document](https://dx.doi.org/10.48550/ARXIV.2602.04804),2602\.04804Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p2.1),[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px2.p1.1)\.
- \[12\]M\. Farré, A\. Marafioti, L\. Tunstall, L\. Von Werra, and T\. Wolf\(2024\)FineVideo\.Note:[https://huggingface\.co/datasets/HuggingFaceFV/finevideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo)Cited by:[§4\.1](https://arxiv.org/html/2606.24286#S4.SS1.SSS0.Px1.p1.1)\.
- \[13\]J\. Hong, S\. Yan, J\. Cai, X\. Jiang, Y\. Hu, and W\. Xie\(2025\)Worldsense: evaluating real\-world omnimodal understanding for multimodal llms\.arXiv preprint arXiv:2502\.04326\.Cited by:[§4\.2](https://arxiv.org/html/2606.24286#S4.SS2.SSS0.Px1.p1.1)\.
- \[14\]E\. Jang, S\. Gu, and B\. Poole\(2017\)Categorical reparameterization with gumbel\-softmax\.In5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24\-26, 2017, Conference Track Proceedings,External Links:[Link](https://openreview.net/forum?id=rkE3y85ee)Cited by:[§4\.1](https://arxiv.org/html/2606.24286#S4.SS1.SSS0.Px2.p1.5)\.
- \[15\]P\. Jin, R\. Takanobu, W\. Zhang, X\. Cao, and L\. Yuan\(2024\)Chat\-univi: unified visual representation empowers large language models with image and video understanding\.InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16\-22, 2024,pp\. 13700–13710\.External Links:[Link](https://doi.org/10.1109/CVPR52733.2024.01300),[Document](https://dx.doi.org/10.1109/CVPR52733.2024.01300)Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[16\]V\. Karpukhin, B\. Oguz, S\. Min, P\. Lewis, L\. Wu, S\. Edunov, D\. Chen, and W\. Yih\(2020\)Dense passage retrieval for open\-domain question answering\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16\-20, 2020,B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),pp\. 6769–6781\.Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.24286#S3.SS1.p2.4)\.
- \[17\]C\. Li, Y\. Chen, Y\. Ji, J\. Xu, Z\. Cui, S\. Li, Y\. Zhang, W\. Wang, Z\. Song, D\. Zhang,et al\.\(2025\)Omnivideobench: towards audio\-visual understanding evaluation for omni mllms\.arXiv preprint arXiv:2510\.10689\.Cited by:[§4\.2](https://arxiv.org/html/2606.24286#S4.SS2.SSS0.Px1.p1.1)\.
- \[18\]G\. Li, Y\. Wei, Y\. Tian, C\. Xu, J\. Wen, and D\. Hu\(2022\)Learning to answer questions in dynamic audio\-visual scenarios\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 19108–19118\.Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p1.1)\.
- \[19\]X\. Li, Y\. Wang, J\. Yu, X\. Zeng, Y\. Zhu, H\. Huang, J\. Gao, K\. Li, Y\. He, C\. Wang, Y\. Qiao, Y\. Wang, and L\. Wang\(2025\)VideoChat\-flash: hierarchical compression for long\-context video modeling\.CoRRabs/2501\.00574\.External Links:[Link](https://doi.org/10.48550/arXiv.2501.00574),[Document](https://dx.doi.org/10.48550/ARXIV.2501.00574),2501\.00574Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[20\]Y\. Li, J\. Liu, T\. Zhang, S\. Chen, T\. Li, Z\. Li, L\. Liu, L\. Ming, G\. Dong, D\. Pan,et al\.\(2025\)Baichuan\-omni\-1\.5 technical report\.arXiv preprint arXiv:2501\.15368\.Cited by:[Table 1](https://arxiv.org/html/2606.24286#S4.T1.1.1.5.4.1)\.
- \[21\]Y\. Li, C\. Wang, and J\. Jia\(2024\)LLaMA\-vid: an image is worth 2 tokens in large language models\.InComputer Vision \- ECCV 2024 \- 18th European Conference, Milan, Italy, September 29\-October 4, 2024, Proceedings, Part XLVI,A\. Leonardis, E\. Ricci, S\. Roth, O\. Russakovsky, T\. Sattler, and G\. Varol \(Eds\.\),Lecture Notes in Computer Science,pp\. 323–340\.External Links:[Link](https://doi.org/10.1007/978-3-031-72952-2%5C_19),[Document](https://dx.doi.org/10.1007/978-3-031-72952-2%5F19)Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[22\]Y\. Li, G\. Zhang, Y\. Ma, R\. Yuan, K\. Zhu, H\. Guo, Y\. Liang, J\. Liu, J\. Yang, S\. Wu, X\. Qu, J\. Shi, X\. Zhang, Z\. Yang, X\. Wang, Z\. Zhang, Z\. Liu, E\. Benetos, W\. Huang, and C\. Lin\(2024\)OmniBench: towards the future of universal omni\-language models\.CoRRabs/2409\.15272\.External Links:[Link](https://doi.org/10.48550/arXiv.2409.15272),[Document](https://dx.doi.org/10.48550/ARXIV.2409.15272),2409\.15272Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p1.1)\.
- \[23\]H\. Liu, W\. Yan, M\. Zaharia, and P\. Abbeel\(2025\)World model on million\-length video and language with blockwise ringattention\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[24\]Z\. Liu, Y\. Dong, J\. Wang, Z\. Liu, W\. Hu, J\. Lu, and Y\. Rao\(2025\)Ola: pushing the frontiers of omni\-modal language model with progressive modality alignment\.CoRRabs/2502\.04328\.External Links:[Link](https://doi.org/10.48550/arXiv.2502.04328),[Document](https://dx.doi.org/10.48550/ARXIV.2502.04328),2502\.04328Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px2.p1.1)\.
- \[25\]M\. Maaz, H\. A\. Rasheed, S\. Khan, and F\. Khan\(2024\)Video\-chatgpt: towards detailed video understanding via large vision and language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),pp\. 12585–12602\.External Links:[Link](https://doi.org/10.18653/v1/2024.acl-long.679),[Document](https://dx.doi.org/10.18653/V1/2024.ACL-LONG.679)Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[26\]L\. Page, S\. Brin, R\. Motwani, and T\. Winograd\(1999\)The pagerank citation ranking: bring order to the web\.InProc\. of the 7th International World Wide Web Conf\.–1998,Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.24286#S3.SS1.p2.4),[§3\.3](https://arxiv.org/html/2606.24286#S3.SS3.p1.1)\.
- \[27\]S\. E\. Robertson and H\. Zaragoza\(2009\)The probabilistic relevance framework: BM25 and beyond\.Found\. Trends Inf\. Retr\.3\(4\),pp\. 333–389\.Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.24286#S3.SS1.p2.4)\.
- \[28\]R\. Sanabria, O\. Caglayan, S\. Palaskar,et al\.\(2018\)How2: A large\-scale dataset for multimodal language understanding\.CoRRabs/1811\.00347\.Cited by:[§4\.1](https://arxiv.org/html/2606.24286#S4.SS1.SSS0.Px1.p1.1)\.
- \[29\]K\. Shao, K\. Tao, K\. Zhang, S\. Feng, M\. Cai, Y\. Shang, H\. You, C\. Qin, Y\. Sui, and H\. Wang\(2025\)When tokens talk too much: A survey of multimodal long\-context token compression across images, videos, and audios\.CoRRabs/2507\.20198\.External Links:[Link](https://doi.org/10.48550/arXiv.2507.20198),[Document](https://dx.doi.org/10.48550/ARXIV.2507.20198),2507\.20198Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[30\]X\. Shen, Y\. Xiong, C\. Zhao, L\. Wu, J\. Chen, C\. Zhu, Z\. Liu, F\. Xiao, B\. Varadarajan, F\. Bordes, Z\. Liu, H\. Xu, H\. J\. Kim, B\. Soran, R\. Krishnamoorthi, M\. Elhoseiny, and V\. Chandra\(2025\)LongVU: spatiotemporal adaptive compression for long video\-language understanding\.InForty\-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13\-19, 2025,A\. Singh, M\. Fazel, D\. Hsu, S\. Lacoste\-Julien, F\. Berkenkamp, T\. Maharaj, K\. Wagstaff, and J\. Zhu \(Eds\.\),Proceedings of Machine Learning Research\.Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[31\]Y\. Shu, Z\. Liu, P\. Zhang, M\. Qin, J\. Zhou, Z\. Liang, T\. Huang, and B\. Zhao\(2025\)Video\-xl: extra\-long vision language model for hour\-scale video understanding\.InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11\-15, 2025,pp\. 26160–26169\.External Links:[Link](https://openaccess.thecvf.com/content/CVPR2025/html/Shu%5C_Video-XL%5C_Extra-Long%5C_Vision%5C_Language%5C_Model%5C_for%5C_Hour-Scale%5C_Video%5C_Understanding%5C_CVPR%5C_2025%5C_paper.html),[Document](https://dx.doi.org/10.1109/CVPR52734.2025.02436)Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[32\]E\. Song, W\. Chai, G\. Wang, Y\. Zhang, H\. Zhou, F\. Wu, H\. Chi, X\. Guo, T\. Ye, Y\. Zhang, Y\. Lu, J\. Hwang, and G\. Wang\(2024\)MovieChat: from dense token to sparse memory for long video understanding\.InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16\-22, 2024,pp\. 18221–18232\.External Links:[Link](https://doi.org/10.1109/CVPR52733.2024.01725),[Document](https://dx.doi.org/10.1109/CVPR52733.2024.01725)Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[33\]W\. Tan, X\. Yu, J\. Li, Y\. Chen, J\. Ju, Z\. Luo, R\. Song, and J\. Luan\(2026\)MSJoE: jointly evolving mllm and sampler for efficient long\-form video understanding\.arXiv preprint arXiv:2602\.22932\.Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[34\]C\. Tang, Y\. Li, Y\. Yang, J\. Zhuang, G\. Sun, W\. Li, Z\. Ma, and C\. Zhang\(2025\)Video\-salmonn 2: caption\-enhanced audio\-visual large language models\.arXiv preprint arXiv:2506\.15220\.Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p1.1),[Table 1](https://arxiv.org/html/2606.24286#S4.T1.1.1.8.7.1)\.
- \[35\]K\. Tao, K\. Shao, B\. Yu, W\. Wang, J\. Liu, and H\. Wang\(2025\)OmniZip: audio\-guided dynamic token compression for fast omnimodal large language models\.CoRRabs/2511\.14582\.External Links:[Link](https://doi.org/10.48550/arXiv.2511.14582),[Document](https://dx.doi.org/10.48550/ARXIV.2511.14582),2511\.14582Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p2.1),[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.24286#S4.T1.1.1.11.10.1)\.
- \[36\]K\. Tao, Y\. Zheng, J\. Xu, W\. Du, K\. Shao, H\. Wang, X\. Chen, X\. Jin, J\. Zhu, B\. Yu,et al\.\(2026\)LVOmniBench: pioneering long audio\-video understanding evaluation for omnimodal llms\.arXiv preprint arXiv:2603\.19217\.Cited by:[§4\.2](https://arxiv.org/html/2606.24286#S4.SS2.SSS0.Px1.p1.1)\.
- \[37\]Q\. Team\(2025\)Qwen3\-omni technical report\.CoRRabs/2509\.17765\.External Links:[Link](https://doi.org/10.48550/arXiv.2509.17765),[Document](https://dx.doi.org/10.48550/ARXIV.2509.17765),2509\.17765Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p1.1),[§3\.1](https://arxiv.org/html/2606.24286#S3.SS1.p1.8)\.
- \[38\]H\. Wei and Z\. Chen\(2025\)Visual context window extension: A new perspective for long video understanding\.InProceedings of the 33rd ACM International Conference on Multimedia, MM 2025, Dublin, Ireland, October 27\-31, 2025,C\. Gurrin, K\. Schoeffmann, M\. Zhang, L\. Rossetto, S\. Rudinac, D\. Dang\-Nguyen, W\. Cheng, P\. Chen, and J\. Benois\-Pineau \(Eds\.\),pp\. 4281–4289\.External Links:[Link](https://doi.org/10.1145/3746027.3755383),[Document](https://dx.doi.org/10.1145/3746027.3755383)Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[39\]Y\. Weng, M\. Han, H\. He, X\. Chang, and B\. Zhuang\(2024\)LongVLM: efficient long video understanding via large language models\.InComputer Vision \- ECCV 2024 \- 18th European Conference, Milan, Italy, September 29\-October 4, 2024, Proceedings, Part XXXIII,A\. Leonardis, E\. Ricci, S\. Roth, O\. Russakovsky, T\. Sattler, and G\. Varol \(Eds\.\),Lecture Notes in Computer Science,pp\. 453–470\.External Links:[Link](https://doi.org/10.1007/978-3-031-73414-4%5C_26),[Document](https://dx.doi.org/10.1007/978-3-031-73414-4%5F26)Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[40\]J\. Xu, Z\. Guo, J\. He, H\. Hu, T\. He, S\. Bai, K\. Chen, J\. Wang, Y\. Fan, K\. Dang, B\. Zhang, X\. Wang, Y\. Chu, and J\. Lin\(2025\)Qwen2\.5\-omni technical report\.CoRRabs/2503\.20215\.External Links:[Link](https://doi.org/10.48550/arXiv.2503.20215),[Document](https://dx.doi.org/10.48550/ARXIV.2503.20215),2503\.20215Cited by:[§A\.2](https://arxiv.org/html/2606.24286#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.24286#S1.p1.1),[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.24286#S3.SS1.p1.8),[Table 1](https://arxiv.org/html/2606.24286#S4.T1.1.1.7.6.1)\.
- \[41\]S\. Yang, Y\. Chen, Z\. Tian, C\. Wang, J\. Li, B\. Yu, and J\. Jia\(2025\)Visionzip: longer is better but not necessary in vision language models\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 19792–19802\.Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[42\]H\. Ye, C\. H\. Yang, A\. Goel, W\. Huang, L\. Zhu, Y\. Su, S\. Lin, A\. Cheng, Z\. Wan, J\. Tian, Y\. Lou, D\. Yang, Z\. Liu, Y\. Chen, A\. Dantrey, E\. Jahangiri, S\. Ghosh, D\. Xu, E\. Hosseini\-Asl, D\. Mohseni\-Taheri, V\. Murali, S\. Liu, Y\. Lu, O\. Olabiyi, Y\. F\. Wang, R\. Valle, B\. Catanzaro, A\. Tao, S\. Han, J\. Kautz, H\. Yin, and P\. Molchanov\(2025\)OmniVinci: enhancing architecture and data for omni\-modal understanding LLM\.CoRRabs/2510\.15870\.External Links:[Link](https://doi.org/10.48550/arXiv.2510.15870),[Document](https://dx.doi.org/10.48550/ARXIV.2510.15870),2510\.15870Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.24286#S3.SS1.p1.8)\.
- \[43\]C\. Zhang, K\. Ma, T\. Fang, W\. Yu, H\. Zhang, Z\. Zhang, H\. Mi, and D\. Yu\(2026\)VScan: rethinking visual token reduction for efficient large vision\-language models\.Trans\. Mach\. Learn\. Res\.2026\.External Links:[Link](https://openreview.net/forum?id=KZYhyilFnt)Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[44\]P\. Zhang, K\. Zhang, B\. Li, G\. Zeng, J\. Yang, Y\. Zhang, Z\. Wang, H\. Tan, C\. Li, and Z\. Liu\(2025\)Long context transfer from language to vision\.Trans\. Mach\. Learn\. Res\.2025\.External Links:[Link](https://openreview.net/forum?id=30RAWQVGlx)Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[45\]Q\. Zhang, A\. Cheng, M\. Lu, R\. Zhang, Z\. Zhuo, J\. Cao, S\. Guo, Q\. She, and S\. Zhang\(2025\)Beyond text\-visual attention: exploiting visual cues for effective token pruning in vlms\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 20857–20867\.Cited by:[§2](https://arxiv.org/html/2606.24286#S2.SS0.SSS0.Px1.p1.1)\.
- \[46\]J\. Zhao, Q\. Yang, Y\. Peng, D\. Bai, S\. Yao, B\. Sun, X\. Chen, S\. Fu, X\. Wei, L\. Bo,et al\.\(2025\)Humanomni: a large vision\-speech language model for human\-centric video understanding\.arXiv preprint arXiv:2501\.15111\.Cited by:[Table 1](https://arxiv.org/html/2606.24286#S4.T1.1.1.6.5.1)\.
- \[47\]Z\. Zhou, R\. Wang, Z\. Wu, and Y\. Jiang\(2025\)Daily\-omni: towards audio\-visual reasoning with temporal alignment across modalities\.arXiv preprint arXiv:2505\.17862\.Cited by:[§1](https://arxiv.org/html/2606.24286#S1.p1.1)\.

## Appendix A Additional Details and Results on Audio-Video Needle-in-a-Haystack

### A.1 Evaluation Setting Details.

We provide additional details on the construction and evaluation protocol of the Audio-Video Needle-in-a-Haystack (AV-NIAH) task introduced in Section [4.3](https://arxiv.org/html/2606.24286#S4.SS3).

#### Haystack source\.

We use a long\-form audio\-video clip drawn from LVOmniBench with a total duration exceeding 60 minutes as the haystack\. To construct samples of varying lengths, we truncate the clip to target durations ranging from 100s to 3600s with a step size of 100s, yielding 26 duration settings in total\. The accompanying audio stream is preserved in alignment with the truncated video\.

#### Needle generation\.

Each needle carries a secret keyword instantiated as a randomly generated 6\-digit numeric string\. We sample 5 independent needles in total, and report the average accuracy across these 5 samples for each \(duration, depth\) cell to mitigate the variance introduced by individual needle realizations\. The needle is rendered in two modalities, evaluated separately:

\(1\) Vision needle\. The keyword is rendered as the caption "The secret word is <needle\>" and overlaid on a single video frame at the target temporal position\.

\(2\) Audio needle\. An audio clip reading "The secret word is <needle\>" is synthesized using Qwen3\-TTS\. To ensure the audio needle blends naturally into the haystack rather than standing out as an acoustic outlier, we normalize the loudness of the synthesized clip to match the average volume of the haystack audio stream before splicing it in at the target temporal position\.
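The two steps above, sampling a 6-digit keyword and matching the needle's loudness to the haystack before splicing, can be sketched as follows. This is a minimal illustration rather than the authors' pipeline. The function names are hypothetical, RMS is assumed as the loudness measure (the text says only "average volume"), leading zeros are assumed to be allowed in the keyword, and the needle is assumed to overwrite haystack samples rather than be mixed with them.

```python
import math
import random


def make_needle(rng: random.Random) -> str:
    """Return a random 6-digit numeric string (leading zeros allowed: an assumption)."""
    return f"{rng.randrange(10**6):06d}"


def rms(samples: list[float]) -> float:
    """Root-mean-square level of a mono waveform (assumed loudness measure)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def splice_audio_needle(
    haystack: list[float], needle: list[float], depth: float
) -> list[float]:
    """Scale `needle` to the haystack's RMS level and overwrite the haystack at `depth`.

    `depth` is in [0, 1]. The start index is chosen so that the needle fits
    entirely inside the haystack even at depth 1.0 (an assumption).
    """
    gain = rms(haystack) / rms(needle)  # assumes the needle is not silent
    scaled = [s * gain for s in needle]
    start = round(depth * (len(haystack) - len(scaled)))
    return haystack[:start] + scaled + haystack[start + len(scaled):]
```

In practice the waveforms would be arrays loaded from the clip and the TTS output; plain lists keep the sketch dependency-free.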

#### Duration–depth grid\.

For each duration setting, we vary the relative needle depth \(i\.e\., the normalized temporal position of the needle within the clip\) over 11 evenly spaced values from 0\.0 to 1\.0 with a step size of 0\.1\. This yields a 26 × 11 \(duration × depth\) evaluation grid for each modality\. At each cell, we average over the 5 needle samples described above\.
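As a rough sketch of how such a grid could be enumerated and scored, the snippet below builds the 11 depth values, maps each (duration, depth) cell to a needle timestamp, and averages correctness over the needle samples in a cell. The helper names are hypothetical, the list of durations is supplied by the caller, and exact string match is assumed as the correctness criterion, which the text does not specify.

```python
# 11 evenly spaced relative depths: 0.0, 0.1, ..., 1.0
DEPTHS = [i / 10 for i in range(11)]


def needle_time(duration_s: float, depth: float) -> float:
    """Temporal position (seconds) of the needle in a clip of `duration_s`."""
    return depth * duration_s


def build_grid(durations_s: list[float]) -> list[tuple[float, float, float]]:
    """Enumerate (duration, depth, needle_time) for every cell of the grid."""
    return [(d, z, needle_time(d, z)) for d in durations_s for z in DEPTHS]


def cell_accuracy(predictions: list[str], needles: list[str]) -> float:
    """Fraction of needle samples answered correctly (exact match: an assumption)."""
    hits = sum(p == n for p, n in zip(predictions, needles))
    return hits / len(needles)
```

For example, a needle at depth 0.5 in a 3600 s clip is placed at 1800 s, and a cell in which 2 of the 5 needle samples are recovered scores 0.4.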

#### Inference protocol\.

AV\-NIAH does not impose a maximum frame number or audio length cap\. For both AVOC and all baselines, videos are uniformly sampled at 1 FPS and the full accompanying audio stream is fed into the model, ensuring that the needle is never discarded by preprocessing\-stage subsampling\. For AVOC, the compression module is activated with global token budget $K=25000$, modality allocation $K_{\text{video}}:K_{\text{audio}}=2{:}1$, TA\-MMR diversity coefficient $\lambda=0.15$, and local temporal window radius $W=3$\. The model is prompted with the query “What is the secret number?” and asked to localize and extract the digit string from the target modality\. Vision and audio needles are evaluated in independent runs\.
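To make the budget arithmetic concrete, the sketch below splits a global budget of 25000 tokens in a 2:1 video-to-audio ratio and counts the frames produced by 1 FPS sampling. The function names are hypothetical, and the rounding rule (floor the audio share, give the remainder to video) is an assumption, since the text does not say how the non-integer split is resolved.

```python
def split_budget(total: int, video_parts: int = 2, audio_parts: int = 1) -> tuple[int, int]:
    """Split a global token budget between video and audio by an integer ratio.

    The audio share is floored and video receives the remainder (an assumption),
    so the two budgets always sum to `total`.
    """
    k_audio = total * audio_parts // (video_parts + audio_parts)
    k_video = total - k_audio
    return k_video, k_audio


def num_sampled_frames(duration_s: int, fps: int = 1) -> int:
    """Number of frames produced by uniform sampling at `fps`."""
    return duration_s * fps


k_video, k_audio = split_budget(25000)
print(k_video, k_audio)          # 16667 8333
print(num_sampled_frames(3600))  # 3600
```

Under these assumptions the budgets stay fixed at 16667 video tokens and 8333 audio tokens regardless of clip length, while the number of sampled frames grows linearly with duration, so the compression ratio increases as clips get longer.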

### A\.2 Extended Baselines Results\.

![Refer to caption](https://arxiv.org/html/2606.24286v1/x6.png) Figure 6: Extended Audio\-Video Needle\-in\-a\-Haystack results across four models\. Each row corresponds to one model, and each cell reports retrieval accuracy at a given audio\-video duration \(x\-axis, 100s–3600s\) and relative needle depth \(y\-axis, 0\.0–1\.0\)\. The left and right columns report vision needle and audio needle retrieval accuracy, respectively\.

To provide a more comprehensive view of fine\-grained retrieval capability across long audio\-video durations, we extend the AV\-NIAH evaluation in Section [4\.3](https://arxiv.org/html/2606.24286#S4.SS3) to two additional baselines: MiniCPM\-o 4\.5 Cui et al\. \([2026](https://arxiv.org/html/2606.24286#bib.bib26)\) \(the backbone of AVOC\) and Qwen2\.5\-Omni\-7B Xu et al\. \([2025](https://arxiv.org/html/2606.24286#bib.bib22)\)\. The complete results across all four models are presented in Figure [6](https://arxiv.org/html/2606.24286#A1.F6), where each row corresponds to one model and the two columns report the vision needle and audio needle retrieval accuracy, respectively\.

#### MiniCPM\-o 4\.5\.

As shown in Figure [6](https://arxiv.org/html/2606.24286#A1.F6)\(a, b\), MiniCPM\-o 4\.5 exhibits a severe and immediate context\-window collapse: successful retrieval is restricted to durations below approximately 300 seconds for both vision and audio needles, and accuracy drops to near zero across the entire duration\-depth grid beyond this threshold\. This collapse stems from the rigid context\-window constraint of the backbone, which cannot accommodate the dense token sequences produced by hour\-level audio\-video streams without aggressive content\-agnostic truncation\.

#### Qwen2\.5\-Omni\.

As shown in Figure[6](https://arxiv.org/html/2606.24286#A1.F6)\(c, d\), Qwen2\.5\-Omni extends the effective retrieval range substantially compared to MiniCPM\-o 4\.5, but still exhibits a clear duration\-induced degradation\. On the vision needle, accuracy degrades noticeably beyond 1500s and collapses to near\-zero beyond 2300s\. The audio needle is comparatively more robust, sustaining moderate accuracy up to around 2500s before degrading sharply at longer durations\.

#### OmniZip\.

As shown in Figure [6](https://arxiv.org/html/2606.24286#A1.F6)\(e, f\), OmniZip extends the effective vision\-needle retrieval range, a moderate improvement over Qwen2\.5\-Omni\. However, as the duration approaches one hour, vision\-needle accuracy degrades substantially across nearly all depths, and audio\-needle accuracy drops, particularly beyond 3000s at shallow\-to\-mid depths\.

#### AVOC\.

In contrast, as shown in Figure[6](https://arxiv.org/html/2606.24286#A1.F6)\(g, h\), AVOC maintains consistently high retrieval accuracy across the entire 100s–3600s duration range and across all needle depths for both vision and audio needles, with only minor degradation appearing at isolated cells beyond 3000s\. This demonstrates AVOC’s robust fine\-grained information localization capability over hour\-level audio\-video streams\.

## Appendix B Limitations

Despite the strong empirical results, AVOC has several limitations that point to directions for future work\. First, the compression module operates in an offline manner: the entire audio\-video stream must be available before scoring and selection can be performed, which prevents AVOC from being directly applied to streaming scenarios where tokens arrive incrementally\. Extending the framework to causal or chunk\-wise online settings is a natural next step\. Second, our experiments are conducted at a single, relatively small parameter scale; whether the proposed compression mechanism scales gracefully to larger backbones remains to be verified\. Third, the modality token budget allocation ratio $K_{\text{video}}:K_{\text{audio}}$ is currently set as a fixed hyperparameter\. A content\-adaptive allocation that dynamically adjusts the ratio based on per\-sample audio\-video information density could further improve robustness across diverse multimodal distributions\.