LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning


Summary

LatentOmni proposes a unified latent space for audio-visual reasoning, avoiding the information loss of text-based chain-of-thought. It achieves state-of-the-art performance among open-source models on audio-visual reasoning benchmarks.

arXiv:2605.22012v1 Announce Type: new

# Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
Source: [https://arxiv.org/html/2605.22012](https://arxiv.org/html/2605.22012)

Yifan Dai1,2, Zhenhua Wu2, Bohan Zeng3,2, Daili Hua3, Jialing Liu7, Bozhou Li3,2, Yuran Wang3,2, Chengzhuo Tong3,2, Hao Liang3, Xiaochen Ma4, Junbo Niu3, Tianyu Guo3, Yang Shi3,2, Yue Ding5,2, Yiyan Ji6,2, Bingyin Mei8, Yushuo Guan2, Yuanxing Zhang2, Pengfei Wan2, Fangcheng Fu1, Wentao Zhang3

1 School of AI, Shanghai Jiao Tong University, 2 Kling Team, Kuaishou Technology, 3 Peking University, 4 HKUST, 5 CASIA, 6 Nanjing University, 7 Renmin University of China, 8 Tsinghua University

###### Abstract

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.

## 1 Introduction

Information in the real world is inherently multimodal\[[14](https://arxiv.org/html/2605.22012#bib.bib45),[57](https://arxiv.org/html/2605.22012#bib.bib44)\], and artificial agents must jointly interpret what they see and hear to understand events, causality, and context\[[58](https://arxiv.org/html/2605.22012#bib.bib1),[1](https://arxiv.org/html/2605.22012#bib.bib2),[54](https://arxiv.org/html/2605.22012#bib.bib60),[48](https://arxiv.org/html/2605.22012#bib.bib10)\]\. Recent multimodal large language models \(MLLMs\) have made notable progress on audio\-visual perception tasks such as captioning and grounding\[[3](https://arxiv.org/html/2605.22012#bib.bib3),[53](https://arxiv.org/html/2605.22012#bib.bib59),[4](https://arxiv.org/html/2605.22012#bib.bib9),[30](https://arxiv.org/html/2605.22012#bib.bib62),[7](https://arxiv.org/html/2605.22012#bib.bib4),[43](https://arxiv.org/html/2605.22012#bib.bib8)\], yet they remain constrained on reasoning problems that require integrating fine\-grained evidence across modalities\[[18](https://arxiv.org/html/2605.22012#bib.bib46),[40](https://arxiv.org/html/2605.22012#bib.bib17)\]\. This gap matters because audio\-visual understanding depends not only on recognizing individual signals, but also on reasoning over their temporal and semantic interactions\.

We identify a key bottleneck in how current MLLMs perform reasoning\. Most existing approaches rely on explicit or structured text\-based chain\-of\-thought \(CoT\)\[[38](https://arxiv.org/html/2605.22012#bib.bib13),[36](https://arxiv.org/html/2605.22012#bib.bib14),[28](https://arxiv.org/html/2605.22012#bib.bib15),[56](https://arxiv.org/html/2605.22012#bib.bib16)\], which maps high\-dimensional audio\-visual evidence into discrete text tokens\. This textual bottleneck compresses away temporally aligned details and encourages the model to lean on language priors rather than native sensory evidence during reasoning\. As illustrated in Figure[1](https://arxiv.org/html/2605.22012#S1.F1), pure explicit text CoT therefore tends to under\-attend to the original audio\-visual inputs, limiting the model’s ability to exploit fine\-grained cross\-modal cues such as temporal synchronization\.

![Refer to caption](https://arxiv.org/html/2605.22012v1/x1.png) Figure 1: Comparison between LatentOmni and the Explicit Text CoT baseline (detailed in Section [4.1](https://arxiv.org/html/2605.22012#S4.SS1)). (Left) Qualitatively, unlike the baseline, LatentOmni accurately anchors on key audio-visual (AV) clues (indicated by heatmaps) to answer correctly. (Right) Quantitatively, it maintains a significantly higher AV token attention ratio across tasks on the Daily-Omni benchmark, ensuring robust grounding of original modalities.

We argue that this bottleneck can be mitigated by preserving part of the reasoning process in continuous latent space, where fine-grained audio-visual features are more directly retained than in discretized textual explanations. Motivated by this perspective, we propose LatentOmni, a post-training framework that interleaves textual reasoning with audio-visual latent states in a unified latent space. To keep reasoning grounded in the original modalities, LatentOmni introduces feature-level supervision that aligns latent reasoning states with task-relevant audio-visual segments, encouraging the model to retain and attend to native sensory evidence throughout the reasoning process. To preserve temporal consistency across modalities, we further introduce Omni-Sync Position Embedding (OSPE), which extends the time-aligned multimodal RoPE\[[42](https://arxiv.org/html/2605.22012#bib.bib43)\] to synchronized latent audio and visual features. Together, these designs enable latent states to serve as a dense bridge between audio, vision, and text while retaining the structural benefits of textual reasoning.

Implementing feature-level supervision within the latent space requires CoT data with pre-annotated, reasoning-relevant audio-visual segments, a form of supervision largely missing from current audio-visual instruction datasets. These datasets typically provide coarse question-answer pairs or textual rationales, without localizing the visual frames and audio intervals that support each reasoning step. To fill this gap, we develop a scalable data curation pipeline that synthesizes audio-video interleaved reasoning trajectories, and we construct LatentOmni-Instruct-35K, a high-quality dataset specifically tailored for cross-modal reasoning tasks.

As illustrated in Fig\.[1](https://arxiv.org/html/2605.22012#S1.F1), compared to purely explicit CoT reasoning methods, LatentOmni substantially improves attention to the original audio\-visual \(AV\) modalities, particularly on AV alignment tasks\. Furthermore, extensive experiments demonstrate that LatentOmni achieves the best results among the evaluated open\-source models on all four benchmarks, outperforming both the base model and the explicit text CoT baseline by a clear margin\. In brief, our contributions are summarized as follows:

- We propose LatentOmni, a novel audio-visual reasoning framework that equips MLLMs with a tailored post-training pipeline to conduct joint reasoning in a unified latent space.
- We introduce explicit feature-level supervision in latent space and Omni-Sync Position Embedding (OSPE) to facilitate cross-modal temporal alignment, which preserves attention to the audio-visual modalities and bridges audio-visual and textual semantics.
- We develop a novel audio-visual interleaved CoT data synthesis pipeline and construct LatentOmni-Instruct-35K, a high-quality dataset that fills the gap in tailored training data for complex cross-modal latent reasoning.
- Our extensive experiments show that LatentOmni substantially outperforms the Explicit Text CoT baseline and achieves the best results among the evaluated open-source models on four challenging benchmarks, supporting its promise for robust multimodal understanding.

## 2 Related work

### 2.1 Multimodal Large Language Models Reasoning

Multimodal Large Language Models \(MLLMs\) originally aimed to equip LLMs with diverse perceptual capabilities\[[11](https://arxiv.org/html/2605.22012#bib.bib18),[19](https://arxiv.org/html/2605.22012#bib.bib19),[29](https://arxiv.org/html/2605.22012#bib.bib61),[37](https://arxiv.org/html/2605.22012#bib.bib5)\]; however, to tackle complex real\-world tasks, research has progressively shifted toward enhancing their reasoning abilities\. A prevailing paradigm to achieve this is leveraging explicit chain techniques\[[36](https://arxiv.org/html/2605.22012#bib.bib14),[28](https://arxiv.org/html/2605.22012#bib.bib15),[39](https://arxiv.org/html/2605.22012#bib.bib22),[23](https://arxiv.org/html/2605.22012#bib.bib24),[34](https://arxiv.org/html/2605.22012#bib.bib6)\]\. By establishing text as the primary semantic bridge for cross\-modal integration, these models can effectively decompose complex tasks via natural language rationales\[[8](https://arxiv.org/html/2605.22012#bib.bib23)\]\. This text\-centric reasoning approach has demonstrated encouraging progress in individual visual and audio domains, and has now naturally extended to drive recent omnimodal frameworks like Gemini\[[33](https://arxiv.org/html/2605.22012#bib.bib21)\], Video\-LLaMA series\[[51](https://arxiv.org/html/2605.22012#bib.bib20)\], and the Qwen\-Omni series\[[42](https://arxiv.org/html/2605.22012#bib.bib43)\]\.

Despite its widespread adoption, recent research reveals that this discrete reasoning paradigm fundamentally constrains complex cross\-modal inference\[[24](https://arxiv.org/html/2605.22012#bib.bib29),[55](https://arxiv.org/html/2605.22012#bib.bib63)\]\. Forcing high\-dimensional audio\-visual signals through a narrow textual bottleneck inevitably causes information loss\. Furthermore, this text\-centric abstraction results in insufficient attention to raw audio\-visual signals\. This imbalance leads to sensory detachment and multimodal hallucinations, where generated rationales decouple from the actual underlying evidence\[[26](https://arxiv.org/html/2605.22012#bib.bib31),[9](https://arxiv.org/html/2605.22012#bib.bib27)\]\. Although recent tool\-augmented approaches \(e\.g\., think with audio, image and video\)\[[41](https://arxiv.org/html/2605.22012#bib.bib25),[31](https://arxiv.org/html/2605.22012#bib.bib32),[52](https://arxiv.org/html/2605.22012#bib.bib26),[44](https://arxiv.org/html/2605.22012#bib.bib7)\]attempt to mitigate this, they fail to fundamentally resolve the inherent neglect of cross\-modal inputs\. Consequently, these limitations severely impede the scalability of explicit CoT reasoning\[[16](https://arxiv.org/html/2605.22012#bib.bib30)\]\.

### 2.2 Reasoning in Latent Space

To mitigate the constraints of discrete token generation, recent studies have explored conducting reasoning directly within continuous latent spaces\[[12](https://arxiv.org/html/2605.22012#bib.bib39),[13](https://arxiv.org/html/2605.22012#bib.bib33),[49](https://arxiv.org/html/2605.22012#bib.bib40)\]\. As a pioneering work in this direction, Coconut\[[13](https://arxiv.org/html/2605.22012#bib.bib33)\]bypasses the autoregressive generation of intermediate textual tokens by executing reasoning steps entirely within the model’s hidden states\. This continuous reasoning paradigm has subsequently been extended to the multimodal domain to better accommodate continuous real\-world sensory signals\[[2](https://arxiv.org/html/2605.22012#bib.bib42)\]\. In this context, current research generally follows two mainstream methodologies: some works design specific training frameworks\[[17](https://arxiv.org/html/2605.22012#bib.bib35),[35](https://arxiv.org/html/2605.22012#bib.bib36),[22](https://arxiv.org/html/2605.22012#bib.bib37)\]to optimize reasoning trajectories within the latent space, while others develop training\-free inference mechanisms\[[20](https://arxiv.org/html/2605.22012#bib.bib38)\]to elicit latent reasoning capabilities directly from pre\-trained representations\.

Despite these advances, existing latent reasoning methods predominantly focus on pure text or single\-modality extensions, such as visual\-textual integration\[[35](https://arxiv.org/html/2605.22012#bib.bib36),[17](https://arxiv.org/html/2605.22012#bib.bib35),[20](https://arxiv.org/html/2605.22012#bib.bib38),[27](https://arxiv.org/html/2605.22012#bib.bib41)\]\. The joint comprehension and reasoning of dynamic Audio\-Visual \(AV\) signals within a unified continuous space remains underexplored\. Recognizing this gap, our work introduces LatentOmni to extend continuous latent reasoning to omnimodal scenarios, explicitly addressing the temporal and semantic alignment of cross\-modal AV integration\.

## 3 Method

We present LatentOmni, a post\-training framework for audio\-visual reasoning in a unified latent space\. As illustrated in Fig\.[2](https://arxiv.org/html/2605.22012#S3.F2), the framework combines interleaved text\-latent reasoning, synchronized audio\-visual latent representations, a dedicated interleaved reasoning dataset, and training objectives that ground latent states in native sensory evidence\. We first describe the reasoning process and latent representation design, then present the data synthesis pipeline and the training objectives\.

![Refer to caption](https://arxiv.org/html/2605.22012v1/x2.png) Figure 2: Overview of LatentOmni. Left: the model alternates between textual generation and latent reasoning. Right: training combines text prediction, latent alignment, and temporal synchronization objectives.

### 3.1 Audio-Visual Latent Reasoning

Text-only CoT provides useful logical structure, but it is inefficient for revisiting dense audio-visual evidence. LatentOmni therefore alternates between explicit textual deduction and latent reasoning phases that operate directly on continuous audio-visual states. Given encoded visual features $H^{v}$, audio features $H^{a}$, and a textual query $H^{q}$, the model autoregressively generates a hybrid sequence of text tokens and latent states. When it needs to revisit audio-visual evidence, it emits a special token `<Unified_Latent>`, which switches decoding from the discrete vocabulary space $\mathcal{V}$ to a continuous latent space $\mathbb{R}^{d}$. After generating $K$ latent embeddings, we explicitly insert a stop token `</Unified_Latent>` to terminate the continuous reasoning phase and revert to explicit textual generation. The resulting reasoning trajectory is

$$S = \left[w_{1:i},\, u,\, z_{1:K},\, u',\, w_{i+1:j},\, u,\, z_{K+1:2K},\, u',\, \dots,\, a\right], \tag{1}$$

where $w$ denotes text tokens, $u$ is the `<Unified_Latent>` trigger, $u'$ is the inserted `</Unified_Latent>` stop token, $z$ denotes continuous latent reasoning states, and $a$ is the final answer. This design keeps text as the scaffold for high-level logic while reserving latent states for evidence-intensive cross-modal reasoning. We analyze the effect of the latent length $K$ in Section [4.3](https://arxiv.org/html/2605.22012#S4.SS3).

### 3.2 Unified Latent Representation and Temporal Alignment

A remaining design question is how to represent latent reasoning states while preserving temporal correspondence across modalities. During each latent reasoning phase triggered by $u$, the model generates a sequence of continuous states autoregressively. At the $k$-th latent step, the latent representation $z_{k} \in \mathbb{R}^{d}$ is instantiated as the last-layer hidden state of the transformer backbone prior to the language modeling head (Fig. [2](https://arxiv.org/html/2605.22012#S3.F2), left):

$$z_{k} = \operatorname{LM}_{\theta}^{(L)}\left(H^{v}, H^{a}, H^{q}, S_{<k}\right), \tag{2}$$

where $L$ denotes the number of transformer layers and $S_{<k}$ is the preceding mixed context of text tokens and latent states. Each generated $z_{k}$ is then fed back as the input embedding for the next latent step, forming a continuous reasoning trajectory of length $K$. We allocate the first $K_{v}$ positions to visual latents and the remaining $K_{a}$ positions to audio latents, which lets the model control modality-specific capacity while keeping all latent states in the same continuous space $\mathbb{R}^{d}$.
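
The feedback loop described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: `toy_backbone` is a hypothetical stand-in for the transformer's last-layer hidden state, and the dimensions and the $K_v$/$K_a$ split are arbitrary values chosen for the example.

```python
import numpy as np

def toy_backbone(context: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the transformer backbone.

    Maps a context of input embeddings with shape (n, d) to a single
    "last-layer hidden state" of shape (d,).
    """
    return np.tanh(context.mean(axis=0) @ weight)

def latent_rollout(context, weight, k_v, k_a):
    """Generate K = k_v + k_a latent states autoregressively.

    Each state z_k is appended to the context as the next input embedding,
    mirroring the feedback described around Eq. (2).
    """
    latents = []
    for _ in range(k_v + k_a):
        z_k = toy_backbone(context, weight)
        latents.append(z_k)
        context = np.vstack([context, z_k[None, :]])  # feed z_k back in
    latents = np.stack(latents)
    # First k_v positions are visual latents, remaining k_a are audio latents.
    return latents[:k_v], latents[k_v:], context

rng = np.random.default_rng(0)
d = 8                                    # assumed toy hidden size
prefix = rng.normal(size=(5, d))         # stands in for H^v, H^a, H^q and prior text
weight = rng.normal(size=(d, d)) / np.sqrt(d)

visual_latents, audio_latents, extended = latent_rollout(prefix, weight, k_v=3, k_a=2)
print(visual_latents.shape, audio_latents.shape, extended.shape)
```

The point of the sketch is that no token is sampled inside the loop: the continuous state itself becomes the next input, and the modality split is purely positional.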

Sequential generation, however, creates a mismatch risk: audio and visual latents that refer to the same moment may drift apart positionally. To prevent this, we introduce Omni-Sync Position Embedding (OSPE), which extends the time-aligned multimodal RoPE from Qwen2.5-Omni\[[42](https://arxiv.org/html/2605.22012#bib.bib43)\] to the unified latent space. OSPE assigns a shared physical timestamp $t$ to temporally corresponding visual frames and audio segments. For a latent feature $h \in \{h^{v}, h^{a}\}$ at timestamp $t$, OSPE applies

$$\operatorname{OSPE}(h, t) = h \odot \cos(t\Theta) + \mathcal{R}(h) \odot \sin(t\Theta), \tag{3}$$

where $h^{v}$ and $h^{a}$ denote latent visual and audio features, $\Theta = \{\theta_{i}\}_{i=1}^{d/2}$ is the base frequency vector, $\odot$ denotes the Hadamard product, and $\mathcal{R}(\cdot)$ is the block-diagonal rotation matrix over adjacent feature dimensions. By injecting a synchronized positional prior, OSPE aligns sequentially generated latent features that correspond to the same time window, allowing later reasoning steps to attend to temporally consistent cross-modal evidence.
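
A minimal NumPy sketch of Eq. (3) follows. The frequency schedule `base ** (-2i/d)` is the standard RoPE choice and is an assumption here, since the text does not specify $\Theta$; the function names `rotate_pairs` and `ospe` are illustrative only.

```python
import numpy as np

def rotate_pairs(h: np.ndarray) -> np.ndarray:
    """R(h): map each adjacent pair (x0, x1) to (-x1, x0)."""
    out = np.empty_like(h)
    out[..., 0::2] = -h[..., 1::2]
    out[..., 1::2] = h[..., 0::2]
    return out

def ospe(h: np.ndarray, t: float, base: float = 10000.0) -> np.ndarray:
    """Apply h * cos(t*Theta) + R(h) * sin(t*Theta) at timestamp t.

    `base` and the geometric frequency schedule are assumed (standard RoPE).
    """
    d = h.shape[-1]
    theta = base ** (-np.arange(d // 2) * 2.0 / d)  # d/2 frequencies
    angles = np.repeat(t * theta, 2)                # one angle per adjacent pair
    return h * np.cos(angles) + rotate_pairs(h) * np.sin(angles)

rng = np.random.default_rng(0)
h_v = rng.normal(size=8)   # toy visual latent
h_a = rng.normal(size=8)   # toy audio latent

# Same timestamp: the rotation cancels in the dot product.
same_time = ospe(h_v, 3.0) @ ospe(h_a, 3.0)
# Different timestamps: the score depends only on the time offset.
offset_a = ospe(h_v, 1.0) @ ospe(h_a, 4.0)
offset_b = ospe(h_v, 11.0) @ ospe(h_a, 14.0)
print(np.isclose(same_time, h_v @ h_a), np.isclose(offset_a, offset_b))
```

Because each adjacent pair is rotated by the same angle for both modalities, an audio latent and a visual latent that share a timestamp interact as if no positional rotation had been applied, which is the synchronization property the paragraph above relies on.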

![Refer to caption](https://arxiv.org/html/2605.22012v1/x3.png) Figure 3: Construction pipeline of LatentOmni-Instruct-35K.

### 3.3 LatentOmni-Instruct-35K Dataset Construction

Latent-space reasoning requires supervision beyond standard question-answer pairs: the model must know which local audio-visual evidence should be revisited at each step. Existing datasets rarely provide such segment-grounded interleaved trajectories. We therefore build LatentOmni-Instruct-35K through a three-stage pipeline, shown in Fig. [3](https://arxiv.org/html/2605.22012#S3.F3), consisting of AVQA synthesis and filtering, segment-level caption synthesis, and audio-visual interleaved reasoning trajectory synthesis.

AVQA Data Synthesis & Filtering. We first collect raw samples from two temporally aligned audio-visual caption datasets, ASID\[[21](https://arxiv.org/html/2605.22012#bib.bib53)\] and AVoCaDO\[[3](https://arxiv.org/html/2605.22012#bib.bib3)\], and use Qwen3-235B-A22B\[[45](https://arxiv.org/html/2605.22012#bib.bib54)\] to transform cross-modal captions into preliminary question-answer pairs. During generation, the model is instructed to produce questions that require cross-modal dependency, cover diverse reasoning types, and preserve answer correctness. We then use GLM-4.7\[[50](https://arxiv.org/html/2605.22012#bib.bib55)\] to assign each pair a category and three quality scores: difficulty, logical soundness, and modality dependency. Samples with a total score below 13 are discarded, and the ratio between any two adjacent categories is constrained to be within $3\times$ to avoid severe imbalance. This stage yields a higher-quality AVQA pool with stronger logical rigor and modality coupling. Prompts are provided in Appendices [A.2](https://arxiv.org/html/2605.22012#A1.SS2) and [A.3](https://arxiv.org/html/2605.22012#A1.SS3).
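
The filtering rule can be illustrated with a short sketch. The field names, the sample records, and the reading of "adjacent categories" as neighbours after sorting by size are all assumptions made for this example; the text specifies only the threshold of 13 and the $3\times$ bound.

```python
from collections import Counter

def filter_by_score(samples, min_total=13):
    """Keep samples whose three quality scores sum to at least min_total."""
    kept = []
    for s in samples:
        total = s["difficulty"] + s["logic"] + s["modality"]
        if total >= min_total:
            kept.append(s)
    return kept

def adjacent_ratios_ok(counts, max_ratio=3.0):
    """Check that neighbouring category sizes (sorted descending) differ by
    at most max_ratio. This ordering is one possible reading of the text."""
    sizes = sorted(counts.values(), reverse=True)
    return all(big <= max_ratio * small for big, small in zip(sizes, sizes[1:]))

# Hypothetical scored samples (field names are illustrative).
samples = [
    {"category": "causal",   "difficulty": 5, "logic": 5, "modality": 4},
    {"category": "causal",   "difficulty": 3, "logic": 4, "modality": 4},
    {"category": "temporal", "difficulty": 4, "logic": 5, "modality": 4},
    {"category": "counting", "difficulty": 4, "logic": 4, "modality": 4},
]

kept = filter_by_score(samples)
category_counts = Counter(s["category"] for s in kept)
print(len(kept), adjacent_ratios_ok(category_counts))
```

In a real pipeline the balance check would presumably drive down-sampling of over-represented categories rather than just report a boolean.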

Segment-Level Caption Synthesis. Each retained sample also needs localized audio and visual evidence. We therefore segment the raw streams by timestamp and generate segment-level descriptions. Because joint audio-visual captions often omit one modality\[[3](https://arxiv.org/html/2605.22012#bib.bib3)\], we use Qwen3-30B-A3B-Captioner\[[45](https://arxiv.org/html/2605.22012#bib.bib54)\] to produce separate audio and video captions for each segment. Using the original aligned source captions as references, GLM-4.7 then filters hallucinated descriptions, repairs shot fragmentation, and realigns the audio and video captions in time. The result is a set of segment-level captions that are both locally grounded and cross-modally aligned. Prompts are provided in Appendices [A.4](https://arxiv.org/html/2605.22012#A1.SS4) and [A.5](https://arxiv.org/html/2605.22012#A1.SS5).

Audio-Visual Interleaved Reasoning Trajectory Synthesis. Finally, we synthesize full reasoning trajectories from the filtered AVQA pairs and segment-level captions. GLM-4.7 generates reasoning chains that insert explicit markers whenever a step requires a specific audio-visual segment. Gemini-2.5-Flash then audits these trajectories by correcting citation errors and removing redundant or inconsistent branches. After discarding trajectories with major hallucinations or contradictions, we replace the markers with their corresponding audio-visual segments to obtain the final 35K-sample dataset.

### 3.4 LatentOmni Training

Our training objective must satisfy three requirements simultaneously: preserve temporal correspondence between audio and vision, ground latent states in native sensory evidence, and retain the model's language-generation ability. We therefore perform supervised fine-tuning on LatentOmni using the audio-visual interleaved CoT dataset from Sec. [3.3](https://arxiv.org/html/2605.22012#S3.SS3) and optimize three complementary objectives over the hybrid reasoning trajectory.

Before asking the model to reason over joint latent states, we first align synchronized audio and visual evidence in the shared space through a temporal synchronization objective ($\mathcal{L}_{\text{sync}}$). Given latent visual features $h_{t}^{v}$ and audio features $h_{t}^{a}$ at matching timestamps $t \in \mathcal{T}$, we optimize a symmetric InfoNCE contrastive loss:

$$\mathcal{L}_{\text{sync}} = -\frac{1}{2|\mathcal{T}|} \sum_{t \in \mathcal{T}} \left( \log \frac{\exp\left(\operatorname{sim}(h_{t}^{v}, h_{t}^{a})/\tau\right)}{\sum_{t'} \exp\left(\operatorname{sim}(h_{t}^{v}, h_{t'}^{a})/\tau\right)} + \log \frac{\exp\left(\operatorname{sim}(h_{t}^{a}, h_{t}^{v})/\tau\right)}{\sum_{t'} \exp\left(\operatorname{sim}(h_{t}^{a}, h_{t'}^{v})/\tau\right)} \right), \tag{4}$$

where $\operatorname{sim}(\cdot, \cdot)$ denotes cosine similarity and $\tau$ is a learnable temperature. This loss pulls together temporally co-occurring audio-visual features while pushing apart asynchronous pairs, thereby establishing a temporally coherent latent space before deeper reasoning takes place.
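
Eq. (4) can be written compactly with a similarity matrix. The sketch below is a NumPy illustration under stated assumptions: the temperature is a fixed constant (`tau=0.07`, an arbitrary value) rather than a learned parameter, and the feature shapes are toy sizes.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def sync_loss(h_v: np.ndarray, h_a: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric InfoNCE over T timestamps; rows of h_v and h_a are paired.

    tau is fixed here for simplicity; the text describes it as learnable.
    """
    h_v = h_v / np.linalg.norm(h_v, axis=1, keepdims=True)
    h_a = h_a / np.linalg.norm(h_a, axis=1, keepdims=True)
    logits = h_v @ h_a.T / tau                      # logits[t, t'] = sim(h_t^v, h_t'^a) / tau
    v_to_a = np.diag(log_softmax(logits, axis=1))   # normalise over audio candidates
    a_to_v = np.diag(log_softmax(logits, axis=0))   # normalise over visual candidates
    return float(-(v_to_a + a_to_v).sum() / (2 * len(h_v)))

rng = np.random.default_rng(0)
h_v = rng.normal(size=(6, 16))                      # toy visual latents, T=6
aligned = h_v + 0.01 * rng.normal(size=(6, 16))     # audio latents matching in time
shuffled = np.roll(aligned, 1, axis=0)              # same features, wrong timestamps

print(sync_loss(h_v, aligned), sync_loss(h_v, shuffled))
```

The loss is small when each row's positive sits on the diagonal and grows when the same features are paired with the wrong timestamps, which is the behaviour the synchronization objective is meant to enforce.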

Temporal alignment alone, however, does not guarantee that latent reasoning remains attached to the source evidence. To counter the language-bound tendency identified in Sec. [1](https://arxiv.org/html/2605.22012#S1), we additionally ground each autoregressively generated latent embedding $z_{k}$ in raw sensory features. For each annotated audio-visual segment, we extract features using the model's visual and audio encoders and compress them into a dense anchor sequence $A = [a_{1}, \dots, a_{K}]$, consisting of $K_{v}$ visual and $K_{a}$ audio anchors ($K = K_{v} + K_{a}$). We use parameter-free L2-norm-weighted pooling for this compression so that salient transient actions and acoustic events are preserved. As reasoning unfolds autoregressively, each generated state $z_{k}$ is aligned with its corresponding anchor $a_{k}$ using a latent alignment loss:

$$\mathcal{L}_{\text{latent}} = \frac{1}{K} \sum_{k=1}^{K} \left\| z_{k} - a_{k} \right\|_{2}^{2}. \tag{5}$$

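
The anchor construction and Eq. (5) can be sketched as follows. The pooling details are not given in the text, so this example assumes one plausible form: the feature sequence is split into contiguous chunks, and each chunk is averaged with weights proportional to each vector's L2 norm. Function names and sizes are illustrative.

```python
import numpy as np

def norm_weighted_pool(features: np.ndarray, num_anchors: int) -> np.ndarray:
    """Compress (n, d) encoder features into (num_anchors, d) anchors.

    Assumed scheme: contiguous chunks, each averaged with weights
    proportional to per-vector L2 norms (no learned parameters).
    """
    anchors = []
    for chunk in np.array_split(features, num_anchors, axis=0):
        norms = np.linalg.norm(chunk, axis=1)
        weights = norms / (norms.sum() + 1e-8)   # eps guards all-zero chunks
        anchors.append(weights @ chunk)
    return np.stack(anchors)

def latent_alignment_loss(z: np.ndarray, anchors: np.ndarray) -> float:
    """Eq. (5): mean over K of the squared L2 distance between z_k and a_k."""
    return float(np.mean(np.sum((z - anchors) ** 2, axis=1)))

rng = np.random.default_rng(0)
d = 4
visual_feats = rng.normal(size=(12, d))   # toy visual encoder output
audio_feats = rng.normal(size=(8, d))     # toy audio encoder output

# K_v = 3 visual anchors followed by K_a = 2 audio anchors.
anchors = np.concatenate([norm_weighted_pool(visual_feats, 3),
                          norm_weighted_pool(audio_feats, 2)])
print(anchors.shape, latent_alignment_loss(anchors + 1.0, anchors))
```

Norm weighting lets high-magnitude features dominate each anchor, which matches the stated goal of preserving salient transient events better than a plain mean would.
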
Latent supervision should not come at the expense of the model's linguistic priors. We therefore apply a standard next-token prediction loss ($\mathcal{L}_{\text{text}}$) over all discrete tokens in the hybrid sequence. Given a reasoning sequence $S = \{s_{1}, s_{2}, \dots, s_{L}\}$ containing both text tokens and continuous latent states, we compute the autoregressive cross-entropy loss only on the elements that belong to the vocabulary $\mathcal{V}$:

$$\mathcal{L}_{\text{text}} = -\frac{1}{N_{\text{text}}} \sum_{t=1}^{L} \mathbb{I}(s_{t} \in \mathcal{V}) \log p(s_{t} \mid S_{<t}, H^{v}, H^{a}, H^{q}), \tag{6}$$

where $\mathbb{I}(\cdot)$ is the indicator function, $N_{\text{text}}$ is the number of discrete tokens (including text reasoning tokens $w$, the trigger token $u$, and the final answer $a$), and $S_{<t}$ denotes the preceding hybrid context. This preserves the model's ability to perform explicit textual deduction while conditioning each token on the interleaved history of text and latent evidence.
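
The indicator in Eq. (6) amounts to a boolean mask over sequence positions. A minimal sketch, assuming log-probabilities are already available and that latent positions carry a placeholder target id (an implementation convenience introduced here, not something stated in the text):

```python
import numpy as np

def text_loss(log_probs: np.ndarray, targets: np.ndarray, is_text: np.ndarray) -> float:
    """Masked next-token cross-entropy over a hybrid sequence.

    log_probs: (L, V) log-probabilities at each position.
    targets:   (L,) token ids; values at latent positions are ignored.
    is_text:   (L,) boolean mask, True where s_t is a vocabulary token.
    """
    picked = log_probs[np.arange(len(targets)), targets]
    return float(-(picked * is_text).sum() / is_text.sum())

# Toy sequence: text, latent, latent, text, text (vocabulary of 4 tokens).
log_probs = np.log(np.full((5, 4), 0.25))
targets = np.array([1, 0, 0, 2, 3])          # 0 is a placeholder at latent positions
is_text = np.array([True, False, False, True, True])

loss = text_loss(log_probs, targets, is_text)

# Changing predictions at latent positions leaves the loss untouched.
altered = log_probs.copy()
altered[1:3] = -50.0
print(np.isclose(loss, text_loss(altered, targets, is_text)))
```

Normalising by the number of text positions rather than the full length keeps the loss scale independent of how many latent states a trajectory contains.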

The model is optimized end\-to\-end with the combined objective function:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{text}} + \lambda_{1} \mathcal{L}_{\text{latent}} + \lambda_{2} \mathcal{L}_{\text{sync}}, \tag{7}$$

where $\lambda_{1}$ and $\lambda_{2}$ are balancing hyperparameters. The final objective jointly balances textual fluency, modality grounding, and temporal alignment, enabling LatentOmni to reason with continuous audio-visual evidence without abandoning the structural benefits of language.

## 4 Experiments

### 4.1 Experimental Setup

Training. Following the pipeline in Section [3.4](https://arxiv.org/html/2605.22012#S3.SS4), we train LatentOmni from Qwen2.5-Omni-7B using LatentOmni-Instruct-35K (Section [3.3](https://arxiv.org/html/2605.22012#S3.SS3)). We fine-tune the model for 750 steps (2 epochs), so the comparison mainly reflects the effect of the proposed post-training objective rather than a change in backbone scale. Unless otherwise stated, both training and evaluation use a fixed budget of 40 latent tokens, selected by ablating the total token count and the audio-visual allocation ratio. This fixed setting keeps the inference interface identical across examples and avoids per-sample tuning of the latent length. It is also consistent with prior observations that fixed latent budgets are more stable than dynamic schedules in practical reasoning settings\[[17](https://arxiv.org/html/2605.22012#bib.bib35)\].

Benchmarks. We evaluate audio-visual joint reasoning on four omnimodal benchmarks that stress complementary capabilities: everyday scenario reasoning (Daily-Omni\[[57](https://arxiv.org/html/2605.22012#bib.bib44)\]), physical and spatial-temporal commonsense (WorldSense\[[14](https://arxiv.org/html/2605.22012#bib.bib45)\]), cross-modal alignment and question answering (OmniVideoBench\[[18](https://arxiv.org/html/2605.22012#bib.bib46)\]), and long-form multi-sensory understanding (LVOmniBench\[[32](https://arxiv.org/html/2605.22012#bib.bib47)\]). This benchmark suite is intended to test whether latent reasoning helps beyond a single data regime: Daily-Omni emphasizes common event understanding, WorldSense tests structured commonsense over time and space, OmniVideoBench contains fine-grained audio-type and video-duration splits, and LVOmniBench stresses sustained reasoning over longer inputs.

Baselines. We organize baselines to match the analysis order in Section [4.2](https://arxiv.org/html/2605.22012#S4.SS2). First, we compare with representative open-source audio-visual MLLMs, including VideoLLaMA2-7B\[[5](https://arxiv.org/html/2605.22012#bib.bib64)\], MiniCPM-o-7B\[[47](https://arxiv.org/html/2605.22012#bib.bib49)\], VITA-1.5-7B\[[10](https://arxiv.org/html/2605.22012#bib.bib50)\], HumanOmniV2-7B\[[46](https://arxiv.org/html/2605.22012#bib.bib52)\], Baichuan-Omni-1.5, OmniVinci, and the Qwen2.5-Omni-7B base model\[[42](https://arxiv.org/html/2605.22012#bib.bib43)\]. Second, we isolate the effect of latent reasoning from text-only reasoning and ordinary fine-tuning under the same backbone. Explicit Text CoT removes all interleaved audio-video segments from LatentOmni-Instruct-35K and fine-tunes Qwen2.5-Omni-7B on strictly textual reasoning trajectories, while Vanilla SFT directly fine-tunes Qwen2.5-Omni-7B on LatentOmni-Instruct-35K without latent-space reasoning. This pair of controls separates three factors that are otherwise easy to conflate: additional instruction data, explicit textual rationales, and continuous audio-visual latent states. Third, we compare with recent visual latent reasoning methods, Monet\[[35](https://arxiv.org/html/2605.22012#bib.bib36)\] and LVR\[[17](https://arxiv.org/html/2605.22012#bib.bib35)\], under their vision-only setting. We also report proprietary systems, including GPT-4o\[[15](https://arxiv.org/html/2605.22012#bib.bib58)\], Gemini-2.0-Flash, Gemini-2.5-Pro\[[6](https://arxiv.org/html/2605.22012#bib.bib56)\], and Gemini-3-Pro\[[25](https://arxiv.org/html/2605.22012#bib.bib57)\], as reference points rather than directly controlled baselines.

Table 1: Performance on four omnimodal benchmarks. Proprietary systems are included as reference points; the best result among open-source models and Qwen2.5-Omni variants is highlighted.

Table 2: Accuracy comparison on OmniVideoBench. Closed-source systems are reported as reference points; within open-source rows, the best result is highlighted and the second-best is underlined. The gain of LatentOmni over the base model is shown in red parentheses.

### 4.2 Main Results

Table [1](https://arxiv.org/html/2605.22012#S4.T1) summarizes the main results on four omnimodal benchmarks\. We report proprietary systems for context, but focus the controlled comparison on open\-source models, text\-only reasoning variants, and latent reasoning baselines\. Overall, LatentOmni achieves the best performance among the evaluated open\-source methods on all four benchmarks, supporting the effectiveness of unified latent\-space reasoning for audio\-visual tasks\.

Table 3: Comparison with recent visual latent reasoning methods on VideoMME under the vision\-only protocol used by prior work\.

Table 4: Ablation of the components of LatentOmni\. The best is highlighted\.

Comparison with Open\-Source Models\. LatentOmni consistently improves over existing open\-source audio\-visual models\. Compared with its base model, Qwen2\.5\-Omni\-7B, LatentOmni obtains absolute gains of 4\.5% on Daily\-Omni, 3\.5% on WorldSense, 6\.1% on OmniVideoBench, and 3\.1% on LVOmniBench\. It also outperforms strong open\-source competitors such as OmniVinci and HumanOmniV2\-7B on the benchmarks where they report results\. The improvement is especially clear on OmniVideoBench, where LatentOmni reaches 35\.4% and surpasses all evaluated 7B open\-source models, indicating stronger cross\-modal alignment and reasoning\.

Comparison with Text CoT\. We next compare LatentOmni with text\-only and standard fine\-tuning variants built on the same base model\. Although Explicit Text CoT improves Qwen2\.5\-Omni\-7B, LatentOmni further raises accuracy by 1\.8% on Daily\-Omni, 2\.3% on WorldSense, 2\.2% on OmniVideoBench, and 3\.0% on LVOmniBench\. Relative to Vanilla SFT, LatentOmni also yields gains on all datasets, with the largest improvements on Daily\-Omni \(\+5\.4%\) and OmniVideoBench \(\+4\.9%\)\. These controlled comparisons suggest that the gain does not come merely from additional instruction tuning or textual rationales, but from preserving reasoning\-relevant audio\-visual evidence in latent states\.

Comparison with Latent Reasoning Methods\. We further compare with recent visual latent reasoning methods, LVR and Monet, on VideoMME\. Because these methods are vision\-centric, we follow a vision\-only protocol without audio inputs\. As shown in Table [3](https://arxiv.org/html/2605.22012#S4.T3), LatentOmni achieves the highest overall score \(60\.8\) and leads across short, medium, and long videos\. This result suggests that the proposed latent reasoning design remains effective even when evaluated outside the full audio\-visual setting\.

Fine\-Grained OmniVideoBench Analysis\. Table [2](https://arxiv.org/html/2605.22012#S4.T2) provides a more detailed view of cross\-modal reasoning behavior\. Among open\-source methods, LatentOmni achieves the highest average accuracy \(35\.4%\), improving over the base model by 6\.1pp\. It leads on music and speech questions, all short\-to\-medium duration buckets, and ties for the best score on the longest videos \(the \(10, 30\] min bucket\)\. Compared with Explicit Text CoT, LatentOmni improves the average accuracy by 2\.2pp and shows a clear advantage on long\-form video reasoning \(34\.0% vs\. 30\.7% on the longest subset\), supporting the benefit of synchronized continuous latent states for sustained audio\-visual understanding\.
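
As a consistency check on the numbers quoted in this section, the gains over each controlled baseline can be inverted to recover the OmniVideoBench averages they imply\. The Python sketch below assumes that every quoted gain is an absolute difference in percentage points from LatentOmni's 35\.4% average; the names `LATENTOMNI_AVG`, `GAINS_PP`, `implied_baselines`, and `relative_gain` are illustrative, and the derived averages are implied by the text rather than copied from Table 2\.

```python
from typing import Dict

# LatentOmni's OmniVideoBench average and the gains quoted in the text.
# ASSUMPTION: all gains are absolute differences in percentage points.
LATENTOMNI_AVG = 35.4
GAINS_PP = {
    "Qwen2.5-Omni-7B": 6.1,
    "Explicit Text CoT": 2.2,
    "Vanilla SFT": 4.9,
}


def implied_baselines(latent_avg: float, gains_pp: Dict[str, float]) -> Dict[str, float]:
    """Subtract each quoted gain from LatentOmni's average to get the implied baseline."""
    return {name: round(latent_avg - gain, 1) for name, gain in gains_pp.items()}


def relative_gain(latent_avg: float, baseline_avg: float) -> float:
    """Relative improvement in percent, as opposed to the absolute gain in points."""
    return round(100.0 * (latent_avg - baseline_avg) / baseline_avg, 1)


IMPLIED = implied_baselines(LATENTOMNI_AVG, GAINS_PP)

if __name__ == "__main__":
    for name, avg in IMPLIED.items():
        print(f"{name}: implied avg {avg}, relative gain {relative_gain(LATENTOMNI_AVG, avg)}%")
```

Separating absolute from relative gain matters here because a 6\.1\-point improvement on a baseline near 30% is a much larger relative change than the same improvement would be on a benchmark where the baseline already scores above 60%\.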

### 4\.3 Ablation Study

We ablate the main design choices of LatentOmni to identify where the gains come from\. Specifically, we examine the modality composition of the unified latent space, the role of OSPE, the latent sequence configuration, and the individual contributions of $\mathcal{L}_{\text{latent}}$ and $\mathcal{L}_{\text{sync}}$\. Unless otherwise noted, ablations follow the same evaluation protocol as the main experiments\.

Component Analysis\. Table [4](https://arxiv.org/html/2605.22012#S4.T4) shows that removing either audio or visual features from the latent space consistently degrades performance, confirming that both modalities contribute to the final reasoning trajectory\. Removing OSPE also reduces accuracy on every benchmark \(e\.g\., $67.4 \rightarrow 66.0$ on Daily\-Omni and $35.1 \rightarrow 33.1$ on LVOmniBench\), supporting the importance of cross\-modal temporal alignment\. Among the training objectives, $\mathcal{L}_{\text{latent}}$ is the most influential: without it, performance drops sharply to 61\.0 on Daily\-Omni and 31\.8 on OmniVideoBench\. Ablating $\mathcal{L}_{\text{sync}}$ yields smaller but consistent losses, indicating that temporal synchronization complements latent grounding rather than replacing it\.

![Refer to caption](https://arxiv.org/html/2605.22012v1/x4.png)

Figure 4: Ablation studies on latent configurations across three benchmarks, specifically evaluating the impact of latent token counts and the allocation ratio of audio and visual latents\.

Impact of Latent Token Configuration\. Figure [4](https://arxiv.org/html/2605.22012#S4.F4) further studies the length and modality allocation of latent reasoning trajectories\. Scaling the total number of latent tokens shows an empirical optimum at 40 tokens: shorter sequences appear to limit representational capacity, while longer sequences add computation without consistent gains\. With the length fixed to 40, allocating 32 tokens to visual latents and 8 tokens to audio latents achieves the best overall performance\. These results support our default configuration and suggest that audio\-visual latent reasoning benefits from a larger visual budget while still requiring a dedicated audio allocation\.
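
The default allocation described above \(40 latent tokens, split 32 visual and 8 audio\) can be written down as a small configuration object\. The following Python sketch is only illustrative: the class `LatentBudget`, the helper `splits_for_total`, and the step size of the sweep grid are hypothetical names and choices, not taken from the released implementation, and the grid does not necessarily match the ratios plotted in Figure 4\.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LatentBudget:
    """Hypothetical container for the latent-token allocation of a reasoning sequence."""
    visual_tokens: int = 32  # default visual budget reported in the ablation
    audio_tokens: int = 8    # default audio budget reported in the ablation

    def __post_init__(self) -> None:
        if self.visual_tokens < 0 or self.audio_tokens < 0:
            raise ValueError("token counts must be non-negative")

    @property
    def total(self) -> int:
        return self.visual_tokens + self.audio_tokens

    @property
    def visual_fraction(self) -> float:
        return self.visual_tokens / self.total


def splits_for_total(total: int, step: int = 8) -> List[LatentBudget]:
    """Enumerate candidate (visual, audio) splits for a fixed total length.

    ASSUMPTION: the step of 8 is an arbitrary grid for illustration.
    """
    return [LatentBudget(v, total - v) for v in range(0, total + 1, step)]


DEFAULT_BUDGET = LatentBudget()
```

Keeping the two budgets as separate fields, rather than a single ratio, makes it straightforward to express both ablation axes: varying `total` at a fixed ratio, or varying the split at a fixed total of 40\.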

## 5 Conclusion

This paper addresses audio\-visual reasoning in MLLMs by proposing LatentOmni, a framework that interleaves explicit textual reasoning with synchronized latent audio\-visual states\. The key idea is to keep intermediate reasoning grounded in native sensory evidence rather than forcing every step through a text\-only bottleneck\. To this end, we introduce feature\-level latent supervision, Omni\-Sync Position Embedding \(OSPE\) for cross\-modal temporal alignment, and LatentOmni\-Instruct\-35K for supervising audio\-visual interleaved reasoning trajectories\. Across four omnimodal benchmarks, LatentOmni consistently improves over both Qwen2\.5\-Omni\-7B and an Explicit Text CoT baseline, and achieves the best performance among the evaluated open\-source models\. These results demonstrate the promise of latent\-space joint reasoning as a practical and effective path toward more faithful omnimodal understanding\.

## References

- \[1\] \(2018\) Multimodal machine learning: a survey and taxonomy\. IEEE transactions on pattern analysis and machine intelligence 41 \(2\), pp\. 423–443\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1)\.
- \[2\] A\. Bardes, Q\. Garrido, J\. Ponce, X\. Chen, M\. Rabbat, Y\. LeCun, M\. Assran, and N\. Ballas \(2024\) Revisiting feature prediction for learning visual representations from video\. arXiv preprint arXiv:2404\.08471\. Cited by: [§2\.2](https://arxiv.org/html/2605.22012#S2.SS2.p1.1)\.
- \[3\] X\. Chen, Y\. Ding, W\. Lin, J\. Hua, L\. Yao, Y\. Shi, B\. Li, Y\. Zhang, Q\. Liu, P\. Wan, et al\. \(2025\) Avocado: an audiovisual video captioner driven by temporal orchestration\. arXiv preprint arXiv:2510\.10395\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1), [§3\.3](https://arxiv.org/html/2605.22012#S3.SS3.p2.1), [§3\.3](https://arxiv.org/html/2605.22012#S3.SS3.p3.1)\.
- \[4\] X\. Chen, W\. Lin, J\. Hua, L\. Yao, Y\. Ding, B\. Li, B\. Zeng, Y\. Shi, Q\. Liu, Y\. Zhang, et al\. \(2026\) DiaDem: advancing dialogue descriptions in audiovisual video captioning for multimodal large language models\. arXiv preprint arXiv:2601\.19267\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1)\.
- \[5\] Z\. Cheng, S\. Leng, H\. Zhang, Y\. Xin, X\. Li, G\. Chen, Y\. Zhu, W\. Zhang, Z\. Luo, D\. Zhao, et al\. \(2024\) Videollama 2: advancing spatial\-temporal modeling and audio understanding in video\-llms\. arXiv preprint arXiv:2406\.07476\. Cited by: [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p3.1)\.
- \[6\] G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat, N\. Sachdeva, I\. Dhillon, M\. Blistein, O\. Ram, D\. Zhang, E\. Rosen, et al\. \(2025\) Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\. arXiv preprint arXiv:2507\.06261\. Cited by: [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p3.1)\.
- \[7\] Y\. Ding, Y\. Ji, J\. Li, X\. Liu, X\. Chen, J\. Wu, B\. Li, B\. Zeng, Y\. Shi, Y\. Guan, et al\. \(2026\) OmniSIFT: modality\-asymmetric token compression for efficient omni\-modal large language models\. arXiv preprint arXiv:2602\.04804\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1)\.
- \[8\] Y\. Dong, Z\. Liu, H\. Sun, J\. Yang, W\. Hu, Y\. Rao, and Z\. Liu \(2025\) Insight\-v: exploring long\-chain visual reasoning with multimodal large language models\. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp\. 9062–9072\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p1.1)\.
- \[9\] H\. Fang, J\. Li, J\. Kong, T\. Zhuang, K\. Gao, B\. Chen, S\. Xia, and Y\. Wang \(2026\) Seeing through the chain: mitigate hallucination in multimodal reasoning models via cot compression and contrastive preference optimization\. arXiv preprint arXiv:2602\.03380\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p2.1)\.
- \[10\] C\. Fu, H\. Lin, X\. Wang, Y\. Zhang, Y\. Shen, X\. Liu, H\. Cao, Z\. Long, H\. Gao, K\. Li, et al\. \(2025\) Vita\-1\.5: towards gpt\-4o level real\-time vision and speech interaction\. arXiv preprint arXiv:2501\.01957\. Cited by: [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p3.1)\.
- \[11\] R\. Girdhar, A\. El\-Nouby, Z\. Liu, M\. Singh, K\. V\. Alwala, A\. Joulin, and I\. Misra \(2023\) Imagebind: one embedding space to bind them all\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 15180–15190\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p1.1)\.
- \[12\] S\. Goyal, Z\. Ji, A\. S\. Rawat, A\. K\. Menon, S\. Kumar, and V\. Nagarajan \(2023\) Think before you speak: training language models with pause tokens\. arXiv preprint arXiv:2310\.02226\. Cited by: [§2\.2](https://arxiv.org/html/2605.22012#S2.SS2.p1.1)\.
- \[13\] S\. Hao, S\. Sukhbaatar, D\. Su, X\. Li, Z\. Hu, J\. Weston, and Y\. Tian \(2024\) Training large language models to reason in a continuous latent space\. arXiv preprint arXiv:2412\.06769\. Cited by: [§2\.2](https://arxiv.org/html/2605.22012#S2.SS2.p1.1)\.
- \[14\] J\. Hong, S\. Yan, J\. Cai, X\. Jiang, Y\. Hu, and W\. Xie \(2025\) Worldsense: evaluating real\-world omnimodal understanding for multimodal llms\. arXiv preprint arXiv:2502\.04326\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1), [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p2.1)\.
- \[15\] A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes, A\. Radford, et al\. \(2024\) Gpt\-4o system card\. arXiv preprint arXiv:2410\.21276\. Cited by: [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p3.1)\.
- \[16\] S\. S\. Kancheti, A\. S\. Kanade, V\. N\. Balasubramanian, and T\. Ganu \(2026\) Chain\-of\-thought degrades visual spatial reasoning capabilities of multimodal llms\. arXiv preprint arXiv:2604\.16060\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p2.1)\.
- \[17\] B\. Li, X\. Sun, J\. Liu, Z\. Wang, J\. Wu, X\. Yu, H\. Chen, E\. Barsoum, M\. Chen, and Z\. Liu \(2025\) Latent visual reasoning\. arXiv preprint arXiv:2509\.24251\. Cited by: [§2\.2](https://arxiv.org/html/2605.22012#S2.SS2.p1.1), [§2\.2](https://arxiv.org/html/2605.22012#S2.SS2.p2.1), [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p1.1), [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p3.1)\.
- \[18\] C\. Li, Y\. Chen, Y\. Ji, J\. Xu, Z\. Cui, S\. Li, Y\. Zhang, W\. Wang, Z\. Song, D\. Zhang, et al\. \(2025\) Omnivideobench: towards audio\-visual understanding evaluation for omni mllms\. arXiv preprint arXiv:2510\.10689\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1), [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p2.1)\.
- \[19\] J\. Li, D\. Li, S\. Savarese, and S\. Hoi \(2023\) Blip\-2: bootstrapping language\-image pre\-training with frozen image encoders and large language models\. In International conference on machine learning, pp\. 19730–19742\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p1.1)\.
- \[20\] K\. Li, C\. Shang, L\. Karlinsky, R\. Feris, T\. Darrell, and R\. Herzig \(2025\) Latent implicit visual reasoning\. arXiv preprint arXiv:2512\.21218\. Cited by: [§2\.2](https://arxiv.org/html/2605.22012#S2.SS2.p1.1), [§2\.2](https://arxiv.org/html/2605.22012#S2.SS2.p2.1)\.
- \[21\] Y\. Li, H\. Zhang, M\. Guo, W\. Gao, S\. Jia, S\. Jiao, Q\. Hou, and M\. Cheng \(2026\) Towards universal video mllms with attribute\-structured and quality\-verified instructions\. arXiv preprint arXiv:2602\.13013\. Cited by: [§3\.3](https://arxiv.org/html/2605.22012#S3.SS3.p2.1)\.
- \[22\] C\. Liu, Y\. Yang, Y\. Fan, Q\. Wei, S\. Liu, and X\. E\. Wang \(2025\) Reasoning within the mind: dynamic multimodal interleaving in latent space\. arXiv preprint arXiv:2512\.12623\. Cited by: [§2\.2](https://arxiv.org/html/2605.22012#S2.SS2.p1.1)\.
- \[23\] Z\. Ma, Z\. Chen, Y\. Wang, E\. Chng, and X\. Chen \(2025\) Audio\-cot: exploring chain\-of\-thought reasoning in large audio language model\. In 2025 IEEE Automatic Speech Recognition and Understanding Workshop \(ASRU\), pp\. 1–6\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p1.1)\.
- \[24\] T\. Pham and C\. Ngo \(2025\) Multimodal chain of continuous thought for latent\-space reasoning in vision\-language models\. arXiv preprint arXiv:2508\.12587\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p2.1)\.
- \[25\] S\. Pichai, D\. Hassabis, and K\. Kavukcuoglu \(2025\) A new era of intelligence with gemini 3\. Google\. URL: https://blog\.google/products\-and\-platforms/products/gemini/gemini\-3/\. Cited by: [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p3.1)\.
- \[26\] Z\. Qian, Y\. Ma, Z\. Ouyang, Z\. Wang, Z\. Xu, F\. Luo, X\. Liu, Z\. Ge, Y\. Guo, and J\. Han \(2026\) Cognitive pivot points and visual anchoring: unveiling and rectifying hallucinations in multimodal reasoning models\. arXiv preprint arXiv:2604\.10219\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p2.1)\.
- \[27\] Y\. Qin, B\. Wei, J\. Ge, K\. Kallidromitis, S\. Fu, T\. Darrell, and X\. Wang \(2025\) Chain\-of\-visual\-thought: teaching vlms to see and think better with continuous visual tokens\. arXiv preprint arXiv:2511\.19418\. Cited by: [§2\.2](https://arxiv.org/html/2605.22012#S2.SS2.p2.1)\.
- \[28\] H\. Shao, S\. Qian, H\. Xiao, G\. Song, Z\. Zong, L\. Wang, Y\. Liu, and H\. Li \(2024\) Visual cot: advancing multi\-modal language models with a comprehensive dataset and benchmark for chain\-of\-thought reasoning\. Advances in Neural Information Processing Systems 37, pp\. 8612–8642\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p2.1), [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p1.1)\.
- \[29\] Y\. Shi, J\. Liu, Y\. Guan, Z\. Wu, Y\. Zhang, Z\. Wang, W\. Lin, J\. Hua, Z\. Wang, X\. Chen, et al\. \(2025\) Mavors: multi\-granularity video representation for multimodal large language model\. In Proceedings of the 33rd ACM International Conference on Multimedia, pp\. 10994–11003\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p1.1)\.
- \[30\] Y\. Shi, H\. Wang, W\. Xie, H\. Zhang, L\. Zhao, Y\. Zhang, X\. Li, C\. Fu, Z\. Wen, W\. Liu, et al\. \(2025\) Mme\-videoocr: evaluating ocr\-based capabilities of multimodal llms in video scenarios\. arXiv preprint arXiv:2505\.21333\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1)\.
- \[31\] Z\. Su, P\. Xia, H\. Guo, Z\. Liu, Y\. Ma, X\. Qu, J\. Liu, Y\. Li, K\. Zeng, Z\. Yang, et al\. \(2025\) Thinking with images for multimodal reasoning: foundations, methods, and future frontiers\. arXiv preprint arXiv:2506\.23918\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p2.1)\.
- \[32\] K\. Tao, Y\. Zheng, J\. Xu, W\. Du, K\. Shao, H\. Wang, X\. Chen, X\. Jin, J\. Zhu, B\. Yu, et al\. \(2026\) LVOmniBench: pioneering long audio\-video understanding evaluation for omnimodal llms\. arXiv preprint arXiv:2603\.19217\. Cited by: [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p2.1)\.
- \[33\] G\. Team, R\. Anil, S\. Borgeaud, J\. Alayrac, J\. Yu, R\. Soricut, J\. Schalkwyk, A\. M\. Dai, A\. Hauth, K\. Millican, et al\. \(2023\) Gemini: a family of highly capable multimodal models\. arXiv preprint arXiv:2312\.11805\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p1.1)\.
- \[34\] C\. Tong, M\. Chang, S\. Zhang, Y\. Wang, C\. Liang, Z\. Zhao, R\. An, B\. Zeng, Y\. Shi, Y\. Dai, et al\. \(2026\) CoF\-t2i: video models as pure visual reasoners for text\-to\-image generation\. arXiv preprint arXiv:2601\.10061\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p1.1)\.
- \[35\] Q\. Wang, Y\. Shi, Y\. Wang, Y\. Zhang, P\. Wan, K\. Gai, X\. Ying, and Y\. Wang \(2025\) Monet: reasoning in latent visual space beyond images and language\. arXiv preprint arXiv:2511\.21395\. Cited by: [§2\.2](https://arxiv.org/html/2605.22012#S2.SS2.p1.1), [§2\.2](https://arxiv.org/html/2605.22012#S2.SS2.p2.1), [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p3.1)\.
- \[36\] Y\. Wang, S\. Wu, Y\. Zhang, S\. Yan, Z\. Liu, J\. Luo, and H\. Fei \(2025\) Multimodal chain\-of\-thought reasoning: a comprehensive survey\. arXiv preprint arXiv:2503\.12605\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p2.1), [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p1.1)\.
- \[37\] Y\. Wang, B\. Zeng, C\. Tong, W\. Liu, Y\. Shi, X\. Ma, H\. Liang, Y\. Zhang, and W\. Zhang \(2025\) Scone: bridging composition and distinction in subject\-driven image generation via unified understanding\-generation modeling\. arXiv preprint arXiv:2512\.12675\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p1.1)\.
- \[38\] J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou, et al\. \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. Advances in neural information processing systems 35, pp\. 24824–24837\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p2.1)\.
- \[39\] Z\. Xie, M\. Lin, Z\. Liu, P\. Wu, S\. Yan, and C\. Miao \(2025\) Audio\-reasoner: improving reasoning capability in large audio language models\. arXiv preprint arXiv:2503\.02318\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p1.1)\.
- \[40\] Z\. Xing, X\. Hu, C\. Fu, W\. Wang, J\. Dai, and P\. Heng \(2025\) Echoink\-r1: exploring audio\-visual reasoning in multimodal llms via reinforcement learning\. arXiv preprint arXiv:2505\.04623\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1)\.
- \[41\] Z\. Xiong, Y\. Cai, Z\. Li, J\. Yuan, and Y\. Wang \(2025\) Thinking with sound: audio chain\-of\-thought enables multimodal reasoning in large audio\-language models\. arXiv preprint arXiv:2509\.21749\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p2.1)\.
- \[42\] J\. Xu, Z\. Guo, J\. He, H\. Hu, T\. He, S\. Bai, K\. Chen, J\. Wang, Y\. Fan, K\. Dang, et al\. \(2025\) Qwen2\.5\-omni technical report\. arXiv preprint arXiv:2503\.20215\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p1.1), [§3\.2](https://arxiv.org/html/2605.22012#S3.SS2.p2.3), [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p3.1)\.
- \[43\] S\. Yan, J\. Han, J\. Tsai, H\. Xue, R\. Fang, L\. Hong, Z\. Guo, and R\. Zhang \(2025\) CrossLMM: decoupling long video sequences from lmms via dual cross\-attention mechanisms\. arXiv preprint arXiv:2505\.17020\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1)\.
- \[44\] S\. Yan, J\. Tong, H\. Xue, X\. Tang, Y\. Wang, K\. Shi, G\. Zhang, R\. Li, and Y\. Zou \(2026\) Act wisely: cultivating meta\-cognitive tool use in agentic multimodal models\. arXiv preprint arXiv:2604\.08545\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p2.1)\.
- \[45\] A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\. Cited by: [§3\.3](https://arxiv.org/html/2605.22012#S3.SS3.p2.1), [§3\.3](https://arxiv.org/html/2605.22012#S3.SS3.p3.1)\.
- \[46\] Q\. Yang, S\. Yao, W\. Chen, S\. Fu, D\. Bai, J\. Zhao, B\. Sun, B\. Yin, X\. Wei, and J\. Zhou \(2025\) Humanomniv2: from understanding to omni\-modal reasoning with context\. arXiv preprint arXiv:2506\.21277\. Cited by: [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p3.1)\.
- \[47\] Y\. Yao, T\. Yu, A\. Zhang, C\. Wang, J\. Cui, H\. Zhu, T\. Cai, H\. Li, W\. Zhao, Z\. He, et al\. \(2024\) Minicpm\-v: a gpt\-4v level mllm on your phone\. arXiv preprint arXiv:2408\.01800\. Cited by: [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p3.1)\.
- \[48\] S\. Yin, C\. Fu, S\. Zhao, K\. Li, X\. Sun, T\. Xu, and E\. Chen \(2024\) A survey on multimodal large language models\. National Science Review 11 \(12\), pp\. nwae403\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1)\.
- \[49\] E\. Zelikman, G\. Harik, Y\. Shao, V\. Jayasiri, N\. Haber, and N\. D\. Goodman \(2024\) Quiet\-star: language models can teach themselves to think before speaking\. arXiv preprint arXiv:2403\.09629\. Cited by: [§2\.2](https://arxiv.org/html/2605.22012#S2.SS2.p1.1)\.
- \[50\] A\. Zeng, X\. Lv, Q\. Zheng, Z\. Hou, B\. Chen, C\. Xie, C\. Wang, D\. Yin, H\. Zeng, J\. Zhang, et al\. \(2025\) Glm\-4\.5: agentic, reasoning, and coding \(arc\) foundation models\. arXiv preprint arXiv:2508\.06471\. Cited by: [§3\.3](https://arxiv.org/html/2605.22012#S3.SS3.p2.1)\.
- \[51\] H\. Zhang, X\. Li, and L\. Bing \(2023\) Video\-llama: an instruction\-tuned audio\-visual language model for video understanding\. In Proceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations, pp\. 543–553\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p1.1)\.
- \[52\] H\. Zhang, X\. Gu, J\. Li, C\. Ma, S\. Bai, C\. Zhang, B\. Zhang, Z\. Zhou, D\. He, and Y\. Tang \(2025\) Thinking with videos: multimodal tool\-augmented reinforcement learning for long video reasoning\. arXiv preprint arXiv:2508\.04416\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p2.1)\.
- \[53\] Y\. Zhang, T\. Yu, H\. Tian, C\. Fu, P\. Li, J\. Zeng, W\. Xie, Y\. Shi, H\. Zhang, J\. Wu, et al\. \(2025\) Mm\-rlhf: the next step forward in multimodal llm alignment\. arXiv preprint arXiv:2502\.10391\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1)\.
- \[54\] Y\. Zhang, Y\. Shi, W\. Yu, Q\. Wen, X\. Wang, W\. Yang, Z\. Zhang, L\. Wang, and R\. Jin \(2025\) Debiasing multimodal large language models via penalization of language priors\. In Proceedings of the 33rd ACM International Conference on Multimedia, pp\. 4232–4241\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1)\.
- \[55\] Z\. Zhang, T\. Wang, X\. Gong, Y\. Shi, H\. Wang, D\. Wang, and L\. Hu \(2025\) When modalities conflict: how unimodal reasoning uncertainty governs preference dynamics in mllms\. arXiv preprint arXiv:2511\.02243\. Cited by: [§2\.1](https://arxiv.org/html/2605.22012#S2.SS1.p2.1)\.
- \[56\] Z\. Zhang, A\. Zhang, M\. Li, H\. Zhao, G\. Karypis, and A\. Smola \(2023\) Multimodal chain\-of\-thought reasoning in language models\. arXiv preprint arXiv:2302\.00923\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p2.1)\.
- \[57\] Z\. Zhou, R\. Wang, Z\. Wu, and Y\. Jiang \(2025\) Daily\-omni: towards audio\-visual reasoning with temporal alignment across modalities\. arXiv preprint arXiv:2505\.17862\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1), [§4\.1](https://arxiv.org/html/2605.22012#S4.SS1.p2.1)\.
- \[58\] H\. Zhu, M\. Luo, R\. Wang, A\. Zheng, and R\. He \(2021\) Deep audio\-visual learning: a survey\. International Journal of Automation and Computing 18 \(3\), pp\. 351–376\. Cited by: [§1](https://arxiv.org/html/2605.22012#S1.p1.1)\.

## Appendices

## Appendix A Dataset Details

### A\.1 Caption Database

AvoCaDO is a newly curated dataset consisting of 107K high\-quality, temporally\-aligned audiovisual video captions\. It emphasizes the temporal orchestration between visual and auditory events, offering semantically rich descriptions paired with precise temporal synchronization\. This dataset is specifically designed to enhance temporal coherence, dialogue accuracy, and comprehensive multimodal alignment in audiovisual video captioning tasks\.

ASID features a large\-scale collection of one million structured, fine\-grained audiovisual instruction annotations \(ASID\-1M\)\. It provides single\- and multi\-attribute supervision, covering scenes, objects, actions, speech, camera movements, and narrative elements\. Curated through an automated verification and refinement pipeline, this dataset is designed to mitigate hallucinations and facilitate highly controllable, reliable, and fine\-grained video understanding\.

### A\.2 AVQA Synthesis Instruction

To synthesize high\-quality AVQA data in both open\-ended and multiple\-choice question \(MCQ\) formats, we design specific instructions\. These prompts direct the model to generate question\-answer pairs that satisfy strict cross\-modal dependencies and adequately reflect the diversity of the source captions\. The complete prompts for open\-ended QA and MCQ synthesis are illustrated in Fig\. [5](https://arxiv.org/html/2605.22012#A2.F5) and Fig\. [6](https://arxiv.org/html/2605.22012#A2.F6), respectively\.
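
Because both synthesis prompts request raw JSON with a fixed set of fields, generated samples can be checked mechanically before they enter the dataset\. The Python sketch below validates one MCQ record against the field names shown in the reference schema of Fig\. 6\. The function `validate_mcq`, the specific checks, and the example question are hypothetical; in particular, the rule that `answer_text` must repeat the selected option is an inference from the schema, not a documented part of the pipeline\.

```python
import json

# Field names follow the reference schema in Fig. 6; the validator is hypothetical.
REQUIRED_MCQ_KEYS = {"id", "modality", "question", "options", "answer", "answer_text"}
OPTION_LETTERS = ["A", "B", "C", "D"]


def validate_mcq(raw: str) -> dict:
    """Parse one synthesized MCQ and check it against the Fig. 6 schema fields."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass) if the
    # output is wrapped in Markdown fences or is otherwise malformed JSON.
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_MCQ_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    options = record["options"]
    if not isinstance(options, list) or len(options) != 4:
        raise ValueError("an MCQ must contain exactly four options")
    letters = [str(opt).split(".", 1)[0].strip() for opt in options]
    if letters != OPTION_LETTERS:
        raise ValueError("options must be labelled A. through D. in order")
    if record["answer"] not in OPTION_LETTERS:
        raise ValueError("answer must be one of A, B, C, D")
    # ASSUMPTION: answer_text repeats the option selected by answer.
    if record["answer_text"] != options[OPTION_LETTERS.index(record["answer"])]:
        raise ValueError("answer_text must repeat the option selected by answer")
    return record


# Illustrative record; the question content is made up for this example.
EXAMPLE_MCQ = json.dumps({
    "id": "MCQ_01",
    "modality": "AV",
    "question": "Why does the crowd start cheering?",
    "options": ["A. A goal is scored", "B. The band starts playing",
                "C. Rain begins to fall", "D. The lights go out"],
    "answer": "B",
    "answer_text": "B. The band starts playing",
})
```

A check of this kind only enforces format; the cross\-modal dependency and distractor\-quality constraints in the prompt still require a model\-based or human judgment\.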

### A\.3 AVQA Classification and Filtering Instruction

To implement the category annotation and quality control described in Section [3\.3](https://arxiv.org/html/2605.22012#S3.SS3), we design a joint instruction for GLM\-4\.7\. This prompt directs the model to classify the reasoning type of each preliminary AVQA pair and assess its overall quality, facilitating our threshold\-based filtering process\. The detailed prompt for classification and evaluation is illustrated in Fig\. [7](https://arxiv.org/html/2605.22012#A2.F7)\.
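
The threshold\-based filtering step can be illustrated with a short Python sketch that consumes the judge's JSON output\. The field names \(`evaluation`, `classification`, `modality_dependency`, and the three score keys\) follow the reference schema of the classification prompt in Fig\. 7\. Everything else is an assumption made for illustration: this appendix does not state the threshold value, how scores are aggregated, or whether only AV\-Strong questions are kept, so `min_score=4`, the use of the minimum score, and `require_av_strong` are hypothetical choices\.

```python
import json

# Score fields taken from the reference JSON schema of the evaluation prompt.
SCORE_KEYS = ("context_utilization", "question_difficulty", "deductive_requirement")


def passes_filter(raw_judgment: str, min_score: int = 4, require_av_strong: bool = True) -> bool:
    """Decide whether one judged AVQA pair survives filtering.

    ASSUMPTIONS: the threshold (min_score), the min-over-dimensions rule,
    and the AV-Strong requirement are illustrative, not the paper's settings.
    """
    judgment = json.loads(raw_judgment)
    scores = [int(judgment["evaluation"][key]) for key in SCORE_KEYS]
    if any(score < 1 or score > 5 for score in scores):
        raise ValueError("scores must lie on the 1-5 scale")
    if min(scores) < min_score:
        return False
    dependency = judgment["classification"]["modality_dependency"]
    if require_av_strong and dependency != "AV-Strong":
        return False
    return True


def make_judgment(scores, dependency: str = "AV-Strong") -> str:
    """Build a minimal judgment string for experimentation (hypothetical helper)."""
    return json.dumps({
        "evaluation": dict(zip(SCORE_KEYS, scores)),
        "classification": {"modality_dependency": dependency},
    })
```

Filtering on the minimum rather than the mean is a conservative choice: a question that scores well on difficulty but poorly on context utilization is rejected instead of being averaged into acceptance\.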

### A\.4 Segment Level Caption Synthesis Instruction

To synthesize high\-quality audio\-visual segment\-level captions, we design two distinct prompts tailored for the visual and audio modalities\. These instructions direct the model to produce separate, concise, and objective descriptions that are strongly correlated with the QA pairs, thereby preventing information omission\. The specific prompts for synthesizing observable visual elements and identifiable audio elements are illustrated in Fig\. [8](https://arxiv.org/html/2605.22012#A2.F8) and Fig\. [9](https://arxiv.org/html/2605.22012#A2.F9), respectively\.

### A\.5 Segment Level Caption Fusion and Refinement Instruction

To synthesize comprehensive and cohesive audio\-visual captions while resolving narrative fragmentation caused by shot transitions, we design two sequential instructions for caption fusion and refinement\. The detailed prompts for the fusion and refinement processes are illustrated in Fig\. [10](https://arxiv.org/html/2605.22012#A2.F10) and Fig\. [11](https://arxiv.org/html/2605.22012#A2.F11), respectively\.

### A\.6 AV Interleaved Reasoning Trajectory Synthesis Instruction

To construct the audio\-visual interleaved reasoning trajectories, we design an instruction for GLM\-4\.7 utilizing the generated AVQA pairs and aligned segment\-level captions\. This prompt directs the model to logically integrate these multimodal elements into a cohesive, step\-by\-step reasoning process\. The detailed prompt for this trajectory synthesis is illustrated in Fig\. [12](https://arxiv.org/html/2605.22012#A2.F12)\.

## Appendix B Implementation Details

We set the key hyperparameters for our training process as follows: the maximum number of frames per sample \(FPS\_MAX\_FRAMES\) is capped at 256\. For optimization, the learning rate is set to $10^{-5}$, with a warmup fraction of 0\.05 to gradually ramp it up at the start of training\. The weighting coefficients for the loss terms are configured as $\lambda_{1}=0.005$ and $\lambda_{2}=1.0$\. Furthermore, due to limited computational resources, we restrict the batch size to 1, paired with 12 gradient accumulation steps to maintain an adequate effective batch size for stable optimization\.
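
The reported hyperparameters can be collected into a single configuration, which also makes the effective batch size explicit \(1 × 12 = 12 samples per optimizer step on one device\)\. The Python sketch below is a minimal illustration under several assumptions that this appendix does not confirm: the batch size of 1 is taken to be per device, the number of devices is unknown, the warmup is taken to be linear with a constant rate afterwards, and $\lambda_{1}$ and $\lambda_{2}$ are taken to weight $\mathcal{L}_{\text{latent}}$ and $\mathcal{L}_{\text{sync}}$ respectively, added to a text loss\. The names `TrainConfig`, `warmup_lr`, and `total_loss` are hypothetical\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    max_frames: int = 256          # FPS_MAX_FRAMES
    learning_rate: float = 1e-5    # peak learning rate
    warmup_fraction: float = 0.05  # fraction of steps used for warmup
    lambda_1: float = 0.005
    lambda_2: float = 1.0
    per_device_batch_size: int = 1  # ASSUMPTION: "batch size 1" is per device
    grad_accum_steps: int = 12

    def effective_batch_size(self, num_devices: int = 1) -> int:
        """Samples contributing to one optimizer step."""
        return self.per_device_batch_size * self.grad_accum_steps * num_devices


def warmup_lr(step: int, total_steps: int, cfg: TrainConfig) -> float:
    """Linear warmup to the peak rate, then constant.

    ASSUMPTION: the shape of the schedule after warmup is not given in the text.
    """
    warmup_steps = max(1, int(cfg.warmup_fraction * total_steps))
    return cfg.learning_rate * min(1.0, (step + 1) / warmup_steps)


def total_loss(l_text: float, l_latent: float, l_sync: float, cfg: TrainConfig) -> float:
    """Weighted sum of loss terms.

    ASSUMPTION: lambda_1 scales L_latent and lambda_2 scales L_sync, on top of
    an unweighted text loss; the mapping is not stated in this appendix.
    """
    return l_text + cfg.lambda_1 * l_latent + cfg.lambda_2 * l_sync
```

With gradient accumulation, the optimizer and the learning\-rate schedule advance once every 12 forward passes, so `total_steps` in this sketch counts optimizer steps rather than individual samples\.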

Prompt 1: AV Open\-Ended Question\-Answer Synthesis

Role\. You are an expert multimodal dataset designer specializing in Audio\-Visual Question Answering \(AVQA\)\.

Input\. Temporally aligned audio and video captions describing synchronized events\.

Task Description\.

1. Context Comprehension: Thoroughly analyze the provided audio and video captions to understand the synchronized multimodal events and their temporal correlations\.
2. Open\-Ended Generation: Synthesize exactly one high\-quality, open\-ended question\-answer pair that demands complex, multi\-step cross\-modal reasoning\.
3. Structured Output: Format the final result strictly as a JSON object containing the question, concise answer, and the specific reasoning type employed\.

Hard Constraints\.

- Cross\-Modal Information Dependency: The question must strictly rely on the synthesis of both visual and audio information\. It must be logically impossible to deduce the answer using only a single modality\.
- Reasoning Typology: The generated question must explicitly target a distinct complex reasoning category \(e\.g\., causal reasoning, spatial relations, temporal sequencing, sound\-action attribution, or object interactions\)\.
- Answer Accuracy: The question and its concise answer \(maximum 10 words\) must be factually accurate, concrete, and grounded strictly within the provided caption content, with zero external hallucination\.
- Format and Style: Avoid using object IDs, bounding box labels, timestamps, or raw XML tags in the generated question and answer\.
- Commonsense Integration: Encourage structural world knowledge when appropriate, such as logically linking visible physical actions with expected environmental acoustics\.

Output Format\. Output raw JSON only\. Do not wrap the output in Markdown blocks\.

Reference JSON Schema\.

```
{
  "id": "OpenQA_01",
  "modality": "AV",
  "question": "...",
  "answer": "..."
}
```

Figure 5: Prompt used to synthesize a single complex, open\-ended AVQA pair from temporally aligned audio\-visual captions\.

Prompt 2: AV Multiple\-Choice QA Synthesis

Role\. You are an expert multimodal dataset designer specializing in Audio\-Visual Question Answering \(AVQA\)\.

Input\. Temporally aligned audio and video captions describing synchronized events\.

Task Description\.
1\. Context Comprehension: Thoroughly analyze the provided audio and video captions to understand the synchronized multimodal events and their temporal correlations\.
2\. Multiple\-Choice Generation: Synthesize exactly one high\-quality multiple\-choice question \(MCQ\) that demands complex, multi\-step cross\-modal reasoning\.
3\. Structured Output: Format the final result strictly as a JSON object containing the question, four distinct options, the correct answer, and the specific reasoning type employed\.

Hard Constraints\.
• Cross\-Modal Information Dependency: The question must strictly rely on the synthesis of both visual and audio information\. It must be logically impossible to deduce the answer using only a single modality\.
• Reasoning Typology: The generated question must explicitly target a distinct complex reasoning category \(e\.g\., causal reasoning, spatial relations, temporal sequencing, sound\-action attribution, or object interactions\)\.
• Distractor Quality: The MCQ must contain exactly four options\. The three incorrect options \(distractors\) must be plausible, misleading, and reflect reasonable alternative interpretations, but must be definitively incorrect without any ambiguity or subjectivity\.
• Answer Accuracy: The strictly correct answer must be factually accurate, concrete, and grounded entirely within the provided caption content, with zero external hallucination\.
• Format and Style: Avoid using object IDs, bounding box labels, timestamps, or raw XML tags in the generated question and options\.

Output Format\. Output raw JSON only\. Do not wrap the output in Markdown blocks\.

Reference JSON Schema\.
```
{
  "id": "MCQ_01",
  "modality": "AV",
  "question": "...",
  "options": [
    "A. ...",
    "B. ...",
    "C. ...",
    "D. ..."
  ],
  "answer": "B",
  "answer_text": "B. ..."
}
```
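The output contract of Prompt 2 can be checked mechanically before a synthesized sample is kept. The following is a minimal sketch of such a check, not part of the described pipeline: the function name `validate_mcq` and the specific rules (letter prefixes on options, `answer_text` equal to the selected option) are assumptions read off the reference schema.

```python
import json

# Field names are taken from the reference schema; the checks are illustrative.
REQUIRED_KEYS = {"id", "modality", "question", "options", "answer", "answer_text"}
LETTERS = ("A", "B", "C", "D")


def validate_mcq(raw: str) -> list[str]:
    """Return the problems found in a raw model response; an empty list means it passed."""
    try:
        record = json.loads(raw)  # Markdown-wrapped output fails here, as the prompt intends
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]

    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    options = record["options"]
    if not isinstance(options, list) or len(options) != 4:
        return ["expected exactly four options"]
    if not all(isinstance(option, str) for option in options):
        return ["every option must be a string"]

    problems = []
    for letter, option in zip(LETTERS, options):
        if not option.startswith(f"{letter}. "):
            problems.append(f"option {letter} is not prefixed with '{letter}. '")
    if len(set(options)) != 4:
        problems.append("options are not distinct")

    answer = record["answer"]
    if answer not in LETTERS:
        problems.append("answer is not one of A, B, C, D")
    elif record["answer_text"] != options[LETTERS.index(answer)]:
        problems.append("answer_text does not match the option selected by answer")
    return problems
```

A sample would be retained only when `validate_mcq` returns an empty list. Semantic constraints such as distractor plausibility or cross\-modal dependency cannot be verified this way and still require a model\-based or human check.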

Figure 6: Prompt used to synthesize a single complex, multiple\-choice AVQA pair with grounded distractors and a unique correct answer from temporally aligned captions\.

Prompt 3: AVQA Dataset Quality Evaluation and Classification

Role\. You are an expert evaluator and classifier specializing in Multimodal Large Language Models \(MLLMs\) and Audio\-Visual Question Answering \(AVQA\) datasets\.

Objective\. Perform TWO tasks based on the provided inputs:
1\. Objectively evaluate the quality, rigor, and grounding of the provided Question\-Answer pair\.
2\. Classify the user’s question into one specific AVQA category AND determine its primary modality dependency\.

Input Data\.
• \[Standard AV Caption\]: \{AV\_caption\}
• \[Question\]: \{question\}
• \[Ground Truth Answer\]: \{answer\}

Task 1: QA Quality Evaluation \(1\-5 Scale\)\. Evaluate the provided QA pair across the following 6 dimensions:
1\. Context Utilization & Relevance \(1\-5\): Does the question effectively target the provided modality context? \(5 = strictly relies on necessary AV information; 1 = ignores context or relies on external general knowledge\)\.
2\. Question Difficulty \(1\-5\): How inherently difficult is the question? \(5 = highly complex, multi\-step reasoning or nuanced integration; 1 = simple, shallow factual lookup\)\.
3\. Deductive Requirement \(1\-5\): Does answering the question require genuine logical deduction from the observations? \(5 = requires deep step\-by\-step inference; 1 = pure parroting or trivial text matching\)\.

Task 2: Question Classification & Modality\. First, classify the question into EXACTLY ONE of the following 10 categories: \(1\) Audio\-Visual Joint Perception, \(2\) Action & Behavior Recognition, \(3\) Spatial Layout Understanding, \(4\) Temporal Sequence Understanding, \(5\) Attribute Comparison & Change, \(6\) Counting & Quantification, \(7\) Emotion & Atmosphere Perception, \(8\) Semantic Content Summarization, \(9\) Logical Relation Reasoning, \(10\) Intention & Outcome Prediction\.

Second, determine the primary modality dependency\. Choose EXACTLY ONE:
• AV\-Strong: Requires logically combining visual and auditory cues\.
• Video\-Strong: Relies primarily on visual information\.
• Audio\-Strong: Relies primarily on auditory information\.

Output Format\. Output ONLY a valid JSON object\. Do not include markdown code blocks, conversational text, or explanations outside the JSON structure\.

Reference JSON Schema\.
```
{
  "evaluation": {
    "context_utilization": <int, 1-5>,
    "question_difficulty": <int, 1-5>,
    "deductive_requirement": <int, 1-5>
  },
  "classification": {
    "category_id": <int, 1-10>,
    "category_name": "<string>",
    "modality_dependency": "<AV-Strong|Video-Strong|Audio-Strong>",
    "confidence": <float, 0.0-1.0>,
    "reasoning": "<string, 1-2 sentences>"
  }
}
```

Figure 7: Prompt used to evaluate the intrinsic quality and modality dependency of synthesized AVQA pairs against standard captions\.

Prompt 4: Segment Level Caption Synthesis \(Video\)

Given the provided QA pair, provide a CONCISE yet complete visual\-only description of the video segment that contains relevant information to answer the question\. Limit the description to no more than five sentences and avoid redundant details\. Describe only directly observable visual elements: setting, people, actions, objects, and camera movement\. Do not infer mood, intent, genre, cultural style, or add interpretation, and strictly avoid speculation and evaluative language\.

Figure 8: Prompt used to synthesize a concise video caption focusing exclusively on observable visual elements guided by a specific QA pair\.

Prompt 5: Segment Level Caption Synthesis \(Audio\)

Given the provided QA pair, provide a CONCISE yet complete audio\-only description of the segment that contains relevant sound information to answer the question\. Limit the description to no more than five sentences and avoid redundant details\. State only clearly identifiable sound sources \(e\.g\., music, instruments, environmental noises\)\. If speech is present, accurately report the speaker and the spoken content\. Do not infer mood, intent, genre, or cultural style, and strictly avoid speculation and atmospheric language\.

Figure 9: Prompt used to synthesize a concise audio caption detailing identifiable sounds and speech relevant to a specific QA pair\.

Prompt 6: Segment Level Caption Fusion

You are tasked with fusing the visual caption and audio caption into a single, coherent narrative based on the video content\. Follow these strict rules:
1\. Preserve every single sentence from both the visual caption and audio caption exactly as they appear\.
2\. Do NOT omit or delete any sentence in any way\.
3\. You may reorder the sentences \(from both captions\) to create a logical and temporally accurate sequence that reflects the video’s events\.
4\. Ensure the integrated narrative flows naturally in time with the video, aligning visual actions with corresponding sounds or spoken content\.

Verify before responding: Did I include every sentence from both captions?
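The self\-verification question that closes Prompt 6 \(whether every sentence from both captions was included\) can also be enforced outside the model. The sketch below is one possible post\-hoc check under stated assumptions: the helper names are hypothetical, and the regex splitter is deliberately naive \(it breaks after `.`, `!`, or `?` followed by whitespace, so abbreviations such as "e.g." would be split incorrectly\).

```python
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def missing_sentences(visual: str, audio: str, fused: str) -> list[str]:
    """Return source sentences that do not appear verbatim in the fused caption.

    Order is ignored because the prompt allows reordering; counts are respected,
    so a sentence that occurs twice in the sources must occur twice in the fusion.
    """
    required = Counter(split_sentences(visual) + split_sentences(audio))
    present = Counter(split_sentences(fused))
    return sorted((required - present).elements())


visual = "A man opens a door. He walks into a kitchen."
audio = "A door creaks. Footsteps echo on tile."
fused = "A man opens a door. A door creaks. He walks into a kitchen. Footsteps echo on tile."
print(missing_sentences(visual, audio, fused))  # no sentence was dropped or altered
```

A fused caption that paraphrases or merges sentences is flagged, since the rule requires exact preservation. Whether the reordering is temporally accurate is outside the scope of this check.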

Figure 10: Prompt used to fuse the segment\-level video caption and audio caption\.

Prompt 7: Segment Caption Refinement

Task: Refine a fragmented video segment caption using the full video caption as a reference\.

Inputs:
• \[Full Caption\]: \[full\_AV\_caption\]
• \[Segment Caption\]: \[segment\_AV\_caption\]

Instructions: Compare the \[Segment Caption\] with the \[Full Caption\] to fix shot fragmentation\. Resolve incomplete actions, abrupt cuts, or missing subjects caused by rigid segmentation by accurately contextualizing actions that span across boundaries\. Do not describe events outside the segment’s timestamp\. Keep the refined caption concise, strictly objective, and output ONLY the refined text without any formatting tags or explanations\.

Figure 11: Prompt used to refine fragmented segment captions by cross\-referencing full captions to address shot fragmentation and maintain temporal constraints\.

Prompt 8: AV Interleaved Reasoning Trajectory Synthesis

Role\. You are an intelligent audio\-visual analysis expert with advanced perception capabilities\. Your task is to answer the user’s question by rigorously analyzing the audio and video content step by step\.

Critical Context\. You will be provided with specific segments and detailed descriptive content\. You must treat this detailed content as your own direct sensory perception \(sight and hearing\)\. You are NOT analyzing a text; you are describing the raw video and audio you are "watching"\.

Strict Prohibitions\.
• Never mention words such as “caption”, “text”, “description”, “reference”, “provided info”, or “input”\.
• Never say “according to the description” or “the text says”\.
• Never discuss discrepancies between visual or audio evidence and text; instead, state the perceived event as the ground truth\.

Reasoning Instructions\.
1\. Construct a step\-by\-step reasoning chain that actively decides when source segments must be revisited\.
2\. When a segment is needed, cite it using the format \[Segment n\]\.
3\. Cite at least one segment and at most three segments in total, favoring the earliest or most decisive evidence\.
4\. Continue the reasoning without explicitly referring back to segment numbers in the narrative text\.

Final Answer Requirement\. End the response with a concise boxed answer in the format \\boxed\{your final answer here\}, with no extra commentary afterward\.

Reference Output Pattern\. Reason \-\> \[Segment n\] \-\> Reason \-\> … \-\> \\boxed\{final answer\}

Figure 12: Prompt used to synthesize interleaved reasoning trajectories with explicit segment citations and concise grounded observations\.
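
The formatting rules of Prompt 8 \(segment citations written as \[Segment n\], between one and three citations, and a closing boxed answer with nothing after it\) lend themselves to an automatic check on each synthesized trajectory. The following sketch illustrates one way to do this; the function name and the returned dictionary layout are hypothetical, and it assumes that the one\-to\-three limit counts citation occurrences rather than distinct segments.

```python
import re

# Patterns follow the citation and answer formats named in Prompt 8.
SEGMENT_RE = re.compile(r"\[Segment (\d+)\]")
# Matches a \boxed{...} that closes the response; nested braces are not handled.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}\s*$")


def check_trajectory(text: str) -> dict:
    """Extract cited segments and the boxed answer, and list rule violations."""
    segments = [int(n) for n in SEGMENT_RE.findall(text)]
    boxed = BOXED_RE.search(text)

    problems = []
    # Assumption: the limit applies to citation occurrences, not distinct segments.
    if not 1 <= len(segments) <= 3:
        problems.append(f"expected 1 to 3 segment citations, found {len(segments)}")
    if boxed is None:
        problems.append("response does not end with a boxed answer")

    return {
        "segments": segments,
        "answer": boxed.group(1).strip() if boxed else None,
        "problems": problems,
    }
```

The extracted `segments` list gives the positions at which a trajectory revisits the source evidence, and `answer` can be compared against the ground\-truth answer of the QA pair.
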
## Appendix C Case Study

In this section, we present examples from the DailyOmni benchmark to demonstrate the reasoning capabilities of LatentOmni across diverse audio\-visual tasks\. For clarity, we do not project the generated latent embeddings into the discrete language space, as this would yield uninterpretable tokens\. Instead, we represent the latent reasoning segments using <Unified\_Latent\><latent\_embeddings\></Unified\_Latent\>\. The selected examples cover three representative scenarios: AV Event Alignment \(Figure [13](https://arxiv.org/html/2605.22012#A4.F13)\), Inference \(Figure [14](https://arxiv.org/html/2605.22012#A4.F14)\), and Reasoning \(Figure [15](https://arxiv.org/html/2605.22012#A4.F15)\)\.

To further analyze the model’s intrinsic reasoning process within the latent space, we concurrently visualize the attention maps between the latent reasoning tokens and the original audio\-visual inputs alongside the generated outputs\. These visualizations intuitively illustrate how the latent states dynamically track and anchor to fine\-grained multimodal evidence during generation\. These qualitative analyses highlight the model’s capability to preserve and reason over continuous multimodal semantics throughout the inference process\.
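
As a rough illustration of the quantity such an attention map displays, the sketch below computes scaled dot\-product attention from a set of latent\-token query vectors to a sequence of input\-frame key vectors and averages the result over the latent tokens, giving one weight per input position. The vectors, the single\-head formulation, and the averaging step are all assumptions made for illustration; this section does not specify which layers or heads are visualized or how they are aggregated.

```python
import math


def attention_over_inputs(latent_queries, input_keys):
    """Average softmax attention from latent-token queries to input-frame keys.

    latent_queries: list of query vectors, one per latent reasoning token.
    input_keys: list of key vectors, one per audio or video input position.
    Returns one weight per input position; the weights sum to 1.
    """
    dim = len(input_keys[0])
    rows = []
    for query in latent_queries:
        # Scaled dot-product scores between this query and every key.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in input_keys
        ]
        # Numerically stable softmax over input positions.
        peak = max(scores)
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        rows.append([value / total for value in exps])
    # Aggregate across latent tokens by simple averaging (an assumed choice).
    count = len(rows)
    return [sum(row[j] for row in rows) / count for j in range(len(input_keys))]
```

Plotting the returned weights against the input positions yields a one\-dimensional heat map in which larger values mark the frames that the latent tokens attend to most.
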

## Appendix D Limitation

While LatentOmni establishes a robust framework for unified latent reasoning across visual, auditory, and textual modalities, it inherently shares a common boundary with current state\-of\-the\-art multimodal systems regarding modality coverage\. Real\-world environments are fundamentally more complex, encompassing a broader spectrum of sensory and control signals, such as 3D spatial representations, tactile physics, and motor action commands\. Currently, mapping these extended physical and interactive signals into a single unified latent space remains an open challenge for the community\. In future work, we aim to explore the expansion of our latent semantic bridge to accommodate a wider array of heterogeneous modalities, ultimately taking a step towards a more comprehensive and embodied omni\-modal reasoning system\.

![Refer to caption](https://arxiv.org/html/2605.22012v1/x5.png)
Figure 13: LatentOmni example: AV Event Alignment\. LatentOmni accurately anchors task\-relevant audio\-visual frames within the latent space\. As demonstrated by the attention visualization, deeper colors indicate higher attention weights precisely localized on the key multimodal clues\.

![Refer to caption](https://arxiv.org/html/2605.22012v1/x6.png)
Figure 14: LatentOmni example: Inference\. LatentOmni accurately anchors task\-relevant audio\-visual frames within the latent space\. As demonstrated by the attention visualization, deeper colors indicate higher attention weights precisely localized on the key multimodal clues\.

![Refer to caption](https://arxiv.org/html/2605.22012v1/x7.png)
Figure 15: LatentOmni example: Reasoning\. LatentOmni accurately anchors task\-relevant audio\-visual frames within the latent space\. As demonstrated by the attention visualization, deeper colors indicate higher attention weights precisely localized on the key multimodal clues\.
