LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
Summary
LatentOmni proposes a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states, outperforming explicit text-based chain-of-thought methods in audio-visual reasoning tasks.
Cached at: 05/22/26, 06:27 AM
Source: https://huggingface.co/papers/2605.22012
Abstract
LatentOmni is a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states, using feature-level supervision and a position embedding that keeps audio and visual latents temporally consistent. It outperforms explicit text-based chain-of-thought approaches on audio-visual reasoning tasks.
Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.
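The abstract names two mechanisms, feature-level supervision and a time-synchronized position embedding, without giving their formulation. The sketch below is only an illustration of the general ideas and is not the paper's implementation: it assumes a cosine-distance alignment loss between latent states and target features, and it assumes that audio and visual latents receive position ids derived from their timestamps so that tokens covering the same instant share an index. The function names, the `ticks_per_second` parameter, and the choice of cosine distance are all hypothetical.

```python
import math


def sync_position_ids(timestamps_s, start_pos, ticks_per_second=4.0):
    """Map non-negative token timestamps (seconds) to temporal position ids.

    Hypothetical scheme: tokens from different modalities that share a
    timestamp get the same id, regardless of each modality's frame rate.
    """
    return [start_pos + int(t * ticks_per_second + 0.5) for t in timestamps_s]


def cosine_alignment_loss(latents, targets, eps=1e-8):
    """Mean (1 - cosine similarity) over paired latent/target vectors.

    Assumed stand-in for a feature-level supervision objective.
    """
    if not latents or len(latents) != len(targets):
        raise ValueError("latents and targets must be non-empty and paired")
    total = 0.0
    for z, f in zip(latents, targets):
        dot = sum(a * b for a, b in zip(z, f))
        norm_z = math.sqrt(sum(a * a for a in z))
        norm_f = math.sqrt(sum(b * b for b in f))
        total += 1.0 - dot / (norm_z * norm_f + eps)
    return total / len(latents)


# Video latents sampled at 2 Hz and audio latents at 4 Hz over one second.
video_ids = sync_position_ids([0.0, 0.5, 1.0], start_pos=10)
audio_ids = sync_position_ids([0.0, 0.25, 0.5, 0.75, 1.0], start_pos=10)

# A latent that points in the same direction as its target incurs ~0 loss;
# an orthogonal one incurs a loss of ~1.
aligned = cosine_alignment_loss([[1.0, 0.0]], [[2.0, 0.0]])
orthogonal = cosine_alignment_loss([[1.0, 0.0]], [[0.0, 3.0]])
```

Under these assumptions, the video latent at 0.5 s and the audio latent at 0.5 s both land on position 12, which is one simple way to keep the two streams aligned in time inside a single autoregressive sequence.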
Similar Articles
HELP WITH RESEARCH: Observation - Semantically Dense Context Produces Strong Late-Layer Divergence Without Jailbreak Prompts [D]
An empirical study demonstrating that long, semantically dense, benign text can shift a model's latent space and bypass alignment, causing it to generate otherwise blocked critiques. The author, a non-expert, requests an audit of their metrics to distinguish genuine semantic hijacking from artifacts.
ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection
ThinkDeception proposes a novel framework that leverages multimodal large language models and a progressive reinforcement learning strategy with chain-of-thought reasoning for interpretable deception detection, achieving new state-of-the-art results on standard benchmarks.
CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework
CaVe-VLM-CoT is a modular reflection-based agentic-RAG framework for vision-language models that enforces evidence-grounded reasoning through a five-stage pipeline, achieving 87.1% accuracy on ScienceQA and proposing a suite of 23 metrics for evaluation.
@grapeot: Reasoning models aren't the bombshell of 2024. Many people, upon first seeing o1 "think" for over ten seconds before answering, felt that models had suddenly learned to reason overnight. But stretching out the timeline, from CoT prompting (2022) to o1, a full four years passed in between. Three things often conflated: 1. Reasoning ability itself—already amplified by CoT systems in 2022 2. Training reasoning via reinforcement learning—academic prototypes of PRM existed in 2023 3. Turning reasoning into a billable, schedulable resource—this is the real watershed of 2024.
A deep retrospective on the four-year evolution of reasoning models from CoT in 2022 to o1/R1 in 2024, pointing out that the true watershed is not the emergence of reasoning ability, but the conversion of reasoning into a billable, schedulable resource.
Investigating Implicit Latent Trajectory Shifts: Bypassing Alignment via Long-Form Coherent Context
An empirical study investigating how long, semantically dense benign text can shift a model's latent space trajectory, diluting initial system prompts and bypassing post-training alignment constraints, as observed in both closed and open-source models.