LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Hugging Face Daily Papers

Summary

LatentOmni proposes a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states, outperforming explicit text-based chain-of-thought methods in audio-visual reasoning tasks.

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.
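The feature-level supervision mentioned above can be pictured as a loss that pulls each latent reasoning state toward a target sensory feature. The summary does not state the exact objective, so the following is only a minimal sketch under an assumed cosine-distance formulation; the function names `cosine_similarity` and `feature_alignment_loss` are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def feature_alignment_loss(latent_states, target_features):
    """Mean cosine distance between latent states and their target features.

    ASSUMPTION: cosine distance is an illustrative choice only; the source
    does not say which distance LatentOmni actually uses.
    """
    if len(latent_states) != len(target_features):
        raise ValueError("each latent state needs exactly one target feature")
    if not latent_states:
        return 0.0
    distances = [
        1.0 - cosine_similarity(h, f)
        for h, f in zip(latent_states, target_features)
    ]
    return sum(distances) / len(distances)
```

Under this assumed formulation, the loss is 0 when every latent state points in the same direction as its target feature and grows toward 2 as they point in opposite directions.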

Cached at: 05/22/26, 06:27 AM

Source: https://huggingface.co/papers/2605.22012

Abstract

LatentOmni is a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states using feature-level supervision and temporal consistency embedding, outperforming explicit text-based chain-of-thought approaches in audio-visual reasoning tasks.

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.
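The abstract says OSPE keeps latent audio and visual states temporally consistent but does not describe how. One way to picture the general idea is to give audio and visual tokens that fall in the same time window the same temporal index. The sketch below illustrates only that idea; the helper names, the millisecond timestamps, and the 500 ms window are all assumptions for illustration and do not describe the paper's actual embedding.

```python
def shared_temporal_ids(timestamps_ms, step_ms=500):
    """Map integer timestamps (ms) to indices on a time grid shared by both modalities.

    ASSUMPTION: a fixed-width grid is a hypothetical stand-in for whatever
    synchronization scheme OSPE really uses.
    """
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    return [t // step_ms for t in timestamps_ms]


def interleave_by_time(audio_ms, video_ms, step_ms=500):
    """Return (temporal_id, modality, token_index) tuples ordered by temporal id."""
    tagged = [
        (tid, "audio", i)
        for i, tid in enumerate(shared_temporal_ids(audio_ms, step_ms))
    ]
    tagged += [
        (tid, "video", i)
        for i, tid in enumerate(shared_temporal_ids(video_ms, step_ms))
    ]
    # sorted() is stable, so audio tokens stay ahead of video tokens
    # within the same time window.
    return sorted(tagged, key=lambda item: item[0])
```

For example, calling `interleave_by_time([0, 250, 500, 750], [0, 500])` groups the first two audio tokens with the first video frame, because all three fall in the same 500 ms window and therefore share one temporal index.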
