Native Active Perception as Reasoning for Omni-Modal Understanding
Summary
Introduces OmniAgent, an omni-modal agent that uses an iterative Observation-Thought-Action cycle with active perception for long video understanding. On LVBench, the 7B agent outperforms the much larger Qwen2.5-VL-72B (50.5% vs. 47.3%).
Cached at: 06/18/26, 07:55 AM
Source: https://huggingface.co/papers/2606.19341
Abstract
OmniAgent is a novel omni-modal agent that addresses long video understanding by using an iterative observation-thought-action cycle with active perception, achieving superior performance compared to larger models through efficient selective processing.
Passive models for long video understanding typically rely on a “watch-it-all” paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).
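The Observation-Thought-Action cycle described in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the paper's implementation: `TextMemory`, `Action`, `toy_policy`, `toy_perceive`, and `run_agent` are hypothetical names, the "video" is a dictionary of pre-written segment descriptions, and a keyword match stands in for the model's reasoning. It only shows the control flow in which the agent inspects segments on demand and keeps short text notes instead of raw frames.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Segment = Tuple[int, int]  # (start_sec, end_sec); hypothetical representation


@dataclass
class TextMemory:
    """Persistent textual memory: only short notes are stored, never raw frames."""
    notes: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.notes.append(note)


@dataclass
class Action:
    kind: str                          # "inspect" or "answer"
    segment: Optional[Segment] = None  # set when kind == "inspect"
    answer: Optional[str] = None       # set when kind == "answer"


def toy_policy(cue: str, memory: TextMemory, unseen: List[Segment]) -> Action:
    """Hypothetical stand-in for the Thought step (a real agent would use a model)."""
    for note in memory.notes:
        if cue in note:                # enough evidence in memory: stop and answer
            return Action("answer", answer=note)
    if unseen:                         # otherwise look at the next unseen segment
        return Action("inspect", segment=unseen[0])
    return Action("answer", answer="unknown")


def toy_perceive(video: Dict[Segment, str], segment: Segment) -> str:
    """Hypothetical perception tool: distills one segment into a text cue."""
    return f"{segment[0]}-{segment[1]}s: {video[segment]}"


def run_agent(cue: str, video: Dict[Segment, str], max_turns: int = 8):
    """Run the loop; returns (answer, turns_used, memory)."""
    memory = TextMemory()
    unseen = sorted(video)             # segments not yet inspected
    for turn in range(1, max_turns + 1):
        action = toy_policy(cue, memory, unseen)             # Thought
        if action.kind == "answer":
            return action.answer, turn, memory
        observation = toy_perceive(video, action.segment)    # Action -> Observation
        memory.add(observation)                              # update textual memory
        unseen.remove(action.segment)
    return "unknown", max_turns, memory


video = {
    (0, 60): "a street with pedestrians",
    (60, 120): "a red car parks",
    (120, 180): "rain starts",
}
answer, turns, memory = run_agent("red car", video)
print(answer, turns, len(memory.notes))
```

In this toy run the agent stops once its memory contains the cue, so the last segment is never inspected. That is the sense in which the reasoning context depends on the number of notes gathered rather than on the length of the video.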
Models citing this paper: 2
harryhsing/OmniAgent-RL-7B • Video-Text-to-Text • 9B • Updated about 5 hours ago
harryhsing/OmniAgent-SFT-7B • Video-Text-to-Text • 11B • Updated about 5 hours ago
Similar Articles
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
LatentOmni proposes a unified latent space for audio-visual reasoning, avoiding the information loss of text-based chain-of-thought. It achieves state-of-the-art performance among open-source models on audio-visual reasoning benchmarks.
X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction
This technical report introduces X-OmniClaw, a unified mobile agent system designed for multimodal understanding and interaction on Android devices. It details the architecture for perception, memory management, and action execution using on-device AI capabilities.
Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning
Visual-Seeker proposes a visual-native multimodal deep search agent that actively reasons over fine-grained visual details and synthesizes multimodal evidence, achieving state-of-the-art performance on five challenging multimodal search benchmarks.
OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
OmniPro is the first benchmark for evaluating proactive streaming video understanding in omni-modal large language models, featuring 2,700 samples covering diverse tasks and dual-mode evaluation protocols.
Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning
Visual Para-Thinker++ proposes a single-policy multi-agent framework for visual reasoning that uses role-conditioned agents (Main, Worker, Summary) and dedicated training methods to reduce hallucinations and improve efficiency, outperforming baselines on hallucination-sensitive benchmarks.