Native Active Perception as Reasoning for Omni-Modal Understanding

Summary

Introduces OmniAgent, an omni-modal agent that uses an iterative Observation-Thought-Action cycle with active perception to achieve superior long video understanding, outperforming larger models like Qwen2.5-VL-72B on benchmarks.

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).
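
The Observation-Thought-Action cycle with a persistent textual memory can be illustrated with a toy loop. This is only a sketch under loud assumptions: `ToyVideo`, `EpisodeResult`, `run_episode`, the caption-per-segment "video", and the scan-in-order policy are all hypothetical stand-ins, not OmniAgent's actual actions, policy, or API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVideo:
    """Hypothetical stand-in for a long video: one caption per segment."""
    captions: list

    def inspect(self, index):
        # On-demand perception: only the requested segment is looked at.
        return self.captions[index]


@dataclass
class EpisodeResult:
    answer: object                               # segment index, or None if the budget ran out
    memory: list = field(default_factory=list)   # persistent textual memory


def run_episode(video, keyword, max_turns):
    """Toy Observation-Thought-Action loop with a fixed turn budget."""
    memory = []
    for turn in range(min(max_turns, len(video.captions))):
        # Action: pick the next segment (assumed toy policy: scan in order).
        observation = video.inspect(turn)
        # Thought: distill the raw observation into a short textual note.
        hit = keyword in observation
        memory.append(f"segment {turn}: {'hit' if hit else 'miss'}")
        if hit:
            return EpisodeResult(answer=turn, memory=memory)
    return EpisodeResult(answer=None, memory=memory)


# A 100-segment "video" whose relevant cue sits in segment 3.
video = ToyVideo(captions=["street"] * 3 + ["dog barks"] + ["street"] * 96)
short = run_episode(video, "dog", max_turns=2)   # budget too small to reach the cue
longer = run_episode(video, "dog", max_turns=8)  # budget large enough to reach it
```

In this toy setting the memory holds one short note per turn taken, so its size depends on the number of turns rather than on the 100 segments in the video. The larger turn budget finds the cue where the smaller one does not, which mirrors the idea of test-time scaling but is not evidence for the paper's reported results.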

Cached at: 06/18/26, 07:55 AM

Source: https://huggingface.co/papers/2606.19341

Abstract

OmniAgent is a novel omni-modal agent that addresses long video understanding by using an iterative observation-thought-action cycle with active perception, achieving superior performance compared to larger models through efficient selective processing.

Passive models for long video understanding typically rely on a “watch-it-all” paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).
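
The abstract does not give TAURA's formula, so the following is not the paper's method. It is a minimal sketch of one way turn-level entropy could rescale a trajectory-level advantage, assuming (hypothetically) that higher-entropy turns receive more credit and that weights are normalized to mean 1. The names `token_entropy`, `turn_entropies`, and `rescale_advantage` are illustrative.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def turn_entropies(turns):
    """Mean token entropy per turn; each turn is a list of token distributions."""
    return [sum(token_entropy(d) for d in turn) / len(turn) for turn in turns]


def rescale_advantage(advantage, entropies, eps=1e-8):
    """Spread one trajectory-level advantage over turns, weighted by entropy.

    Assumption (not from the paper): weight = turn entropy / mean entropy,
    so weights average to about 1 and uncertain turns get a larger share.
    """
    mean_h = sum(entropies) / len(entropies)
    return [advantage * h / (mean_h + eps) for h in entropies]


# Two turns, two tokens each: a confident turn and an uncertain turn.
turns = [
    [[0.99, 0.01], [0.98, 0.02]],  # low entropy: the policy is nearly certain
    [[0.5, 0.5], [0.6, 0.4]],      # high entropy: the policy is unsure
]
entropies = turn_entropies(turns)
per_turn = rescale_advantage(1.0, entropies)
```

Normalizing by the mean entropy keeps the average per-turn advantage close to the original trajectory-level value, so the rescaling redistributes credit across turns without changing its overall magnitude.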

Models citing this paper: 2

#### harryhsing/OmniAgent-RL-7B
Video-Text-to-Text • 9B • Updated about 5 hours ago

#### harryhsing/OmniAgent-SFT-7B
Video-Text-to-Text • 11B • Updated about 5 hours ago

Similar Articles