OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
Summary
OmniPro is the first benchmark for evaluating proactive streaming video understanding in omni-modal large language models, featuring 2,700 samples covering diverse tasks and dual-mode evaluation protocols.
Cached at: 05/22/26, 10:19 AM
Source: https://huggingface.co/papers/2605.18577
Abstract
OmniPro is introduced as the first benchmark for evaluating omni-modal large language models’ proactive streaming video understanding, featuring diverse tasks and dual-mode evaluation protocols.
Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.
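The difference between the two modes can be illustrated with a small Python sketch. This is not the benchmark's implementation: the abstract does not give probe offsets, timing tolerances, or scoring rules, so `Trigger`, `probe_times`, `match_online_responses`, `offset_s`, and `tolerance_s` are all hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trigger:
    """A ground-truth moment (seconds) at which a response becomes warranted."""
    time_s: float


def probe_times(triggers: List[Trigger], offset_s: float = 1.0) -> List[Tuple[float, float]]:
    """Probe-mode sketch: one query time before and one after each trigger.

    offset_s is an assumed value, not taken from the paper.
    """
    return [(max(0.0, t.time_s - offset_s), t.time_s + offset_s) for t in triggers]


def match_online_responses(
    triggers: List[Trigger],
    response_times: List[float],
    tolerance_s: float = 2.0,
) -> Tuple[int, int, int]:
    """Online-mode sketch: match the model's self-chosen response times to triggers.

    A response counts as a hit if it falls in [trigger, trigger + tolerance_s].
    The matching rule and tolerance_s are assumptions for illustration.
    Returns (hits, missed_triggers, unmatched_responses).
    """
    remaining = sorted(response_times)
    hits = 0
    for t in sorted(triggers, key=lambda tr: tr.time_s):
        for i, r in enumerate(remaining):
            if t.time_s <= r <= t.time_s + tolerance_s:
                hits += 1
                del remaining[i]  # each response can satisfy at most one trigger
                break
    return hits, len(triggers) - hits, len(remaining)
```

In Probe mode the evaluator picks the query times, so only the content of the answer is under test. In Online mode the model picks its own response times, so both timing and content matter; the sketch above covers only the timing side.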
Datasets citing this paper: 1
RuixiangZhao/OmniPro
Similar Articles
Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction
This paper introduces Omni-DuplexEval, a benchmark and automatic evaluation framework for real-time duplex interaction in multimodal large language models, assessing continuous response generation and proactive event detection in streaming scenarios.
OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants
OmniInteract introduces a streaming benchmark for real-time omnimodal LLMs, evaluating online audio-visual processing with temporal grounding and interactive response requirements. Experiments show that current models perform poorly, with the best overall IA-QTF1 score reaching only 0.368.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
This paper presents OmniClean, a visually debiased evaluation benchmark for omni-modal language models, and proposes OmniBoost, a three-stage post-training recipe that enables a 3B model to match the performance of a 30B model on the cleaned benchmark.
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
LatentOmni proposes a unified latent space for audio-visual reasoning, avoiding the information loss of text-based chain-of-thought. It achieves state-of-the-art performance among open-source models on audio-visual reasoning benchmarks.
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
This paper introduces Omni-Persona, the first comprehensive benchmark for omnimodal personalization across text, image, and audio, featuring a Persona Modality Graph and a new Calibrated Accuracy metric to evaluate grounding behaviors.