OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants
Summary
OmniInteract introduces a streaming benchmark for real-time omnimodal LLMs, evaluating online audio-visual processing with temporal grounding and interactive response requirements. Experiments show that current models perform poorly, with the best overall IA-QTF1 score reaching only 0.368.
Source: https://huggingface.co/papers/2605.26485
Abstract
OmniInteract presents a streaming benchmark for real-time omnimodal large language models that evaluates online audio-visual processing with temporal grounding and interactive response requirements.
We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
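The abstract says each slot carries a trigger, a response window, and a target answer, and that timing and invalid outputs are both scored. A minimal Python sketch of that idea follows. It is an illustration only: the `ResponseSlot` fields, the `is_timely` and `tally` helpers, the use of seconds, and the one-response-per-slot rule (which oversimplifies 1QnA slots) are all assumptions, not the benchmark's actual schema and not the IA-QTF1 metric.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Slot counts as stated in the abstract.
NUM_1Q1A_SLOTS = 1062
NUM_1QNA_SLOTS = 368
TOTAL_SLOTS = NUM_1Q1A_SLOTS + NUM_1QNA_SLOTS  # 1430


@dataclass
class ResponseSlot:
    """Hypothetical record for one temporally grounded response slot."""
    kind: str                    # "1Q1A" or "1QnA", as named in the abstract
    trigger_time: float          # assumed: seconds into the stream
    window: Tuple[float, float]  # assumed: (start, end) of the response window
    target_answer: str


def is_timely(slot: ResponseSlot, response_time: float) -> bool:
    """Return True if the response lands inside the slot's window (inclusive)."""
    start, end = slot.window
    return start <= response_time <= end


def tally(slots: List[ResponseSlot], response_times: List[float]) -> Dict[str, int]:
    """Count answered slots, missed slots, and out-of-window responses.

    Simplification: each slot accepts at most one response, and any
    response that matches no open slot is counted as invalid.
    """
    answered = set()
    invalid = 0
    for t in sorted(response_times):
        hit = next(
            (i for i, s in enumerate(slots)
             if i not in answered and is_timely(s, t)),
            None,
        )
        if hit is None:
            invalid += 1
        else:
            answered.add(hit)
    return {
        "answered": len(answered),
        "missed": len(slots) - len(answered),
        "invalid": invalid,
    }


if __name__ == "__main__":
    demo_slots = [
        ResponseSlot("1Q1A", 5.0, (5.0, 10.0), "a"),
        ResponseSlot("1Q1A", 20.0, (20.0, 25.0), "b"),
    ]
    print(tally(demo_slots, [6.0, 7.0, 30.0]))
```

In the demo, the response at 6.0 s fills the first slot, while the responses at 7.0 s and 30.0 s match no open slot and are counted as invalid. This kind of behavior is why a streaming benchmark has to score when a model speaks as well as what it says.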
Datasets citing this paper: 1
lucky-lance/OmniInteract (updated about 4 hours ago)
Similar Articles
Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction
This paper introduces Omni-DuplexEval, a benchmark and automatic evaluation framework for real-time duplex interaction in multimodal large language models, assessing continuous response generation and proactive event detection in streaming scenarios.
OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments
OmniGUI introduces a step-level benchmark for GUI agents that integrates static images, synchronous audio, and video clips to simulate real smartphone interactions. Evaluation shows current models struggle with temporal and auditory inputs, highlighting the need for omni-modal capabilities.
OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
OmniPro is the first benchmark for evaluating proactive streaming video understanding in omni-modal large language models, featuring 2,700 samples covering diverse tasks and dual-mode evaluation protocols.
OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs
OVO-S-Bench introduces a comprehensive human-annotated benchmark of 1,680 questions across 348 videos to evaluate streaming spatial intelligence in multimodal LLMs, revealing that even the best model (Gemini-3.1-Pro) trails human experts by 27 points. The benchmark exposes key limitations including allocentric mapping as a major bottleneck and chain-of-thought reasoning amplifying spatial errors.
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
This paper introduces Omni-Persona, the first comprehensive benchmark for omnimodal personalization across text, image, and audio, featuring a Persona Modality Graph and a new Calibrated Accuracy metric to evaluate grounding behaviors.