OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

Hugging Face Daily Papers

Summary

OmniInteract introduces a streaming benchmark for real-time omnimodal LLMs, evaluating online audio-visual processing with temporal grounding and interactive response requirements. Experiments show that current models perform poorly, with the best overall IA-QTF1 score reaching only 0.368.

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
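To make the slot structure concrete, here is a minimal Python sketch of a response slot (trigger, response window, target answer) and a timing-aware F1 computed over slots. It is an illustration under stated assumptions, not the paper's IA-QTF1: the abstract does not give the formula, so the window-matching rule, the exact-match `toy_quality` judge, and all names (`ResponseSlot`, `ModelResponse`, `timing_aware_f1`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResponseSlot:
    """One temporally grounded slot, as described in the abstract."""
    trigger_s: float       # when the trigger occurs (seconds into the stream)
    window_start_s: float  # start of the allowed response window
    window_end_s: float    # end of the allowed response window
    target_answer: str


@dataclass
class ModelResponse:
    time_s: float  # when the model emitted the response
    text: str


def toy_quality(pred: str, target: str) -> float:
    """Hypothetical stand-in for an answer judge: exact match scores 1.0."""
    return 1.0 if pred.strip().lower() == target.strip().lower() else 0.0


def timing_aware_f1(slots: List[ResponseSlot],
                    responses: List[ModelResponse]) -> float:
    """Illustrative quality-and-timing F1 (assumed, NOT the paper's definition).

    A response earns credit only if it falls inside a slot's window; the
    credit equals its quality score. Responses outside every window count
    against precision, and unanswered slots count against recall.
    """
    credit = 0.0
    used = set()
    for slot in slots:
        for i, resp in enumerate(responses):
            if i in used:
                continue
            if slot.window_start_s <= resp.time_s <= slot.window_end_s:
                used.add(i)
                credit += toy_quality(resp.text, slot.target_answer)
                break
    precision = credit / len(responses) if responses else 0.0
    recall = credit / len(slots) if slots else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: two slots, three responses.
slots = [
    ResponseSlot(2.0, 2.0, 5.0, "a red mug"),
    ResponseSlot(10.0, 10.0, 14.0, "turn left"),
]
responses = [
    ModelResponse(3.0, "A red mug"),  # in window, correct
    ModelResponse(7.0, "hello"),      # outside every window (invalid output)
    ModelResponse(11.0, "go back"),   # in window, wrong answer
]
score = timing_aware_f1(slots, responses)
print(round(score, 3))  # 0.4

# Slot counts reported in the abstract: 1Q1A plus 1QnA equals the total.
assert 1062 + 368 == 1430
```

In this toy case precision is 1/3 and recall is 1/2, giving an F1 of 0.4. The benchmark's real metric additionally accounts for interruption handling and context continuity, which this sketch does not model.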

Cached at: 05/29/26, 07:01 AM

Source: https://huggingface.co/papers/2605.26485

Datasets citing this paper: 1

lucky-lance/OmniInteract

Similar Articles

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

Hugging Face Daily Papers

This paper introduces Omni-DuplexEval, a benchmark and automatic evaluation framework for real-time duplex interaction in multimodal large language models, assessing continuous response generation and proactive event detection in streaming scenarios.

OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

Hugging Face Daily Papers

OmniGUI introduces a step-level benchmark for GUI agents that integrates static images, synchronous audio, and video clips to simulate real smartphone interactions. Evaluation shows current models struggle with temporal and auditory inputs, highlighting the need for omni-modal capabilities.

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Papers with Code Trending

OVO-S-Bench introduces a comprehensive human-annotated benchmark of 1,680 questions across 348 videos to evaluate streaming spatial intelligence in multimodal LLMs, revealing that even the best model (Gemini-3.1-Pro) trails human experts by 27 points. The benchmark exposes key limitations including allocentric mapping as a major bottleneck and chain-of-thought reasoning amplifying spatial errors.