OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

Hugging Face Daily Papers 05/26/26, 12:00 AM Papers

benchmarking streaming-interaction omnimodal real-time audio-visual llm-evaluation temporal-grounding

Summary

OmniInteract introduces a streaming benchmark for real-time omnimodal LLMs, evaluating online audio-visual processing with temporal grounding and interactive response requirements. Experiments show that current models perform poorly, with the best overall IA-QTF1 score reaching only 0.368.

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
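To make the slot structure concrete, here is a minimal Python sketch of one response slot and a window check. The `ResponseSlot` class, its field names, and the `is_timely` helper are hypothetical and not taken from the OmniInteract release; the benchmark's actual schema and its IA-QTF1 scoring may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseSlot:
    """Hypothetical representation of one temporally grounded response slot."""
    slot_type: str        # "1Q1A" or "1QnA" (labels used in the abstract)
    trigger_time: float   # assumed: seconds into the stream when the trigger occurs
    window_start: float   # assumed: earliest acceptable response time, in seconds
    window_end: float     # assumed: latest acceptable response time, in seconds
    target_answer: str    # reference answer for the slot


def is_timely(slot: ResponseSlot, response_time: float) -> bool:
    """Return True if a response lands inside the slot's response window."""
    return slot.window_start <= response_time <= slot.window_end


# Slot counts reported in the abstract: 1,062 1Q1A + 368 1QnA = 1,430 in total.
SLOT_COUNTS = {"1Q1A": 1062, "1QnA": 368}
TOTAL_SLOTS = sum(SLOT_COUNTS.values())

# Made-up example slot, for illustration only (not from the dataset).
example = ResponseSlot(
    slot_type="1Q1A",
    trigger_time=12.0,
    window_start=12.0,
    window_end=17.0,
    target_answer="The kettle is boiling.",
)
```

Under these assumptions, `is_timely(example, 14.5)` holds while a response at 20.0 seconds falls outside the window. A real scorer would also have to judge answer correctness and invalid outputs, which this sketch leaves out.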

Original Article

Cached at: 05/29/26, 07:01 AM

Source: https://huggingface.co/papers/2605.26485

Abstract

OmniInteract presents a streaming benchmark for real-time omnimodal large language models that evaluates online audio-visual processing with temporal grounding and interactive response requirements.

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
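The online constraint (no access to future content) can be pictured as a loop that sees one chunk at a time and decides at each step whether to respond. The sketch below is only an illustration under that reading: `StreamChunk`, `run_online`, the `Policy` callback type, and `toy_policy` are hypothetical names, and the benchmark's real inference harness may look quite different.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class StreamChunk:
    """Hypothetical unit of an audio-visual stream."""
    timestamp: float  # assumed: seconds since the start of the stream
    audio: bytes      # assumed: raw audio for this chunk (queries live here)
    frame: bytes      # assumed: encoded video frame for this chunk


# A policy receives everything seen so far and returns a reply or None (stay silent).
Policy = Callable[[List[StreamChunk]], Optional[str]]


def run_online(stream: Iterable[StreamChunk], policy: Policy) -> List[Tuple[float, str]]:
    """Feed chunks in order; the policy never sees chunks that have not arrived yet."""
    history: List[StreamChunk] = []
    responses: List[Tuple[float, str]] = []
    for chunk in stream:
        history.append(chunk)
        reply = policy(history)  # only past and current chunks are visible
        if reply is not None:
            responses.append((chunk.timestamp, reply))
    return responses


def toy_policy(history: List[StreamChunk]) -> Optional[str]:
    """Stand-in for a model: reply when the newest chunk's audio holds a marker."""
    return "ack" if b"query" in history[-1].audio else None


demo_stream = [
    StreamChunk(0.0, b"ambient", b""),
    StreamChunk(1.0, b"user query", b""),
    StreamChunk(2.0, b"ambient", b""),
]
demo_responses = run_online(demo_stream, toy_policy)
```

Each reply is stamped with the time of the chunk that triggered it, which is the information a timing-aware metric would compare against a slot's response window.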

Get this paper in your agent:

hf papers read 2605.26485

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper: 1

lucky-lance/OmniInteract

Similar Articles

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
