OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

Hugging Face Daily Papers 06/07/26, 12:00 AM Papers

omni-modal captioning instruction-following benchmark video-understanding multi-modal dataset

Summary

Introduces OmniCap-IF, the first comprehensive benchmark for evaluating instruction-following in omni-modal video captioning, revealing a format-content tradeoff and proposing improved models and datasets.

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.

Original Article


Cached at: 06/09/26, 12:41 PM


Source: https://huggingface.co/papers/2606.08572 Authors:

Abstract

OmniCap-IF is introduced as the first comprehensive benchmark for evaluating instruction-following capabilities in omni-modal captioning, revealing significant performance disparities and a format-content tradeoff in multi-modal reasoning.

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical “format-content tradeoff”, demonstrating that increasing formatting complexity directly degrades models’ omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.
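To make the two-dimension idea concrete, here is a minimal sketch of scoring a caption separately on format and content constraints. It is an illustration only, not the benchmark's actual evaluation code: the `Constraint` schema, the three rule helpers, and the keyword-based content check are all assumptions made for this example, and the real benchmark may judge content quite differently.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """One instruction constraint with a rule-based checker (hypothetical schema)."""
    name: str
    dimension: str  # "format" or "content"
    check: Callable[[str], bool]


def max_words(limit: int) -> Callable[[str], bool]:
    # Format rule: caption has at most `limit` whitespace-separated tokens.
    return lambda caption: len(caption.split()) <= limit


def bullet_count(n: int) -> Callable[[str], bool]:
    # Format rule: caption has exactly `n` lines that start with "- ".
    return lambda caption: sum(
        line.lstrip().startswith("- ") for line in caption.splitlines()
    ) == n


def mentions(term: str) -> Callable[[str], bool]:
    # Content rule (crude stand-in, an assumption): case-insensitive substring match.
    return lambda caption: term.lower() in caption.lower()


def score(caption: str, constraints: List[Constraint]) -> dict:
    """Return the fraction of satisfied constraints per dimension (None if none apply)."""
    totals = {"format": [0, 0], "content": [0, 0]}  # [passed, total]
    for c in constraints:
        totals[c.dimension][1] += 1
        if c.check(caption):
            totals[c.dimension][0] += 1
    return {dim: (p / t if t else None) for dim, (p, t) in totals.items()}


constraints = [
    Constraint("two_bullets", "format", bullet_count(2)),
    Constraint("at_most_20_words", "format", max_words(20)),
    Constraint("mentions_dog", "content", mentions("dog")),
    Constraint("mentions_barking", "content", mentions("barking")),
]

caption = "- A dog runs across the yard.\n- Birds sing in the background."
result = score(caption, constraints)
print(result)  # {'format': 1.0, 'content': 0.5}
```

In this toy case the caption satisfies both format rules but misses an audio detail (the barking), so the two scores diverge. Keeping the dimensions separate is what would let a tradeoff between them become visible.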


Get this paper in your agent:

hf papers read 2606.08572

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 2

#### NJU-LINK/OmniCaptioner-IF-7B
Image-Text-to-Text • 11B • Updated about 3 hours ago • 26

#### NJU-LINK/OmniCaptioner-IF-3B
Image-Text-to-Text • 6B • Updated about 3 hours ago • 35

Datasets citing this paper: 2

#### NJU-LINK/OmniCap-IF
Viewer • Updated about 3 hours ago • 480 • 516 • 1

#### NJU-LINK/OmniCap-IF-54K
Viewer • Updated about 3 hours ago • 53.9k • 170

Similar Articles

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

VCIFBench: Evaluating Complex Instruction Following for Video Understanding

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation
