OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

Hugging Face Daily Papers

Summary

Introduces OmniCap-IF, the first comprehensive benchmark for evaluating instruction-following in omni-modal video captioning, revealing a format-content tradeoff and proposing improved models and datasets.

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.
Cached at: 06/09/26, 12:41 PM

Source: https://huggingface.co/papers/2606.08572

Abstract

OmniCap-IF is introduced as the first comprehensive benchmark for evaluating instruction-following capabilities in omni-modal captioning, revealing significant performance disparities and a format-content tradeoff in multi-modal reasoning.

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical “format-content tradeoff”, demonstrating that increasing formatting complexity directly degrades models’ omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.
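The abstract describes scoring each caption on two separate dimensions, format correctness and content correctness, but this page does not specify the constraint types or the scoring procedure. The following is only a minimal sketch of how such a two-axis evaluation could be organized. Every name in it (`FormatConstraint`, `max_words`, `bullet_count`, `timestamp_per_line`, `format_score`, `content_score`, `evaluate`) is hypothetical, the three example constraints are illustrative and not taken from the benchmark's 50 types, and content judgments are assumed to arrive as pre-computed booleans (for instance from a human or model judge).

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical record: a readable name plus a predicate over the caption text.
@dataclass(frozen=True)
class FormatConstraint:
    name: str
    check: Callable[[str], bool]


def max_words(limit: int) -> FormatConstraint:
    """Illustrative constraint: caption has at most `limit` whitespace-separated tokens."""
    return FormatConstraint(f"max_words<={limit}", lambda cap: len(cap.split()) <= limit)


def bullet_count(n: int) -> FormatConstraint:
    """Illustrative constraint: caption has exactly `n` lines starting with '- '."""
    def check(cap: str) -> bool:
        bullets = [ln for ln in cap.splitlines() if ln.lstrip().startswith("- ")]
        return len(bullets) == n
    return FormatConstraint(f"bullet_count=={n}", check)


# Assumed timestamp layout for this sketch only: [MM:SS-MM:SS]
_TIMESTAMP = re.compile(r"\[\d{2}:\d{2}-\d{2}:\d{2}\]")


def timestamp_per_line() -> FormatConstraint:
    """Illustrative constraint: every non-empty line carries a [MM:SS-MM:SS] span."""
    def check(cap: str) -> bool:
        lines = [ln for ln in cap.splitlines() if ln.strip()]
        return bool(lines) and all(_TIMESTAMP.search(ln) for ln in lines)
    return FormatConstraint("timestamp_per_line", check)


def format_score(caption: str, constraints: List[FormatConstraint]) -> float:
    """Fraction of format constraints the caption satisfies (1.0 if none are given)."""
    if not constraints:
        return 1.0
    return sum(c.check(caption) for c in constraints) / len(constraints)


def content_score(judgments: List[bool]) -> float:
    """Fraction of content checks judged correct (0.0 if none are given)."""
    return sum(judgments) / len(judgments) if judgments else 0.0


def evaluate(caption: str,
             constraints: List[FormatConstraint],
             judgments: List[bool]) -> Dict[str, float]:
    """Report the two axes separately so a tradeoff between them stays visible."""
    return {"format": format_score(caption, constraints),
            "content": content_score(judgments)}


example_caption = "- [00:00-00:05] A dog barks.\n- [00:05-00:10] A door slams."
result = evaluate(
    example_caption,
    [bullet_count(2), timestamp_per_line(), max_words(5)],
    [True, False],  # assumed judge verdicts for two content checks
)
print(result)
```

Keeping the two scores separate, instead of folding them into one number, is a design choice made for this sketch: it allows a plot of content score against the number of format constraints, which is the kind of comparison the "format-content tradeoff" finding implies.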

Get this paper in your agent:

hf papers read 2606.08572

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 2

#### NJU-LINK/OmniCaptioner-IF-7B
Image-Text-to-Text • 11B • Updated about 3 hours ago • 26

#### NJU-LINK/OmniCaptioner-IF-3B
Image-Text-to-Text • 6B • Updated about 3 hours ago • 35

Datasets citing this paper: 2

#### NJU-LINK/OmniCap-IF
Viewer • Updated about 3 hours ago • 480 • 516 • 1

#### NJU-LINK/OmniCap-IF-54K
Viewer • Updated about 3 hours ago • 53.9k • 170


Similar Articles

VCIFBench: Evaluating Complex Instruction Following for Video Understanding

arXiv cs.CL

VCIFBench is a new benchmark for evaluating complex instruction following in video understanding, featuring 306 test instructions with content, format, style, and structure constraints, plus a DPO preference dataset. Experiments on 10 MLLMs reveal that joint constraint satisfaction remains challenging, and DPO training on the benchmark data improves instruction-following performance.