OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning
Summary
Introduces OmniCap-IF, the first comprehensive benchmark for evaluating instruction-following in omni-modal video captioning, revealing a format-content tradeoff and proposing improved models and datasets.
Cached at: 06/09/26, 12:41 PM
Source: https://huggingface.co/papers/2606.08572
Abstract
OmniCap-IF is introduced as the first comprehensive benchmark for evaluating instruction-following capabilities in omni-modal captioning, revealing significant performance disparities and a format-content tradeoff in multi-modal reasoning.
While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical “format-content tradeoff”, demonstrating that increasing formatting complexity directly degrades models’ omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.
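The two-dimensional assessment described in the abstract (format correctness and content correctness) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the constraint helpers (`max_words`, `is_json_with_keys`, `has_timestamps`), the `MM:SS-MM:SS` timestamp style, and the substring-based content check are hypothetical and are not taken from the OmniCap-IF codebase. The abstract does not list the benchmark's 50 constraint types or say how content is judged.

```python
import json
import re

# Hypothetical format constraints; the benchmark's actual constraint
# types are not listed in the abstract.
def max_words(limit):
    return lambda caption: len(caption.split()) <= limit

def is_json_with_keys(keys):
    def check(caption):
        try:
            obj = json.loads(caption)
        except json.JSONDecodeError:
            return False
        return isinstance(obj, dict) and all(k in obj for k in keys)
    return check

def has_timestamps(caption):
    # Assumed timestamp style "MM:SS-MM:SS" for temporally grounded captions.
    return re.search(r"\d{2}:\d{2}-\d{2}:\d{2}", caption) is not None

def score_caption(caption, format_checks, required_facts):
    """Return (format_score, content_score), each in [0, 1]."""
    # Format: fraction of rule-based constraints the caption satisfies.
    fmt = sum(check(caption) for check in format_checks) / len(format_checks)
    # Content: naive substring match, a stand-in for a real content judge.
    text = caption.lower()
    content = sum(fact.lower() in text for fact in required_facts) / len(required_facts)
    return fmt, content

caption = json.dumps({
    "visual": "00:00-00:05 a dog runs across the beach",
    "audio": "00:00-00:05 waves crash and a dog barks",
})
checks = [max_words(30), is_json_with_keys(["visual", "audio"]), has_timestamps]
facts = ["dog runs", "waves crash", "seagull"]
fmt_score, content_score = score_caption(caption, checks, facts)
print(fmt_score, round(content_score, 2))  # 1.0 0.67
```

Scoring the two dimensions separately is what makes a format-content tradeoff observable: in this toy case the caption satisfies every format rule yet misses one of the three expected facts.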
Models citing this paper: 2
#### NJU-LINK/OmniCaptioner-IF-7B
Image-Text-to-Text • 11B • Updated about 3 hours ago • 26
#### NJU-LINK/OmniCaptioner-IF-3B
Image-Text-to-Text • 6B • Updated about 3 hours ago • 35
Datasets citing this paper: 2
#### NJU-LINK/OmniCap-IF
Viewer • Updated about 3 hours ago • 480 • 516 • 1
#### NJU-LINK/OmniCap-IF-54K
Viewer • Updated about 3 hours ago • 53.9k • 170
Similar Articles
OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
OmniPro is the first benchmark for evaluating proactive streaming video understanding in omni-modal large language models, featuring 2,700 samples covering diverse tasks and dual-mode evaluation protocols.
VCIFBench: Evaluating Complex Instruction Following for Video Understanding
VCIFBench is a new benchmark for evaluating complex instruction following in video understanding, featuring 306 test instructions with content, format, style, and structure constraints, plus a DPO preference dataset. Experiments on 10 MLLMs reveal that joint constraint satisfaction remains challenging, and DPO training on the benchmark data improves instruction-following performance.
OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants
OmniInteract introduces a streaming benchmark for real-time omnimodal LLMs, evaluating online audio-visual processing with temporal grounding and interactive response requirements. Experiments show that current models perform poorly, with the best overall IA-QTF1 score reaching only 0.368.
OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains
OmniVideo-100K introduces an automated data engine with entity-anchored scripting and clue-guided QA generation to improve audio-visual reasoning and temporal consistency, achieving significant performance gains across multiple benchmarks.
TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation
TeachObs introduces a human-validated benchmark for multimodal teaching observation, consisting of 30 classroom videos annotated with segment-level binary codes and lesson-level expert ratings. It evaluates five frontier LLMs across three tracks, finding that no single model consistently outperforms the others and that model evaluations overrate procedurally clear lessons.