VCIFBench: Evaluating Complex Instruction Following for Video Understanding
Summary
VCIFBench is a new benchmark for evaluating complex instruction following in video understanding. It contains 306 satisfiable test instructions with content, format, style, and structure constraints, a 540-pair DPO preference dataset, and a 30-item conflict diagnostic subset. Experiments on 10 MLLMs show that joint constraint satisfaction remains challenging, and that DPO training on VCIFBench data can improve instruction-following performance.
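To make "joint constraint satisfaction" concrete, the following is a minimal sketch of how a response could be scored against several constraints at once. It is not the paper's verification pipeline: the `Constraint` class, the `evaluate` function, and the four toy rule-based checks are all hypothetical and chosen only for illustration. The point is that a response can pass most individual constraints while still failing the joint check.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Constraint:
    """Hypothetical constraint: a category label plus a rule-based check."""
    category: str  # "content", "format", "style", or "structure"
    check: Callable[[str], bool]


def evaluate(response: str, constraints: List[Constraint]) -> Tuple[float, bool]:
    """Return (fraction of constraints satisfied, whether all are satisfied)."""
    if not constraints:
        raise ValueError("at least one constraint is required")
    results = [c.check(response) for c in constraints]
    return sum(results) / len(results), all(results)


# Toy constraints (assumptions for illustration, not taken from the benchmark).
constraints = [
    Constraint("content", lambda r: "dog" in r.lower()),
    Constraint("format", lambda r: all(line.startswith("- ") for line in r.splitlines())),
    Constraint("structure", lambda r: len(r.splitlines()) == 3),
    Constraint("style", lambda r: "!" not in r),
]

response = "- A dog runs across the yard.\n- It catches a ball.\n- The owner cheers!"
rate, joint = evaluate(response, constraints)
print(f"per-constraint rate: {rate:.2f}, joint: {joint}")
```

Here the response meets the content, format, and structure checks but violates the style check, so the per-constraint rate is 0.75 while the joint result is `False`.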
# VCIFBench: Evaluating Complex Instruction Following for Video Understanding

Source: [https://arxiv.org/abs/2606.04588](https://arxiv.org/abs/2606.04588)

[View PDF](https://arxiv.org/pdf/2606.04588)

> Abstract: Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We introduce VCIFBench, a benchmark for evaluating complex instruction following in video understanding. VCIFBench constructs constraint-rich instructions from both benchmark-adapted and directly video-grounded prompts, covering content, format, style, and structure requirements, and evaluates model outputs with a hybrid verification pipeline. The benchmark contains 306 satisfiable test instructions, a 540-pair DPO preference dataset, and a 30-item conflict diagnostic subset. Experiments on 10 MLLMs show that joint constraint satisfaction remains challenging. We further show that DPO training on VCIFBench data can improve instruction-following performance.

## Submission history

From: Huangchen Xu \[[view email](https://arxiv.org/show-email/a3eb99fb/2606.04588)\]

**\[v1\]** Wed, 3 Jun 2026 08:27:53 UTC (4,159 KB)
Similar Articles
CoVEBench: Can Video Editing Models Handle Complex Instructions?
Introduces CoVEBench, a new benchmark for evaluating compositional video editing capabilities, addressing limitations in handling complex multi-step instructions. The benchmark includes 416 videos, 626 instructions, and 9,990 checklist items, revealing that current models struggle with compositional editing tasks.
OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning
Introduces OmniCap-IF, the first comprehensive benchmark for evaluating instruction-following in omni-modal video captioning, revealing a format-content tradeoff and proposing improved models and datasets.
SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence
Introduces SVI-Bench, a large-scale benchmark for strategic video intelligence using team sports, designed to evaluate models on dynamic scene understanding, causal reasoning, strategic simulation, and agentic synthesis. The benchmark reveals a capability cliff where models perform well on perceptual tasks but sharply degrade on higher-level strategic reasoning.
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
VEFX-Bench introduces a large-scale human-annotated video editing dataset (5,049 examples) with multi-dimensional quality labels and a specialized reward model for standardized evaluation of video editing systems. The paper addresses the lack of comprehensive benchmarks in AI-assisted video creation by providing VEFX-Dataset, VEFX-Reward, and a 300-video-prompt benchmark that reveals gaps in current editing models.
Benchmarking Visual State Tracking in Multimodal Video Understanding
Introduces VSTAT, a benchmark for evaluating visual state tracking in multimodal large language models (MLLMs) using 834 clips and 1,500 questions. Current MLLMs perform poorly compared to humans, failing at visual perception rather than reasoning.