VCIFBench: Evaluating Complex Instruction Following for Video Understanding

arXiv cs.CL Papers

Summary

VCIFBench is a new benchmark for evaluating complex instruction following in video understanding, featuring 306 test instructions with content, format, style, and structure constraints, plus a DPO preference dataset. Experiments on 10 MLLMs reveal that joint constraint satisfaction remains challenging, and DPO training on the benchmark data improves instruction-following performance.


# VCIFBench: Evaluating Complex Instruction Following for Video Understanding
Source: [https://arxiv.org/abs/2606.04588](https://arxiv.org/abs/2606.04588)
[View PDF](https://arxiv.org/pdf/2606.04588)

> Abstract: Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We introduce VCIFBench, a benchmark for evaluating complex instruction following in video understanding. VCIFBench constructs constraint-rich instructions from both benchmark-adapted and directly video-grounded prompts, covering content, format, style, and structure requirements, and evaluates model outputs with a hybrid verification pipeline. The benchmark contains 306 satisfiable test instructions, a 540-pair DPO preference dataset, and a 30-item conflict diagnostic subset. Experiments on 10 MLLMs show that joint constraint satisfaction remains challenging. We further show that DPO training on VCIFBench data can improve instruction-following performance.
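To make "joint constraint satisfaction" concrete, the following is a minimal sketch of how per-constraint and joint pass rates can diverge. It is not the paper's verification pipeline, which the abstract does not specify. The check functions (`is_json_object`, `max_words`, `mentions`), the `score` helper, and the sample outputs are all hypothetical, and only rule-checkable constraints are covered.

```python
import json
from typing import Callable, Dict, List

# Hypothetical rule-based checks; VCIFBench's actual verifiers are not
# described in the abstract, so these are illustrative assumptions only.
Check = Callable[[str], bool]


def is_json_object(text: str) -> bool:
    """Format constraint: the output must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def max_words(limit: int) -> Check:
    """Structure constraint: at most `limit` whitespace-separated tokens."""
    return lambda text: len(text.split()) <= limit


def mentions(keyword: str) -> Check:
    """Content constraint: the output must mention `keyword`."""
    return lambda text: keyword.lower() in text.lower()


def score(outputs: List[str], checks: Dict[str, Check]) -> dict:
    """Return the pass rate of each constraint and the joint pass rate."""
    n = len(outputs)
    if n == 0:
        return {"per_constraint": {name: 0.0 for name in checks}, "joint": 0.0}
    passed = {name: 0 for name in checks}
    joint = 0
    for out in outputs:
        results = {name: check(out) for name, check in checks.items()}
        for name, ok in results.items():
            passed[name] += ok
        joint += all(results.values())  # counts only if every check passes
    return {
        "per_constraint": {name: count / n for name, count in passed.items()},
        "joint": joint / n,
    }


# Made-up model outputs for a single made-up instruction.
OUTPUTS = [
    '{"summary": "A dog catches a frisbee in the park."}',
    "A dog catches a frisbee.",
    '{"summary": "Someone throws a frisbee across a wide green park lawn on a sunny afternoon."}',
    '{"summary": "A cat sleeps."}',
]
CHECKS = {
    "format_json": is_json_object,
    "max_12_words": max_words(12),
    "mentions_dog": mentions("dog"),
}
RESULT = score(OUTPUTS, CHECKS)
```

In this toy run each constraint passes on at least half of the outputs, yet only one output in four satisfies all three at once, which is the gap a joint metric is meant to expose.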

## Submission history

From: Huangchen Xu [[view email](https://arxiv.org/show-email/a3eb99fb/2606.04588)] **[v1]** Wed, 3 Jun 2026 08:27:53 UTC (4,159 KB)

Similar Articles

CoVEBench: Can Video Editing Models Handle Complex Instructions?

Hugging Face Daily Papers

Introduces CoVEBench, a new benchmark for evaluating compositional video editing capabilities, addressing limitations in handling complex multi-step instructions. The benchmark includes 416 videos, 626 instructions, and 9,990 checklist items, revealing that current models struggle with compositional editing tasks.

SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

Hugging Face Daily Papers

Introduces SVI-Bench, a large-scale benchmark for strategic video intelligence using team sports, designed to evaluate models on dynamic scene understanding, causal reasoning, strategic simulation, and agentic synthesis. The benchmark reveals a capability cliff where models perform well on perceptual tasks but sharply degrade on higher-level strategic reasoning.

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Hugging Face Daily Papers

VEFX-Bench introduces a large-scale human-annotated video editing dataset (5,049 examples) with multi-dimensional quality labels and a specialized reward model for standardized evaluation of video editing systems. The paper addresses the lack of comprehensive benchmarks in AI-assisted video creation by providing VEFX-Dataset, VEFX-Reward, and a 300-video-prompt benchmark that reveals gaps in current editing models.

Benchmarking Visual State Tracking in Multimodal Video Understanding

Hugging Face Daily Papers

Introduces VSTAT, a benchmark for evaluating visual state tracking in multimodal large language models (MLLMs) using 834 clips and 1,500 questions. Current MLLMs perform poorly compared to humans, failing at visual perception rather than reasoning.