VCIFBench: Evaluating Complex Instruction Following for Video Understanding

arXiv cs.CL 06/04/26, 04:00 AM Papers

benchmark video-understanding instruction-following multimodal-llm evaluation dpo

Summary

VCIFBench is a new benchmark for evaluating complex instruction following in video understanding, featuring 306 test instructions with content, format, style, and structure constraints, plus a DPO preference dataset. Experiments on 10 MLLMs reveal that joint constraint satisfaction remains challenging, and DPO training on the benchmark data improves instruction-following performance.

arXiv:2606.04588v1 Announce Type: new Abstract: Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We introduce VCIFBench, a benchmark for evaluating complex instruction following in video understanding. VCIFBench constructs constraint-rich instructions from both benchmark-adapted and directly video-grounded prompts, covering content, format, style, and structure requirements, and evaluates model outputs with a hybrid verification pipeline. The benchmark contains 306 satisfiable test instructions, a 540-pair DPO preference dataset, and a 30-item conflict diagnostic subset. Experiments on 10 MLLMs show that joint constraint satisfaction remains challenging. We further show that DPO training on VCIFBench data can improve instruction-following performance.

Cached at: 06/05/26, 02:15 AM

# VCIFBench: Evaluating Complex Instruction Following for Video Understanding
Source: [https://arxiv.org/abs/2606.04588](https://arxiv.org/abs/2606.04588)
[View PDF](https://arxiv.org/pdf/2606.04588)

> Abstract: Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We introduce VCIFBench, a benchmark for evaluating complex instruction following in video understanding. VCIFBench constructs constraint-rich instructions from both benchmark-adapted and directly video-grounded prompts, covering content, format, style, and structure requirements, and evaluates model outputs with a hybrid verification pipeline. The benchmark contains 306 satisfiable test instructions, a 540-pair DPO preference dataset, and a 30-item conflict diagnostic subset. Experiments on 10 MLLMs show that joint constraint satisfaction remains challenging. We further show that DPO training on VCIFBench data can improve instruction-following performance.
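
To make the phrase "joint constraint satisfaction" concrete, here is a minimal Python sketch of how per-constraint pass rates can differ from the rate at which a response meets every constraint at once. It is an illustration only: the abstract does not describe the paper's verifiers or metrics, so the checker functions (`max_words`, `is_json_object`, `mentions`) and the `score` helper are hypothetical names, and simple rule-based checks stand in for whatever the hybrid verification pipeline actually does.

```python
import json
from typing import Callable, Dict, List

# Hypothetical rule-based checkers (assumptions, not the paper's verifiers).
def max_words(limit: int) -> Callable[[str], bool]:
    """Structure-style constraint: response has at most `limit` words."""
    return lambda text: len(text.split()) <= limit

def is_json_object(text: str) -> bool:
    """Format-style constraint: response parses as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False

def mentions(keyword: str) -> Callable[[str], bool]:
    """Content-style constraint: response mentions `keyword` (case-insensitive)."""
    return lambda text: keyword.lower() in text.lower()

def score(outputs: List[str],
          checks: Dict[str, Callable[[str], bool]]) -> Dict[str, float]:
    """Return the pass rate of each constraint plus the joint pass rate.

    A response counts toward "joint" only if it passes every check.
    """
    passed = {name: 0 for name in checks}
    joint = 0
    for out in outputs:
        results = {name: check(out) for name, check in checks.items()}
        for name, ok in results.items():
            passed[name] += ok
        joint += all(results.values())
    n = len(outputs)
    rates = {name: count / n for name, count in passed.items()}
    rates["joint"] = joint / n
    return rates

if __name__ == "__main__":
    # Toy model responses, invented for illustration.
    outputs = [
        '{"summary": "A dog runs"}',
        "A dog runs across the field",
        '{"summary": "A cat sleeps"}',
    ]
    checks = {
        "format_json": is_json_object,
        "content_dog": mentions("dog"),
        "length_le_5": max_words(5),
    }
    print(score(outputs, checks))
```

In this toy run each individual check passes on two of the three responses, yet only one response passes all three, so the joint rate is lower than any single-constraint rate. That gap is the kind of effect a joint metric is designed to expose.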

## Submission history

From: Huangchen Xu [[view email](https://arxiv.org/show-email/a3eb99fb/2606.04588)] **[v1]** Wed, 3 Jun 2026 08:27:53 UTC (4,159 KB)

Similar Articles

CoVEBench: Can Video Editing Models Handle Complex Instructions?

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Benchmarking Visual State Tracking in Multimodal Video Understanding
