A Very Big Video Reasoning Suite
Summary
This paper introduces the Very Big Video Reasoning (VBVR) dataset and benchmark, a large-scale resource with over one million video clips across 200 reasoning tasks, enabling systematic study of spatiotemporal reasoning and showing early signs of emergent generalization.
Source: https://huggingface.co/papers/2602.20159
Published on Feb 23
#1 Paper of the day
Abstract
A large-scale video reasoning dataset and benchmark are introduced to study video intelligence capabilities beyond visual quality, enabling systematic analysis of spatiotemporal reasoning and generalization across diverse tasks.
Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy and over one million video clips, approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, benchmark toolkit, and models are publicly available at https://video-reason.com/.
Get this paper in your agent:
hf papers read 2602.20159
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 10
#### Video-Reason/VBVR-Wan2.2 Image-to-Video • Updated Apr 15 • 223 • 129
#### Video-Reason/VBVR-LTX2.3-diffsynth Image-to-Video • Updated Apr 14 • 336 • 22
#### Video-Reason/VBVR-Wan2.1-diffsynth Image-to-Video • Updated Apr 14 • 37 • 6
#### Video-Reason/VBVR-Wan2.2-diffsynth Image-to-Video • Updated Apr 14 • 420 • 5
Datasets citing this paper: 4
#### Video-Reason/VBVR-Dataset Viewer • Updated Apr 1 • 1M • 2.96k • 54
#### Video-Reason/VBVR-Bench-Data Viewer • Updated Apr 1 • 500 • 1.32k • 9
#### Video-Reason/video-mcp Viewer • Updated Apr 1 • 3.91k • 863 • 2
#### abs794/VBVR-Bench-Data Viewer • Updated Feb 24 • 500 • 427
Spaces citing this paper: 1
Collections including this paper: 18
Similar Articles
VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
VGenST-Bench is a benchmark that uses generative models to actively synthesize controlled spatio-temporal reasoning scenarios, with a multi-agent pipeline and human quality control, to evaluate multimodal large language models.
SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence
Introduces SVI-Bench, a large-scale benchmark for strategic video intelligence using team sports, designed to evaluate models on dynamic scene understanding, causal reasoning, strategic simulation, and agentic synthesis. The benchmark reveals a capability cliff where models perform well on perceptual tasks but sharply degrade on higher-level strategic reasoning.
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR is a research paper proposing a closed-loop framework that integrates vision-language models with video generation models to improve visual reasoning and correct failures in real time.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
This paper introduces RIS, a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models to overcome information bottlenecks. It proposes anchoring latent tokens to spatial and semantic evidence, showing improvements on benchmarks like V* and HRBench.
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
This paper introduces WorldReasonBench and WorldRewardBench, new benchmarks designed to evaluate video generation models' ability to reason about world-state evolution and physical consistency. The research highlights a gap between visual plausibility and true logical reasoning in current commercial video generators.