Quantitative Video World Model Evaluation for Geometric-Consistency
Summary
A quantitative framework called PDI-Bench is introduced for evaluating geometric coherence in generated videos through monocular reconstruction and projective-geometry residuals, revealing geometry-specific failure modes in video generators.
Source: https://huggingface.co/papers/2605.15185
Abstract
Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world models. Our code and dataset can be found at https://pdi-bench.github.io/.
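The abstract names the residual dimensions but does not give their formulas. As a rough illustration only, the sketch below shows what two such checks could look like under a pinhole-camera assumption: for a rigid object, apparent size times depth should stay constant across frames, and pairwise 3D distances between tracked points should not change. The function names `scale_depth_residual` and `rigidity_residual`, their inputs, and the normalizations are hypothetical and are not taken from PDI-Bench.

```python
import math
from itertools import combinations


def scale_depth_residual(sizes_px, depths):
    """Coefficient of variation of (apparent size * depth) across frames.

    Assumption (not from the paper): under a pinhole camera, a rigid
    object's image size is proportional to 1/depth, so the product
    should be constant. 0.0 means perfectly consistent.
    """
    products = [s * z for s, z in zip(sizes_px, depths)]
    mean = sum(products) / len(products)
    variance = sum((p - mean) ** 2 for p in products) / len(products)
    return math.sqrt(variance) / mean


def rigidity_residual(tracks):
    """Mean relative change in pairwise 3D distances versus the first frame.

    tracks[t][i] is the hypothetical (x, y, z) world position of tracked
    point i at frame t. A rigid object gives 0.0 up to rounding.
    """
    reference = tracks[0]
    pairs = list(combinations(range(len(reference)), 2))
    ref_dists = [math.dist(reference[i], reference[j]) for i, j in pairs]
    total, count = 0.0, 0
    for frame in tracks[1:]:
        for (i, j), d0 in zip(pairs, ref_dists):
            total += abs(math.dist(frame[i], frame[j]) - d0) / d0
            count += 1
    return total / count if count else 0.0


# Toy data: an object that halves in size each time its depth doubles.
consistent = scale_depth_residual([100.0, 50.0, 25.0], [2.0, 4.0, 8.0])
# Toy data: an object whose size ignores its changing depth.
inconsistent = scale_depth_residual([100.0, 100.0, 100.0], [2.0, 4.0, 8.0])

# Three points translated rigidly, then the same points stretched along x.
rigid = rigidity_residual([
    [(0, 0, 0), (1, 0, 0), (0, 1, 0)],
    [(1, 1, 1), (2, 1, 1), (1, 2, 1)],
])
stretched = rigidity_residual([
    [(0, 0, 0), (1, 0, 0), (0, 1, 0)],
    [(0, 0, 0), (2, 0, 0), (0, 1, 0)],
])
```

In this toy setup the consistent and rigid cases score zero, while the depth-ignoring and stretched cases score above zero. A real pipeline would take sizes, depths, and 3D tracks from the segmentation, tracking, and reconstruction stages rather than hand-written lists.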
Similar Articles
Towards Consistent Video Geometry Estimation
ViGeo is a transformer-based foundation model that recovers dense and consistent 3D geometry from videos using dynamic chunking attention and a completion-based data refinement framework, achieving state-of-the-art performance across multiple tasks.
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models
CRONOS is a benchmark that evaluates counterfactual physical consistency in video prediction models by intervening on viewpoint, scene, object category, and appearance while keeping physical event types fixed. It reveals substantial failures in current video generators.
WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
WBench is a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns, providing automatic sub-metrics and diagnostic insights. It reveals that no single model excels across all dimensions.
Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models
GeoVR enhances multimodal large language models with 3D awareness by restructuring their semantic latent space through geometric knowledge distillation from 3D foundation models using multiple geometric targets.
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
This paper proposes GASP, a framework that injects geometric priors into vision-language models via deep supervision with contrastive and depth consistency losses, achieving significant improvements on 3D spatial reasoning benchmarks without using 3D VQA data.