Quantitative Video World Model Evaluation for Geometric-Consistency

Hugging Face Daily Papers 05/14/26, 12:00 AM Papers

Summary

A quantitative framework called PDI-Bench is introduced for evaluating geometric coherence in generated videos through monocular reconstruction and projective-geometry residuals, revealing geometry-specific failure modes in video generators.

Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and only weakly diagnostic of geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world models. Our code and dataset can be found at https://pdi-bench.github.io/.
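The abstract names the three failure dimensions but does not give their formulas. As a rough illustration only, the following NumPy sketch shows what residuals of this kind might look like once tracked points have been lifted to 3D. Everything here is an assumption rather than PDI-Bench's actual definition: the function names, the pinhole-camera reasoning for the scale-depth check, the pairwise-distance test for rigidity, and the constant-velocity assumption for motion are all hypothetical stand-ins.

```python
import numpy as np

def scale_depth_residual(sizes_px, depths):
    """Hypothetical scale-depth check. Under a pinhole camera the image size of
    an object of fixed physical size S is s = f * S / Z, so s * Z should stay
    constant. Returns the coefficient of variation of s * Z over the clip."""
    prod = np.asarray(sizes_px, float) * np.asarray(depths, float)
    return float(np.std(prod) / np.mean(prod))

def rigidity_residual(points_3d):
    """Hypothetical rigidity check. points_3d has shape (T, N, 3): N tracked
    points on one object over T frames. Pairwise distances on a rigid body are
    constant, so return the mean relative deviation of each pairwise distance
    from its temporal median."""
    pts = np.asarray(points_3d, float)
    diff = pts[:, :, None, :] - pts[:, None, :, :]   # (T, N, N, 3)
    dist = np.linalg.norm(diff, axis=-1)             # (T, N, N)
    i, j = np.triu_indices(pts.shape[1], k=1)        # each unordered pair once
    d = dist[:, i, j]                                # (T, num_pairs)
    ref = np.median(d, axis=0)
    return float(np.mean(np.abs(d - ref) / ref))

def motion_residual(centroids):
    """Hypothetical motion check. centroids has shape (T, 3). Assumes smooth,
    near-constant-velocity motion: returns the mean magnitude of the second
    difference, normalised by the mean per-frame step length."""
    c = np.asarray(centroids, float)
    vel = np.diff(c, axis=0)
    acc = np.diff(vel, axis=0)
    step = np.mean(np.linalg.norm(vel, axis=1))
    return float(np.mean(np.linalg.norm(acc, axis=1)) / (step + 1e-12))

# Synthetic, geometrically consistent clip: a rigid object receding along +Z.
T = 10
Z = np.linspace(2.0, 6.0, T)
sizes = 500.0 * 1.0 / Z                              # f = 500 px, S = 1 unit
path = np.stack([np.zeros(T), np.zeros(T), Z], axis=1)
base = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
pts = base[None, :, :] + path[:, None, :]

consistent = (scale_depth_residual(sizes, Z),
              rigidity_residual(pts),
              motion_residual(path))

# A clip whose object keeps the same image size while receding violates the
# scale-depth relation and should score clearly above zero.
inconsistent_scale = scale_depth_residual(np.full(T, 100.0), Z)
```

On the consistent synthetic clip all three residuals are zero up to floating-point error, while the constant-size clip yields a clearly non-zero scale-depth residual. In a real pipeline the sizes, depths, and 3D tracks would come from the segmentation, tracking, and reconstruction stages, whose own errors would also show up in these numbers.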

Original Article

Cached at: 05/15/26, 04:23 AM

Source: https://huggingface.co/papers/2605.15185



Similar Articles

Towards Consistent Video Geometry Estimation

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
