Quantitative Video World Model Evaluation for Geometric-Consistency

Hugging Face Daily Papers

Summary

A quantitative framework called PDI-Bench is introduced for evaluating geometric coherence in generated videos through monocular reconstruction and projective-geometry residuals, revealing geometry-specific failure modes in video generators.

Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and only weakly diagnostic of geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world models. Our code and dataset can be found at https://pdi-bench.github.io/.
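
To make two of the three failure dimensions concrete, here is a minimal sketch of what a scale-depth residual and a structural-rigidity residual could look like once tracked points have been lifted to 3D. This is an illustration under stated assumptions, not PDI-Bench's actual definitions: the function names, the input shapes, and the choice of normalization are all hypothetical, and the abstract does not specify the formulas used. The sketch assumes a pinhole camera, under which an object's image size scales as 1/depth, and assumes a rigid object keeps its pairwise 3D point distances fixed.

```python
import numpy as np


def scale_depth_residual(apparent_size, depth):
    """Relative spread of (image size * depth) across frames.

    Hypothetical residual: under a pinhole camera, an object of fixed
    physical size at depth Z projects to an image size proportional to 1/Z,
    so size * depth should be constant. Returns the coefficient of
    variation of that product (0 means perfectly consistent).
    """
    size = np.asarray(apparent_size, dtype=float)
    z = np.asarray(depth, dtype=float)
    product = size * z
    return float(np.std(product) / np.mean(product))


def rigidity_residual(points_3d):
    """Mean relative change in pairwise 3D distances versus frame 0.

    Hypothetical residual: points_3d has shape (T, N, 3), i.e. T frames of
    N tracked points on one object. Assumes the N points are distinct in
    frame 0 so that no reference distance is zero.
    """
    pts = np.asarray(points_3d, dtype=float)
    diffs = pts[:, :, None, :] - pts[:, None, :, :]  # shape (T, N, N, 3)
    dists = np.linalg.norm(diffs, axis=-1)           # shape (T, N, N)
    i, j = np.triu_indices(pts.shape[1], k=1)        # each unordered pair once
    pair_d = dists[:, i, j]                          # shape (T, N*(N-1)/2)
    ref = pair_d[0]
    return float(np.mean(np.abs(pair_d - ref) / ref))
```

Both functions return 0 for geometrically consistent input and grow as the clip departs from it. For example, a set of points that only translates between frames yields a rigidity residual of 0, while points that stretch over time yield a positive value.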

Cached at: 05/15/26, 04:23 AM

Source: https://huggingface.co/papers/2605.15185

Similar Articles

Towards Consistent Video Geometry Estimation

Hugging Face Daily Papers

ViGeo is a transformer-based foundation model that recovers dense and consistent 3D geometry from videos using dynamic chunking attention and a completion-based data refinement framework, achieving state-of-the-art performance across multiple tasks.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

Hugging Face Daily Papers

CRONOS is a benchmark that evaluates counterfactual physical consistency in video prediction models by intervening on viewpoint, scene, object category, and appearance while keeping physical event types fixed. It reveals substantial failures in current video generators.