Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation
Summary
Physics Question Scene Graph (PQSG) is a hierarchical question-based pipeline using VLMs to evaluate video generation models' physical plausibility with fine-grained violation detection. It introduces the FinePhyEval dataset and shows higher correlation with human judgments than prior work.
Cached at: 06/26/26, 02:04 AM
Source: https://huggingface.co/papers/2606.25306
Abstract
A vision-language model-based hierarchical question graph framework evaluates video generation models’ adherence to physical laws with granular violation detection and human correlation validation.
Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that follow basic physical laws. Compounding this is a lack of reliable granular evaluation methods for localizing and specifying physical law violations in videos. We address this by introducing Physics Question Scene Graph (PQSG), a hierarchical question-based evaluation pipeline. PQSG evaluates generated videos by checking their faithfulness to a prompt across objects, actions, and adherence to physical laws using a graph-based hierarchy of questions generated by a vision-language model (VLM), guided by high-quality in-context examples. By representing questions as a graph, PQSG introduces logical dependencies within questions, ensuring that each query is contextually valid. Moreover, PQSG provides granular assessments of which qualities of the video violate physical plausibility constraints. We validate PQSG by creating FinePhyEval, a dataset with physics-based prompts and corresponding generated videos from diverse state-of-the-art video generation models (Sora 2, Veo 3, and Wan 2.1), with each video annotated across multiple categories by humans. Using FinePhyEval, we measure the correlation between PQSG’s fine-grained scores and human judgments, showing higher overall correlations than prior work. We also find that PQSG ranks closed-source models higher than Wan 2.1 on physical realism. Lastly, we show that the annotations we provide in FinePhyEval can also be used for subtask evaluation: we benchmark two strong VLMs on generating and answering questions, finding that while models can create human-like questions, they still fall short of human performance in answering them.
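The idea of a question graph with logical dependencies can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how PQSG scores a question whose prerequisite fails, so the gating rule below (a question counts as failed when any parent fails) is an assumption, as are the names `Question`, `evaluate`, and `category_scores`, the example questions, and the hard-coded answers that stand in for VLM outputs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Question:
    """One node in a hypothetical question graph."""
    qid: str
    text: str
    category: str  # "object", "action", or "physics"
    parents: List[str] = field(default_factory=list)  # prerequisite question ids


def evaluate(questions: List[Question], answers: Dict[str, bool]) -> Dict[str, bool]:
    """Gate each answer on its parents (assumed rule, not taken from the paper).

    `questions` must be listed with parents before children.
    `answers` maps question id to a yes/no answer, e.g. from a VLM.
    """
    results: Dict[str, bool] = {}
    for q in questions:
        parents_ok = all(results.get(p, False) for p in q.parents)
        results[q.qid] = parents_ok and answers.get(q.qid, False)
    return results


def category_scores(questions: List[Question], results: Dict[str, bool]) -> Dict[str, float]:
    """Fraction of satisfied questions per category."""
    totals: Dict[str, int] = {}
    passed: Dict[str, int] = {}
    for q in questions:
        totals[q.category] = totals.get(q.category, 0) + 1
        passed[q.category] = passed.get(q.category, 0) + int(results[q.qid])
    return {c: passed[c] / totals[c] for c in totals}


# Hypothetical graph for the prompt "A ball rolls off a table and falls to the floor".
questions = [
    Question("q1", "Is there a ball?", "object"),
    Question("q2", "Is there a table?", "object"),
    Question("q3", "Does the ball roll off the table?", "action", ["q1", "q2"]),
    Question("q4", "Does the ball fall downward after leaving the table?", "physics", ["q3"]),
]

# Stand-in answers; a real pipeline would query a VLM on the video.
answers = {"q1": True, "q2": True, "q3": True, "q4": False}
results = evaluate(questions, answers)
scores = category_scores(questions, results)
print(scores)
```

In this example the objects and the action are present but the physics question fails, so the per-category scores localize the problem to the physics level. If `q1` were answered "no", the gating rule would also mark `q3` and `q4` as failed, since asking about the ball's motion is not meaningful when no ball is present.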
Similar Articles
Physics-IQ Verified
This paper presents a systematic audit of the Physics-IQ benchmark for evaluating physical understanding in video generative models, proposing improvements to prompts and scoring to enhance reliability.
PhyDrawGen: Physically Grounded Diagram Generation from Natural Language
PhyDrawGen is a neuro-symbolic pipeline that generates physically accurate diagrams from natural language by combining LLM-based scene understanding with a deterministic constraint solver and a VLM-based verify loop, outperforming existing models on a benchmark of physics problems.
Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
This paper audits multimodal physics evaluation pipelines, revealing issues like train-eval contamination, translation drift, and MCQ saturation. It releases new datasets (PhysCorp-A, PhysR1Corp, PhysOlym-A) and a training recipe (Physics-R1) that significantly improves performance on held-out olympiad problems.
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
PhyMotion proposes a physics-grounded reward system that evaluates kinematic plausibility, contact consistency, and dynamic feasibility of human motion in generated videos, achieving stronger correlation with human judgment and improving motion realism in RL-based post-training.
Quantitative Video World Model Evaluation for Geometric-Consistency
A quantitative framework called PDI-Bench is introduced for evaluating geometric coherence in generated videos through monocular reconstruction and projective-geometry residuals, revealing geometry-specific failure modes in video generators.