Physically Viable World Models: A Case for Query-Conditioned Embodied AI

arXiv cs.AI Papers

Summary

This paper argues that world models for embodied AI must be physically viable and query-conditioned, focusing on identifying the simplest physical abstraction for each intervention query rather than merely predicting observations.

arXiv:2605.30542v1 Announce Type: new Abstract: World models for embodied AI must be physically viable: constructed to answer intervention queries by representing the physical structure governing action outcomes, rather than merely predicting future observations. Existing observation-predictive world models can produce visually plausible but physically wrong rollouts. This failure is structural; distinct physical systems can look identical yet diverge under intervention. We expose this problem with controlled benchmarks that fix the visible scene while varying latent physics. We show that such models may recommend infeasible actions, mispredict interaction outcomes, or certify unsafe behavior. We argue that embodied AI requires world models that identify the simplest physical abstraction sufficient to answer an intervention query. Such a model comprises modular components, including environment representation, latent state and parameter estimation, action specification, interventional dynamics, and query-level response. An autonomous orchestrator should identify the relevant abstraction and compose compatible learned and structured components per query. When closed-form physics is unavailable, uncertain, or costly, the transition model may be analytic, simulated, learned, or hybrid, but it must preserve the structure that determines interventional outcomes. This decomposition makes the model interpretable, its components verifiable, and its outputs auditable against the query. It also provides a design principle for new world models and a feasibility test for existing ones: the right abstraction is not the most detailed model of the world, but the simplest model that preserves the distinctions relevant to the query. We demonstrate this approach on queries that existing systems fail to answer correctly, and outline how an orchestrator can dynamically assemble and adapt physically viable models for planning, control, and verification.

Cached at: 06/01/26, 09:23 AM

# Physically Viable World Models: A Case for Query-Conditioned Embodied AI
Source: [https://arxiv.org/html/2605.30542](https://arxiv.org/html/2605.30542)
Adam J\. Thorpe, Stepan Tretiakov, Cheng\-Hsi Hsiao, Su Ann Low, Xingjian Li, Hassan Iqbal, Neel P\. Bhatt, Ufuk Topcu, Krishna Kumar

The University of Texas at Austin

adam\.thorpe@austin\.utexas\.edu, stepan@utexas\.edu, chhsiao@utexas\.edu, suann@utexas\.edu, xingjian\.li@austin\.utexas\.edu, hassan\.iqbal@utexas\.edu, npbhatt@utexas\.edu, utopcu@utexas\.edu, krishnak@utexas\.edu

###### Abstract

World models for embodied AI must be physically viable: constructed to answer intervention queries by representing the underlying physical structure that governs the outcomes of an agent’s action, rather than merely serving as generic predictors of future observations\. Existing world models, trained to predict observations, can produce visually plausible rollouts that are physically incorrect\. This failure is structural; different physical systems can produce identical observations yet behave differently under intervention\. We expose this problem through controlled benchmarks that keep the visible scene fixed while varying the underlying physics\. We demonstrate that observation\-predictive models may recommend infeasible actions, mispredict interaction outcomes, or certify behaviors that would otherwise be unsafe in the real world\. We argue that embodied AI requires world models that identify the simplest physical abstraction sufficient to answer a given intervention query\. Such a physically viable world model is composed from modular components, including environment representation, latent state and parameter estimation, action specification, dynamics under intervention, and response to queries\. An autonomous orchestrator should identify the relevant abstraction and compose the world model from compatible learned and structured components for each query\. The transition model may be analytic, simulated, learned, or hybrid when closed\-form physics is unavailable, uncertain, or computationally expensive, but it must preserve the physical structure that determines the outcome of interventions\. The resulting modular decomposition makes the model interpretable, its components verifiable, and its outputs auditable against the query\. 
It also provides both a design principle for new world models and a feasibility test for existing ones: the right abstraction is not the most detailed model of the world, but the simplest model that preserves the distinctions relevant to the query\. We demonstrate how this approach could work in practice on intervention queries that existing systems fail to answer correctly, and outline how an orchestrator can dynamically assemble and adapt physically viable models for planning, control, and verification\.

![Refer to caption](https://arxiv.org/html/2605.30542v1/x1.png)

Figure 1: Visual world models can produce visually plausible but physically impossible predictions\. We argue that embodied AI therefore requires world models that identify the simplest physical abstraction sufficient to answer a given intervention query\.

## 1 Introduction

A world model for embodied AI is physically viable when it supports correct reasoning about how a physical system evolves under intervention\. The relevant future is not just a sequence of observations, but the evolution of that system under intervention\. A model used for planning, control, counterfactual reasoning, or safety analysis must therefore preserve the physical distinctions that affect those decisions\. Visual plausibility does not guarantee this property\. A rollout may look realistic while using invalid dynamics or omitting variables that the action depends on\. Embodied world modeling therefore requires query\-conditioned physically viable world models: models whose state variables, dynamics, parameters, and constraints are sufficient for the intervention being evaluated\. Scaling observation prediction alone does not guarantee that this requirement is met, especially when the variables that determine the outcome are latent, unobserved, or revealed only through action\. As illustrated in Figure [1](https://arxiv.org/html/2605.30542#S0.F1), the same intervention query may require reasoning over contact, mass, fluid response, or stability constraints that are not identifiable from visual appearance alone\.

A query specifies the intervention, the task, and the standard of correctness the prediction must satisfy\. These elements determine the abstraction the model must construct: which variables, dynamics, parameters, constraints, and level of fidelity are required to answer the query\. Physical viability does not require the most detailed model of the world\. It requires the simplest abstraction that preserves the distinctions relevant to the query\. For example, a grasping task may require contact and friction, a pouring task may require volume transfer and conservation laws, and a safety query may require reachability or barrier constraints without requiring photorealistic rendering\. The right world model is therefore not the most realistic model in general, but the model whose variables, equations, parameters, and constraints are sufficient for the intervention being considered\.

Current observation\-predictive world models fail when their training signal does not identify the dynamics needed for action\. Vision\-language models, video generators, and latent predictive models can match perceptual regularities while failing on the latent physical variables that determine intervention outcomes Chow et al\. \([2025](https://arxiv.org/html/2605.30542#bib.bib4)\); Kang et al\. \([2024](https://arxiv.org/html/2605.30542#bib.bib3)\); Meng et al\. \([2024](https://arxiv.org/html/2605.30542#bib.bib5)\); Guo et al\. \([2025](https://arxiv.org/html/2605.30542#bib.bib7)\); Gu et al\. \([2025](https://arxiv.org/html/2605.30542#bib.bib8)\); Motamed et al\. \([2026](https://arxiv.org/html/2605.30542#bib.bib22)\); Zhang et al\. \([2025](https://arxiv.org/html/2605.30542#bib.bib9), [2026a](https://arxiv.org/html/2605.30542#bib.bib11)\)\. The issue is structural rather than architectural\. The same observations can fit multiple physical systems that behave differently when acted on\. More passive data may improve visual realism and short\-horizon prediction without resolving the distinctions that determine the intervention outcome\.

We argue that world modeling for embodied AI must shift from observation extrapolation to query\-conditioned construction of physically viable models\. This position requires not maximal physical detail, but explicit selection of the abstraction needed to answer the query\. A physically viable model must represent the variables the intervention acts on, use compatible dynamics and constraints, and return the form of answer the query requires\. This view separates roles that end\-to\-end predictors often conflate, including perception, abstraction, parameter estimation, dynamics, and query\-level response\. World models should therefore be judged not strictly by perceptual realism, but by whether their abstractions support downstream decisions under intervention\. We support this position with controlled examples that expose the failure modes of visual world models and with illustrative constructions that show how query\-conditioned physical abstractions could address them\.

Contributions: This paper makes three contributions\. \(1\) We argue that observation\-predictive world models are structurally inadequate for embodied intervention, and that embodied world models should be constructed per query, around the physical distinctions the intervention depends on\. \(2\) We demonstrate the resulting failures in three model families \(vision\-language, video diffusion, and action\-conditioned latent prediction\) under controlled physical variation\. \(3\) We define a modular design framework for query\-conditioned construction of a physically viable world model, and specify the operations an orchestrator must perform to compose it\.

## 2 How Current World Models Fail to Represent Physics

Current world models are often optimized to predict future observations from past observations, producing visually coherent predictions that fail under intervention\. We use controlled simulations as an evaluation suite with three tests: static and counterfactual VLM prediction, diffusion video continuation, and action\-conditioned latent control\. Across these tests, appearance or action is held nearly fixed while latent physics varies; models often give plausible explanations, videos, or actions without preserving the variables that determine the outcome\. Full protocols and results are in Appendix [B\.1](https://arxiv.org/html/2605.30542#A2.SS1), Appendix [B\.2](https://arxiv.org/html/2605.30542#A2.SS2), and Appendix [B\.3](https://arxiv.org/html/2605.30542#A2.SS3)\.

![Refer to caption](https://arxiv.org/html/2605.30542v1/x2.png)

Figure 2: Controlled evaluation scenes used to expose failures of visual world models under latent physical variation, including rigid\-body collision, deformable interaction, rigid–fluid coupling, contact\-dependent pushing, and viscosity\-dependent pouring\.

### 2\.1 Simulation suite and tests

We evaluate models on controlled simulations with specified physical parameters, interventions, and reference rollouts \(Figure [2](https://arxiv.org/html/2605.30542#S2.F2)\)\. Each scene family isolates a query\-relevant latent variable while holding appearance or action nearly fixed\. The suite supports the three tests summarized above and detailed in Appendix [B\.1](https://arxiv.org/html/2605.30542#A2.SS1), Appendix [B\.2](https://arxiv.org/html/2605.30542#A2.SS2), and Appendix [B\.3](https://arxiv.org/html/2605.30542#A2.SS3)\.

Ramp\-to\-tower rigid\-body interactions\. These scenes test whether a model tracks latent rigid\-body properties rather than predicting a generic collision outcome from visual appearance alone\. A ball rolls down a ramp and collides with a small block tower\. We use two closely related setups\. In the density\-variation setup, the tower geometry remains fixed while the block material changes, for example from wood to steel, so that the same apparent collision produces different momentum transfer and collapse behavior\. In the restitution setup, we introduce high\-restitution objects, including a bouncy projectile and composite towers that may contain a bouncy element above rigid lower blocks\. These variants are shown in [Figure 8](https://arxiv.org/html/2605.30542#A2.F8)\. They probe whether predictions account for mass, inertia, restitution, frictional dissipation, and contact ordering\.
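
As a rough illustration of why the density\-variation setup changes the outcome, the sketch below applies the textbook one\-dimensional collision model \(momentum conservation plus Newton's restitution law\) to a ball striking a block at rest\. The function name, masses, impact speed, and restitution value are illustrative assumptions and are not taken from the simulation suite, which resolves full three\-dimensional contact\.

```python
def post_impact_velocities(m_ball, m_block, v_ball, restitution):
    """1D collision of a moving ball with a block that starts at rest.

    Combines momentum conservation with Newton's restitution law
    (separation speed = restitution * approach speed).
    """
    total = m_ball + m_block
    v_ball_after = (m_ball - restitution * m_block) / total * v_ball
    v_block_after = (1.0 + restitution) * m_ball / total * v_ball
    return v_ball_after, v_block_after


# Same geometry and impact speed; only the (hypothetical) block mass differs.
ball_mass, impact_speed, e = 0.5, 2.0, 0.5
for label, block_mass in [("wood-like", 0.3), ("steel-like", 4.0)]:
    _, v_block = post_impact_velocities(ball_mass, block_mass, impact_speed, e)
    print(f"{label}: block leaves at {v_block:.2f} m/s")
```

Under these assumed values the lighter block leaves the impact several times faster than the heavier one, even though the approach looks the same in both cases\.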

Deformable\-jelly\-wall interactions\. We replace the rigid block tower with one or two deformable jelly walls and vary the release distance\. Visually similar collisions can produce deformation, sliding, tipping, or momentum transfer through multiple deformable bodies depending on impact energy and material response\. Correct prediction therefore requires tracking deformation, energy dissipation, and sequential contact dynamics\. These scenes test whether predictions capture compliance and sequential deformable\-body interaction\. The static VLM benchmark includes the single\-jelly\-wall and two\-jelly\-wall settings in [Figure 7](https://arxiv.org/html/2605.30542#A2.F7), while [Figure 10](https://arxiv.org/html/2605.30542#A2.F10) compares the physical rollout against a diffusion\-generated continuation for the two\-jelly\-wall case\.

Ramp\-to\-liquid\-filled\-cup impact\. These scenes test whether models capture coupled rigid–fluid interaction\. A ball strikes a partially or fully liquid\-filled cup while geometry and camera position remain fixed\. We vary the ball material, release height, and fill level across trials\. The outcome depends on ball momentum, cup motion, liquid inertia, sloshing, spillage, and the combined cup–fluid center of mass\. The liquid\-filled\-cup setting appears in the static VLM benchmark in [Figure 7](https://arxiv.org/html/2605.30542#A2.F7); [Figure 9](https://arxiv.org/html/2605.30542#A2.F9) further compares the physically simulated rigid–fluid interaction against a diffusion\-generated video continuation\.

Robot\-wall pushing\. A Franka Panda end\-effector follows the same horizontal pushing trajectory into a freestanding wall while contact height or floor friction is varied\. High and low contact points separate tipping torque from translation, while friction variation separates sliding from overturning\. This tests whether action prediction is conditioned on the physical contact regime\. Representative high\-push, low\-push, and material\-variation rollouts are shown in [Figures 11](https://arxiv.org/html/2605.30542#A2.F11) and [12](https://arxiv.org/html/2605.30542#A2.F12), where visually selected trajectories are compared against simulator\-grounded executions\.
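
The separation between sliding and tipping can be sketched with a quasi\-static force balance for a uniform rigid block pushed horizontally\. This is a simplified stand\-in for the simulated wall, not the suite's contact model; the function name and numerical values are hypothetical\.

```python
def contact_regime(friction_coeff, push_height, base_width):
    """Classify the first failure mode of a pushed block as 'slide' or 'tip'.

    Quasi-static balance with the push force F expressed in units of the
    block weight (so the mass cancels):
      - sliding starts when F exceeds friction_coeff
      - tipping about the far bottom edge starts when F * push_height
        exceeds base_width / 2, i.e. F exceeds base_width / (2 * push_height)
    Whichever threshold is smaller is reached first as the push increases.
    """
    slide_threshold = friction_coeff
    tip_threshold = base_width / (2.0 * push_height)
    return "tip" if tip_threshold < slide_threshold else "slide"


# Same block (0.2 m deep) and the same horizontal push; only the latent
# contact height or floor friction changes (illustrative values).
print(contact_regime(friction_coeff=0.5, push_height=0.1, base_width=0.2))
print(contact_regime(friction_coeff=0.5, push_height=0.5, base_width=0.2))
print(contact_regime(friction_coeff=0.1, push_height=0.5, base_width=0.2))
```

With these values the low push slides the block, the high push tips it, and lowering the friction turns the high push back into sliding, which is the kind of regime change the scene is designed to probe\.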

Robot\-arm pouring and viscosity variants\. These scenes test whether models infer latent fluid properties that alter the correct action\. A robot pours from one glass into another under fixed geometry and controlled motion\. We compare water\-like, honey\-like, and synthetic\-viscosity liquids\. Viscosity changes flow rate, transfer timing, retained liquid, and spill behavior, so the correct action may require a different hold duration or parameter\-identification step\. Candidate viscosities are evaluated against query\-relevant quantities such as receiver fill, residual liquid, and spillage\. The viscosity\-dependent pouring variants and best\-match parameter\-estimation result are shown in [Figure 13](https://arxiv.org/html/2605.30542#A2.F13)\.
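
The best\-match procedure can be sketched as a grid search over candidate viscosities\. The pour model below is a deliberately crude assumption \(flow rate inversely proportional to viscosity\) that stands in for the fluid simulator; all names, units, and numbers are hypothetical\.

```python
def simulate_pour(viscosity, hold_time, source_volume=250.0, flow_constant=50.0):
    """Toy pour model (an assumption, not the paper's simulator).

    Flow rate is taken to be inversely proportional to viscosity, so the
    transferred volume is flow_constant * hold_time / viscosity, capped by
    what the source glass holds. Units are arbitrary but consistent.
    """
    transferred = min(source_volume, flow_constant * hold_time / viscosity)
    return {"receiver_fill": transferred, "residual": source_volume - transferred}


def best_match_viscosity(candidates, observed_fill, hold_time):
    """Return the candidate whose simulated receiver fill is closest to the observation."""
    return min(
        candidates,
        key=lambda mu: abs(simulate_pour(mu, hold_time)["receiver_fill"] - observed_fill),
    )


def hold_time_for_target(viscosity, target_fill, flow_constant=50.0):
    """Invert the toy model: hold duration needed to transfer target_fill."""
    return target_fill * viscosity / flow_constant


candidates = [1.0, 5.0, 25.0, 125.0]  # hypothetical viscosity grid
observed = 42.0                       # receiver fill measured after a 4 s probe pour
mu_hat = best_match_viscosity(candidates, observed, hold_time=4.0)
print(mu_hat, hold_time_for_target(mu_hat, target_fill=150.0))
```

The estimate is scored on the quantity the query cares about \(receiver fill\) rather than on image similarity, and the selected viscosity then changes the recommended hold duration\.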

Together, these scenes cover rigid impact, restitution, deformable interaction, rigid–fluid coupling, contact\-rich pushing, and viscosity\-dependent pouring\. They instantiate the VLM, diffusion, latent\-control, and viscosity\-estimation tests reported in the appendix\. The simulation code is available at [https://github\.com/pvwm/physically\-viable\-world\-models](https://github.com/pvwm/physically-viable-world-models), and supplementary videos of simulator rollouts, diffusion\-generated video continuations, and V\-JEPA action\-conditioned rollouts are available on the project website: [https://pvwm\.github\.io/](https://pvwm.github.io/)\.

### 2\.2 Why the failures are structural

The failures above are not specific to one architecture\. The tests in [Sections B\.1](https://arxiv.org/html/2605.30542#A2.SS1), [B\.2](https://arxiv.org/html/2605.30542#A2.SS2), and [B\.3](https://arxiv.org/html/2605.30542#A2.SS3) show the same pattern in different forms: VLMs identify relevant effects but miss thresholded outcomes; diffusion rollouts remain visually coherent while violating contact, fluid, or deformable dynamics; and latent\-control plans can be visually plausible but physically infeasible\. VLMs mediate physical knowledge through image\-text priors and token\-level reasoning, without an explicit state that evolves under action\. Video diffusion models represent state as image sequences, so physical variables exist only insofar as they are recoverable from pixel statistics\. Latent world models learn transition functions in representations optimized for prediction Ha and Schmidhuber \([2018b](https://arxiv.org/html/2605.30542#bib.bib96), [a](https://arxiv.org/html/2605.30542#bib.bib1)\); Hafner et al\. \([2019b](https://arxiv.org/html/2605.30542#bib.bib2), [a](https://arxiv.org/html/2605.30542#bib.bib56), [2021](https://arxiv.org/html/2605.30542#bib.bib29)\); Chen et al\. \([2022](https://arxiv.org/html/2605.30542#bib.bib31)\); Hafner et al\. \([2023](https://arxiv.org/html/2605.30542#bib.bib94)\); Deng et al\. \([2023](https://arxiv.org/html/2605.30542#bib.bib39)\); Hafner et al\. \([2025](https://arxiv.org/html/2605.30542#bib.bib10)\); Schrittwieser et al\. \([2020](https://arxiv.org/html/2605.30542#bib.bib28)\); Micheli et al\. \([2023](https://arxiv.org/html/2605.30542#bib.bib33)\); Hansen et al\. \([2024](https://arxiv.org/html/2605.30542#bib.bib38)\), but these latents need not correspond to physical state variables\.

This limitation follows from the training objective\. Prediction over observation sequences asks the model to infer dynamics from observed data, but this inverse problem is not unique\. Distinct physical systems can induce identical or nearly identical observations when the variables that determine evolution, such as mass, friction, compliance, inertia, restitution, viscosity, or contact state, are latent or indirectly observed\. This is not a claim that dynamics can never be identified\. With full state access, rich interventions, targeted system\-identification data, or strong physical priors, the relevant system may be recoverable\. The issue is that current training regimes often lack this information and optimize for observed outcomes rather than recovering the variables needed under intervention\.
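
A minimal example of this non\-uniqueness, assuming frictionless point\-mass blocks: two blocks of different mass produce the same passive trajectory down an incline, because mass cancels, yet respond differently to the same applied push\. The masses, incline angle, and push force are illustrative assumptions\.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def passive_position(mass, incline_deg, t):
    """Distance slid down a frictionless incline from rest; mass cancels out."""
    accel = mass * G * math.sin(math.radians(incline_deg)) / mass
    return 0.5 * accel * t ** 2


def pushed_position(mass, push_force, t):
    """Distance travelled from rest on a frictionless floor under a constant push."""
    return 0.5 * (push_force / mass) * t ** 2


times = [0.1 * k for k in range(6)]
light, heavy = 0.2, 2.0  # hypothetical masses in kg for two visually identical blocks

passive_light = [passive_position(light, 20.0, t) for t in times]
passive_heavy = [passive_position(heavy, 20.0, t) for t in times]

# Passive observation: the two trajectories coincide.
print(all(math.isclose(a, b) for a, b in zip(passive_light, passive_heavy)))

# Intervention: the same 1 N push for 0.5 s separates the two systems.
print(pushed_position(light, 1.0, 0.5), pushed_position(heavy, 1.0, 0.5))
```

No amount of additional passive footage of the incline distinguishes the two blocks in this toy setting; only the intervention, or a direct measurement of mass, does\.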

Our tests make the ambiguity explicit: appearance and actions stay nearly fixed while the correct response changes with density, restitution, deformability, contact height, friction, rigid–fluid coupling, or viscosity\. Without representing, estimating, or probing these quantities, models follow visual similarity rather than intervention\-relevant dynamics\. Large\-scale studies show the same pattern: in controlled mechanics environments, out\-of\-distribution error can remain dominated by visual similarity rather than dynamical variables Kang et al\. \([2024](https://arxiv.org/html/2605.30542#bib.bib3)\), and video generation models can violate physical laws despite large\-scale training and diverse evaluation settings Meng et al\. \([2024](https://arxiv.org/html/2605.30542#bib.bib5)\); Guo et al\. \([2025](https://arxiv.org/html/2605.30542#bib.bib7)\); Gu et al\. \([2025](https://arxiv.org/html/2605.30542#bib.bib8)\)\.

Action\-conditioned models reduce ambiguity only over the distribution of actions and regimes they observe\. They can still fail when extrapolating to unseen contact modes, material properties, or control settings Hafner et al\. \([2019b](https://arxiv.org/html/2605.30542#bib.bib2), [2025](https://arxiv.org/html/2605.30542#bib.bib10)\); Gupta et al\. \([2023](https://arxiv.org/html/2605.30542#bib.bib46)\); Janner et al\. \([2022](https://arxiv.org/html/2605.30542#bib.bib47)\); Ding et al\. \([2024](https://arxiv.org/html/2605.30542#bib.bib48)\); Rigter et al\. \([2024](https://arxiv.org/html/2605.30542#bib.bib49)\); Huang et al\. \([2025a](https://arxiv.org/html/2605.30542#bib.bib50)\)\. Likewise, existing approaches mitigate the problem without eliminating it\. Physics\-informed losses and architectural priors constrain predictions but are often soft regularizers Raissi et al\. \([2019](https://arxiv.org/html/2605.30542#bib.bib12)\); Greydanus et al\. \([2019](https://arxiv.org/html/2605.30542#bib.bib88)\); Cranmer et al\. \([2020a](https://arxiv.org/html/2605.30542#bib.bib89), [b](https://arxiv.org/html/2605.30542#bib.bib13)\); Beucler et al\. \([2021](https://arxiv.org/html/2605.30542#bib.bib99)\); object\-centric models recover useful structure but still learn interaction dynamics from data Kipf et al\. \([2020](https://arxiv.org/html/2605.30542#bib.bib76)\); Locatello et al\. \([2020](https://arxiv.org/html/2605.30542#bib.bib75)\); Liu et al\. \([2023](https://arxiv.org/html/2605.30542#bib.bib71)\); scientific surrogates and learned simulators can be accurate within fixed regimes but typically assume predefined variables or governing dynamics Pathak et al\. \([2022](https://arxiv.org/html/2605.30542#bib.bib91)\); Lam et al\. \([2022](https://arxiv.org/html/2605.30542#bib.bib92)\); Bi et al\. \([2022](https://arxiv.org/html/2605.30542#bib.bib93)\); Chang et al\. \([2017](https://arxiv.org/html/2605.30542#bib.bib52)\); Battaglia et al\. \([2016](https://arxiv.org/html/2605.30542#bib.bib98)\); Li et al\. \([2019](https://arxiv.org/html/2605.30542#bib.bib53)\); Chen et al\. \([2018](https://arxiv.org/html/2605.30542#bib.bib97)\); Sanchez\-Gonzalez et al\. \([2020](https://arxiv.org/html/2605.30542#bib.bib54)\); Pfaff et al\. \([2021](https://arxiv.org/html/2605.30542#bib.bib55)\); and differentiable or hybrid simulators shift difficulty to representation, parameter estimation, and model selection Hu et al\. \([2020](https://arxiv.org/html/2605.30542#bib.bib57)\); Jatavallabhula et al\. \([2021](https://arxiv.org/html/2605.30542#bib.bib58)\); Freeman et al\. \([2021](https://arxiv.org/html/2605.30542#bib.bib59)\); Xie et al\. \([2024](https://arxiv.org/html/2605.30542#bib.bib60)\); Abou\-Chakra et al\. \([2024](https://arxiv.org/html/2605.30542#bib.bib61)\); Zhang et al\. \([2024](https://arxiv.org/html/2605.30542#bib.bib6)\); Huang et al\. \([2025b](https://arxiv.org/html/2605.30542#bib.bib64)\); Liu et al\. \([2025](https://arxiv.org/html/2605.30542#bib.bib65)\); Chen et al\. \([2025](https://arxiv.org/html/2605.30542#bib.bib66)\); Zhou and Negrut \([2025](https://arxiv.org/html/2605.30542#bib.bib67)\)\. These tools are valuable, but they must be composed around the intervention query\.

The consequence appears under action\. Two systems that look the same over the training distribution can respond differently when intervened upon\. Reliable planning, control, and counterfactual reasoning therefore require physical structure to enter as represented, estimated, or enforced components, rather than only as an emergent property of observation prediction\.

## 3 Physically viable world modeling

We now describe a constructive view of physically viable world models\. The central idea is that an embodied agent should begin with an intervention query, determine which physical distinctions the query depends on, and construct the simplest abstraction sufficient to preserve those distinctions\. We separate this process into two parts: an orchestrator that selects the relevant abstractions and compatible components for the query, and a world model that composes those choices into a model of the system under intervention\. Figure [3](https://arxiv.org/html/2605.30542#S3.F3) gives a representative construction for counterfactual, planning, verification, and parameter\-identification queries\. We then use three examples \(a ramp\-and\-cup interaction, robot pouring with latent viscosity, and flooded\-road traversal\) to show how the query determines variable selection, parameter estimation, action selection, dynamics, and constraints\.

### 3\.1 Constructing the simplest abstraction sufficient for a query

![Refer to caption](https://arxiv.org/html/2605.30542v1/x3.png)

Figure 3: Representative construction of a physically viable world model\. A user\-specified intervention query determines the physical abstraction to be constructed\. Perception and representation recover query\-relevant scene variables, action specification defines admissible interventions, dynamics and constraints evolve the selected state under action, and prediction returns the response required by the query\. The compatibility relation emphasizes that the selected representation, action interface, dynamics, constraints, and output must support the same abstraction\.

Intervention queries\. A physically viable world model is constructed around the physical requirements determined by an intervention query\. In embodied settings, the query is a request about interaction with the world, such as what will happen if a particular action is taken, which action will produce a desired outcome, or whether an action remains safe under uncertainty\. The query defines the modeling problem because it specifies the intervention being considered, the outcome that matters, the physical uncertainty that must be resolved, and the form of the answer needed for decision\-making\. These elements determine the abstraction: the variables to represent, the parameters to estimate or bound, the actions to admit, the dynamics to use, the constraints to enforce, and the response the model should return\. A model that answers the query does not need to represent every physical detail of the scene\. This query\-conditioned construction is a design principle rather than a fixed architecture\. The appropriate abstraction is the one that preserves the physical distinctions that can change the answer\. If the answer depends on mass, friction, contact geometry, fill level, or viscosity, then those quantities must be represented, estimated, or bounded\. A richer model may be useful when it improves reuse, robustness, or safety margins, but additional detail also creates estimation and validation obligations\.

Variable selection\. Variable selection determines the state, latent parameters, and admissible actions required to answer the query\. In Figure [3](https://arxiv.org/html/2605.30542#S3.F3), perception provides evidence from available sensors, representation maps that evidence into physical variables, and action specification defines interventions in the same variable space\. These components must be compatible: an action is only meaningful if it acts on variables represented in the state, and the state is only useful if it exposes the quantities the action changes\. For example, a force applied at a contact point requires a representation containing geometry, pose, contact location, mass, and velocity, rather than raw pixel observations alone\. Different queries therefore require different abstractions\. Navigation may require free space and robot pose, grasping may require contact geometry and friction, and pouring may require fill level, container geometry, and fluid properties\. Variable selection must also account for latent quantities that cannot be identified from passive observation alone, including mass, friction, viscosity, compliance, or contact state\. When such quantities affect the intervention outcome, the model should estimate them, maintain uncertainty over possible values, gather additional information, or return a conditional response rather than collapsing unresolved uncertainty into a single unsupported prediction\.
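
One way to avoid collapsing unresolved uncertainty is to evaluate the query for every plausible value of a latent parameter and report whether the outcomes agree\. The sketch below assumes a hypothetical `conditional_response` helper and a toy friction\-threshold query; neither comes from the paper\.

```python
def conditional_response(outcome_fn, latent_candidates):
    """Evaluate a query over every plausible latent value.

    Returns ('determined', outcome) when all candidates agree, otherwise
    ('conditional', {latent_value: outcome}) so the caller can gather more
    evidence instead of receiving a single unsupported prediction.
    """
    outcomes = {value: outcome_fn(value) for value in latent_candidates}
    distinct = set(outcomes.values())
    if len(distinct) == 1:
        return "determined", distinct.pop()
    return "conditional", outcomes


# Hypothetical query: does a 3 N horizontal push move a 1 kg box?
# The friction coefficient is latent and only known to lie in a candidate set.
def box_moves(friction_coeff, push_force=3.0, mass=1.0, g=9.81):
    return push_force > friction_coeff * mass * g


print(conditional_response(box_moves, [0.1, 0.2]))
print(conditional_response(box_moves, [0.2, 0.5]))
```

When the candidate set straddles the threshold, the response stays conditional on friction, which signals that probing or additional sensing is needed before committing to an action\.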

State evolution and constraints\. After the state and action variables are selected, the model must specify how the state evolves under the admissible intervention\. This corresponds to the dynamics, constraints, and next\-state blocks in Figure [3](https://arxiv.org/html/2605.30542#S3.F3)\. A simple query may only require a closed\-form relation, such as a kinematic equation, conservation law, or stability condition\. A query involving contact, liquid transfer, deformation, or coupled rigid\-body motion may require a numerical solver\. A learned simulator or constrained surrogate may be appropriate when analytic equations are unavailable, numerical solvers are too expensive, or the relevant physical process is only partially specified\. A composite model may combine these pieces, for example by using learned perception to recover geometry, an analytic calculation for a stability margin, a numerical solver for contact, and a verifier for safety\. The goal is to select dynamics with the fidelity required by the query, not to build a universal simulator of the scene\.

The important requirement is that the selected dynamics preserve the physical structure that determines the query response\. A learned simulator may be physically viable if it respects the relevant regimes, invariants, and constraints\. A numerical simulator may fail if it uses the wrong physical regime or omits the parameter that the query depends on\. A reduced analytic model may be preferable when it captures the decisive quantity without introducing unnecessary state\.

Query\-level response\. The output of a physically viable world model is the query\-level response required for decision\-making\. This corresponds to the prediction block in Figure [3](https://arxiv.org/html/2605.30542#S3.F3)\. For example, the answer may be a trajectory, a parameter estimate, a feasible action set, a reachability set, a stability boundary, or a verification certificate\. The model is judged by whether its response preserves the physical distinctions the query depends on\. A visually plausible rollout can still fail if it omits the variable that determines the intervention outcome\.

Orchestrator\. This construction is useful only if the abstraction and components can be selected automatically, which remains the central open problem\. An early orchestrator does not need to solve open\-ended physical reasoning\. It can instead operate conservatively by starting from a library of physical abstractions, selecting candidate variables from the query and scene context, testing whether latent parameters are identifiable from available evidence, routing between analytic, numerical, learned, or hybrid models, and returning uncertainty when the construction is underdetermined\. Existing tools already provide partial mechanisms for this role, including model selection in system identification, program synthesis, differentiable simulation, tool\-using language models, model predictive control, active information gathering, and verification over constrained dynamics\.

The orchestrator should therefore be understood not as a fixed module, but as an adaptive, closed\-loop selection and checking process\. It routes between analytic models, numerical simulators, learned surrogates, estimators, controllers, and verification tools, and it revises the construction when the query changes, new evidence arrives, or a selected component fails a compatibility check\. It may also determine that the available information is insufficient and that additional sensing or interaction is required before the query can be answered\. Its role is to construct the simplest compatible world model for the query and ensure that the composed model supports the required reasoning under intervention\. The orchestrator does not need to be a single learned module, but it does require the selection problem itself to be explicit\. A physically viable world model should therefore not only produce an answer to the query, but also expose why the selected abstraction is adequate, which assumptions it depends on, which latent quantities remain uncertain, and which constraints were checked\. This makes the model useful both for planning and control and for diagnosing when the query cannot be answered from the available information\.
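
A conservative orchestrator of the kind described above might be sketched as follows\. The `Abstraction` record, the `orchestrate` function, the integer cost used as a proxy for simplicity, and the single\-entry library are all assumptions made for illustration; a practical orchestrator would also test identifiability and component compatibility rather than only variable availability\.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Abstraction:
    name: str
    query_type: str
    required: List[str]                            # variables the dynamics act on
    cost: int                                      # crude proxy for model complexity
    answer: Callable[[Dict[str, float]], object]   # query-level response


def orchestrate(query_type: str, evidence: Dict[str, float],
                library: List[Abstraction]) -> Dict[str, object]:
    """Pick the cheapest abstraction whose required variables are all available.

    If no abstraction can be instantiated, report which variables are missing
    instead of returning an unsupported prediction.
    """
    candidates = sorted((a for a in library if a.query_type == query_type),
                        key=lambda a: a.cost)
    missing_report = {}
    for abstraction in candidates:
        missing = [v for v in abstraction.required if v not in evidence]
        if not missing:
            return {"status": "answered",
                    "abstraction": abstraction.name,
                    "assumed": abstraction.required,
                    "response": abstraction.answer(evidence)}
        missing_report[abstraction.name] = missing
    return {"status": "underdetermined", "missing": missing_report}


# Hypothetical one-entry library: a static-friction threshold for a pushed box.
library = [
    Abstraction("static_friction_threshold", "will_it_move",
                ["friction", "mass", "push_force"], cost=1,
                answer=lambda s: s["push_force"] > s["friction"] * s["mass"] * 9.81),
]

print(orchestrate("will_it_move", {"mass": 1.0, "push_force": 3.0}, library))
print(orchestrate("will_it_move",
                  {"mass": 1.0, "push_force": 3.0, "friction": 0.2}, library))
```

The first call reports that friction is missing, which corresponds to requesting additional sensing or interaction; the second returns an answer together with the abstraction used and the variables it depended on, so the result can be audited against the query\.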

### 3\.2 Illustrative demonstrations

In this section, we support our position with constructions that show how query\-conditioned physical abstractions could address the failure modes we have identified\.

![Refer to caption](https://arxiv.org/html/2605.30542v1/x4.png)

Figure 4: Counterfactual evaluation of a ramp\-to\-cup intervention query under varying release heights, ball materials, and fill conditions\.

Placing a ball on a ramp: The first demonstration considers a ball released on a ramp toward a fluid\-filled cup\. The intervention query asks where the ball should be released so that the cup tips over, or where it should be released so that the cup remains upright\. The query\-level response is the set of release conditions that produce tipping or non\-tipping behavior\.

Figure [4](https://arxiv.org/html/2605.30542#S3.F4) shows world model evaluations of this query under different release heights, ball materials, and fill conditions\. Each rollout keeps the scene geometry nearly fixed while changing physical parameters that affect the outcome\. The same visible setup can tip or remain upright depending on the ball mass, restitution, release height, fluid volume, and fluid response\. These parameters shift the release conditions separating tipping from non\-tipping outcomes\. The important point is that these differences are physical rather than visual\. A steel ball, a wooden ball, and an aluminum ball can look nearly identical while transferring different momentum at impact\. Changing the release height changes the impact speed, while changing the fill level changes the cup–fluid center of mass and the stability margin\. These quantities determine the outcome of the interaction, but they may not be recoverable from visual appearance alone\.

This example illustrates the central claim of the paper\. The intervention outcome depends on latent physical quantities that observation\-predictive models can miss when visually similar scenes behave differently under action\. For this query, the physically viable abstraction is the minimal set of variables, parameters, and dynamics needed to determine whether the cup tips under intervention\. A physically viable world model must therefore represent, estimate, or condition on these quantities rather than rely only on visual similarity\.
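
A back\-of\-the\-envelope model makes this dependence concrete\. The sketch below is a toy calculation under strong assumptions that are ours and not the simulation used in the paper: the ball slides without friction or rolling, strikes the cup horizontally at a fixed height, the cup pivots about its far base edge without sliding, and the cup and fluid are lumped into a point mass at their combined center of mass\. All names, densities, restitution values, and dimensions are illustrative\.

```python
import math
from dataclasses import dataclass

G = 9.81  # gravitational acceleration, m/s^2


@dataclass
class Cup:
    mass: float         # cup plus fluid, kg
    base_radius: float  # m
    com_height: float   # height of the combined center of mass, m


def tipping_barrier(cup: Cup) -> float:
    """Energy (J) needed to lift the center of mass over the pivot edge."""
    lift = math.hypot(cup.base_radius, cup.com_height) - cup.com_height
    return cup.mass * G * lift


def min_release_height(ball_mass: float, restitution: float,
                       hit_height: float, cup: Cup) -> float:
    """Smallest release height (m) at which this toy model predicts tipping."""
    # Point-mass moment of inertia of cup plus fluid about the pivot edge.
    inertia = cup.mass * (cup.base_radius ** 2 + cup.com_height ** 2)
    # Tipping needs rotational energy (J * h)^2 / (2 I) >= barrier.
    required_impulse = math.sqrt(2.0 * inertia * tipping_barrier(cup)) / hit_height
    # Two-body impact with restitution e: J = (1 + e) v / (1/m + h^2 / I).
    required_speed = (required_impulse
                      * (1.0 / ball_mass + hit_height ** 2 / inertia)
                      / (1.0 + restitution))
    # Frictionless slide from rest: v^2 = 2 g H.
    return required_speed ** 2 / (2.0 * G)


if __name__ == "__main__":
    volume = 4.0 / 3.0 * math.pi * 0.02 ** 3   # ball of radius 2 cm
    half_full = Cup(mass=0.25, base_radius=0.035, com_height=0.05)
    # (density in kg/m^3, restitution); illustrative values only.
    for name, density, e in [("steel", 7850, 0.6),
                             ("aluminum", 2700, 0.6),
                             ("wood", 700, 0.5)]:
        h = min_release_height(density * volume, e, 0.02, half_full)
        print(f"{name}: tips if released above {h:.3f} m")
```

Under these assumptions the threshold release height is far lower for a steel ball than for a wooden ball of the same size, and it rises when the cup holds more fluid, even though the rendered scene would look nearly the same in each case\.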

Pouring fluid into a jar: The second demonstration considers a robot arm pouring liquid from a source glass into a receiving jar\. The intervention query is not whether a rollout visually resembles pouring, but which robot motion transfers a target volume, such as half a glass, without underfilling, overshooting, or spilling\. The query\-level response is therefore an action, policy, or termination condition rather than a rendered video\.

Let $u_{0:T}$ denote the robot motion over horizon $T$, including tilt trajectory, angular velocity, hold duration, and stopping rule\. We consider three related queries\.

First, a planning query asks for a motion such that the received volume reaches a target $V^{\star}$ within tolerance, $V_{\mathrm{recv}}(T)\in[V^{\star}-\epsilon,\,V^{\star}+\epsilon]$, while the spilled volume satisfies $V_{\mathrm{spill}}(T)\leq\delta$\. For a water\-like liquid, a nominal tilt and hold duration may be sufficient\. The required abstraction includes container geometry, robot pose, initial fill, liquid volume, gravity, and the fluid parameters governing transfer rate\.

Second, a counterfactual action query asks how the motion should change when liquid properties change while geometry and target remain fixed\. In our longer\-pour experiment, replacing a water\-like liquid with a honey\-like, high\-viscosity liquid changes the correct action: the source glass must remain inclined longer to transfer the same half\-glass volume\. This difference is physical rather than visual\. Higher viscosity slows the transfer curve and changes the residual–spill tradeoff, so a model that treats pouring as a generic visual event may underfill slow\-flowing liquids or overshoot faster\-flowing ones\.

Third, a parameter\-identification query asks which latent fluid parameter best explains an observed fixed\-motion pour\. In our viscosity\-estimation experiment, the robot executes the same prescribed trajectory for the same duration, and candidate simulations with different viscosities are scored against query\-relevant quantities: receiver fill curve, residual liquid in the source glass, spilled volume, and first\-arrival time in the receiver\. Bayesian optimization then searches over viscosity values to identify a parameter estimate, or uncertainty set, that best explains the observed pour and can be passed to the planner for a subsequent target volume or tolerance\.
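
The three queries can be illustrated with a deliberately simple stand\-in for the fluid simulator\. The sketch below assumes a hypothetical exponential transfer curve whose time constant grows linearly with a relative viscosity in arbitrary units, ignores spilling, scores candidates on the receiver fill curve only, and replaces the Bayesian optimization used in the experiment with a plain search over a candidate list\. The function names and constants are ours and are not part of the paper's pipeline\.

```python
import math
from typing import Sequence

V0 = 200.0               # liquid initially in the source glass, mL (illustrative)
TAU_PER_VISCOSITY = 2.0  # seconds of time constant per unit of relative viscosity


def received_volume(t: float, viscosity: float) -> float:
    """Toy transfer curve: volume in the receiver after holding the tilt for t s."""
    tau = TAU_PER_VISCOSITY * viscosity
    return V0 * (1.0 - math.exp(-t / tau))


def satisfies_query(v_recv: float, v_spill: float,
                    target: float, eps: float, delta: float) -> bool:
    """Planning constraint: V_recv in [V* - eps, V* + eps] and V_spill <= delta."""
    return (target - eps) <= v_recv <= (target + eps) and v_spill <= delta


def hold_duration(target: float, viscosity: float) -> float:
    """Invert the toy curve: hold time that delivers the target volume."""
    tau = TAU_PER_VISCOSITY * viscosity
    return -tau * math.log(1.0 - target / V0)


def estimate_viscosity(times: Sequence[float], observed_fill: Sequence[float],
                       candidates: Sequence[float]) -> float:
    """Pick the candidate whose simulated fill curve best matches the probe pour."""
    def loss(mu: float) -> float:
        return sum((received_volume(t, mu) - v) ** 2
                   for t, v in zip(times, observed_fill))
    return min(candidates, key=loss)


if __name__ == "__main__":
    target = V0 / 2.0
    t_water = hold_duration(target, viscosity=1.0)   # water-like
    t_honey = hold_duration(target, viscosity=5.0)   # honey-like (arbitrary units)
    print(f"hold water-like: {t_water:.2f} s, honey-like: {t_honey:.2f} s")

    # Reusing the water-like plan on the honey-like liquid underfills.
    underfilled = received_volume(t_water, viscosity=5.0)
    print(satisfies_query(underfilled, 0.0, target, eps=1.0, delta=0.5))

    # Parameter identification from a fixed probing pour.
    times = [1.0, 2.0, 4.0, 8.0]
    observed = [received_volume(t, 5.0) for t in times]
    print(estimate_viscosity(times, observed, [0.5, 1.0, 2.0, 5.0, 10.0]))
```

A fuller version would score residual liquid, spilled volume, and first\-arrival time as well, and would return an uncertainty set rather than a single estimate\.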

This example illustrates the role of the orchestrator\. If the liquid is known, it may directly search over tilt angle, duration, and stopping rule\. If the liquid is unknown, it should insert a probing action, estimate viscosity from observed fill, residual, spill, and arrival behavior, and then re\-plan under the inferred parameter\. If the estimate remains ambiguous near the decision boundary, the response should remain conditional or use feedback termination based on receiver fill\. Thus, viscosity is not a visual attribute but a decision\-relevant latent variable\. A physically viable world model must expose, estimate, and use this variable when the intervention query depends on it\.
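
The branching described in this paragraph can be written as a small decision rule\. This is a sketch under assumptions of our own: the transfer curve is a hypothetical exponential model with a time constant proportional to relative viscosity, the uncertainty set is an interval, and the function and label names are illustrative\. It plans open loop when the liquid is known or the interval is tight, requests a probing pour when nothing is known, and falls back to feedback termination when the worst\-case volume error over the interval exceeds the tolerance\.

```python
import math
from typing import Optional, Tuple

V0 = 200.0  # liquid initially in the source glass, mL (illustrative)


def delivered(t: float, mu: float) -> float:
    """Hypothetical transfer curve for relative viscosity mu."""
    return V0 * (1.0 - math.exp(-t / (2.0 * mu)))


def hold_time(target: float, mu: float) -> float:
    """Hold time that delivers the target under the hypothetical curve."""
    return -2.0 * mu * math.log(1.0 - target / V0)


def choose_response(target: float, eps: float,
                    mu_known: Optional[float] = None,
                    mu_interval: Optional[Tuple[float, float]] = None):
    """Return (response_type, hold_time_or_None)."""
    if mu_known is not None:
        # Known liquid: search directly over the motion (here, the hold time).
        return ("open_loop", hold_time(target, mu_known))
    if mu_interval is None:
        # Unknown liquid and no estimate yet: insert a probing action first.
        return ("probe_first", None)
    lo, hi = mu_interval
    t_plan = hold_time(target, math.sqrt(lo * hi))  # plan at the geometric midpoint
    # delivered() is monotone in mu, so the worst case sits at an endpoint.
    worst = max(abs(delivered(t_plan, mu) - target) for mu in (lo, hi))
    if worst <= eps:
        return ("open_loop", t_plan)
    # Estimate is ambiguous near the decision boundary: stop on measured fill.
    return ("feedback_termination", None)


if __name__ == "__main__":
    print(choose_response(100.0, 5.0, mu_known=1.0))
    print(choose_response(100.0, 5.0))
    print(choose_response(100.0, 5.0, mu_interval=(4.9, 5.1)))
    print(choose_response(100.0, 5.0, mu_interval=(1.0, 5.0)))
```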

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/genesis_pour_target_honey_2x5_timestep_grid.jpg)

Figure 5: Viscosity\-dependent pouring under a query\-conditioned world model\. \(a\) A fixed probing pour is used to infer the viscosity of an unknown liquid from flow, fill, residual, and spill behavior\. \(b\) The estimated viscosity is then used to select a pouring motion and stopping condition that transfers a target volume equal to half of the liquid initially in the source glass\.

![Refer to caption](https://arxiv.org/html/2605.30542v1/x5.png)

Figure 6: Query\-conditioned construction of a physically viable driving model from an observation\-derived Gaussian splat\. Top: the original reconstructed scene representation\. Bottom: the scene augmented with rigid support geometry and MPM\-based fluid simulation for evaluating flooded\-road traversal under rigid\-fluid interaction\. The orange circle highlights the position of the vehicle in each simulation\.

Driving a truck on flooded roads: We next consider a driving query in a reconstructed outdoor scene with flooded roads\. The intervention query asks whether the truck can traverse the flooded region without becoming immobilized, unstable, or submerged beyond critical depth\. The output is a traversal judgment, feasible route, or unsafe region conditioned on vehicle assumptions\.

The scene is reconstructed from aerial observations using a Gaussian splat representation \(Kerbl et al\., [2023](https://arxiv.org/html/2605.30542#bib.bib69)\), which serves as a local geometric map of road layout, terrain shape, obstacles, and appearance from sparse views\. Figure [6](https://arxiv.org/html/2605.30542#S3.F6) compares this observation\-based representation with the physically viable model constructed for the traversal query\. The top row shows the original Gaussian splat scene with rigid support geometry, while the bottom row augments it with material point method \(MPM\)\-based fluid simulation\. The reconstruction alone does not contain the physical structure needed to answer the query, but it provides the geometric basis for constructing the query\-specific model\.

The orchestrator augments this geometry with intervention\-relevant physics\. Static scene Gaussians are treated as rigid terrain and support boundaries; reconstructed elevation determines where water pools; and an MPM fluid simulation evolves the flooded region under gravity and rigid interaction\. The truck is modeled as a rigid body and evaluated using traversal\-relevant quantities, including water depth, drag, wheel submersion, traction loss, and stability margins\. Under these flooded\-road dynamics, the truck fails the query’s safety condition, indicating that the flooded region is not safely traversable under the assumed vehicle and terrain parameters\.

Several quantities remain unresolved from aerial observations alone, including terrain compliance, subsurface support, tire–soil interaction, road friction, water depth, current velocity, and vehicle mass distribution\. A physically viable response should therefore expose these assumptions, evaluate sensitivity to unresolved parameters, or return conditional predictions when the outcome depends on variables not identifiable from the available evidence\.
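
One way to return such a conditional prediction is to sweep the unresolved parameters and report whether the verdict is the same everywhere\. The sketch below is illustrative only: the safety test uses a wading\-depth limit and a depth\-times\-velocity limit with placeholder thresholds chosen by us, it is not the MPM\-based evaluation described above, and the function names are hypothetical\.

```python
from itertools import product
from typing import Dict, Sequence, Tuple


def is_safe(depth: float, current_speed: float,
            wading_limit: float = 0.6, dv_limit: float = 0.5) -> bool:
    """Placeholder traversal test: depth (m) and depth * speed (m^2/s) limits."""
    return depth <= wading_limit and depth * current_speed <= dv_limit


def conditional_verdict(depths: Sequence[float], speeds: Sequence[float]
                        ) -> Tuple[str, Dict[Tuple[float, float], bool]]:
    """Sweep the unresolved parameters and summarize the outcome."""
    outcomes = {(d, v): is_safe(d, v) for d, v in product(depths, speeds)}
    if all(outcomes.values()):
        return "traversable", outcomes
    if not any(outcomes.values()):
        return "not traversable", outcomes
    # The answer depends on parameters the evidence does not pin down.
    return "conditional", outcomes


if __name__ == "__main__":
    verdict, outcomes = conditional_verdict(depths=[0.3, 0.8], speeds=[0.5, 1.5])
    print(verdict)
    for (depth, speed), safe in sorted(outcomes.items()):
        print(f"depth={depth} m, current={speed} m/s -> safe={safe}")
```

A conditional verdict tells the planner which measurements, such as a depth sounding, would resolve the query\.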

## 4 Conclusion & limitations

In this paper, we argued that physically viable world models should construct query\-conditioned representations that preserve the physical distinctions relevant to the intervention being considered, rather than merely predict future observations\. In this view, physical structure is not an emergent property of large\-scale observation prediction, but an explicit requirement of the model construction process\. While this perspective does not solve all aspects of world modeling, it identifies a class of failures that cannot be resolved through scaling alone and motivates architectures that explicitly represent actions, state, dynamics, constraints, and uncertainty\.

Nevertheless, several open problems remain\. Physical viability is relative to the query, so no single abstraction is sufficient for every intervention\. Richer abstractions may improve robustness or transfer, but they also increase the burden of estimation and validation\. Creating a model that can identify the right abstraction at the right level of fidelity for a given query remains an open problem\. Physically viable constructions also do not overcome identifiability limits\. Latent quantities such as mass, friction, viscosity, or hidden contact geometry may remain unresolved from passive observation, requiring estimation, active interaction, or conditional responses\. Finally, the central open problem is orchestration\. A physically viable world model must select the abstraction, identify unresolved quantities, choose dynamics and constraints, and determine the form of the response\. Task and motion planning offers a precedent for hierarchical decomposition \(Garrett et al\., [2020](https://arxiv.org/html/2605.30542#bib.bib34); Kaelbling and Lozano\-Pérez, [2013](https://arxiv.org/html/2605.30542#bib.bib35)\), and recent simulator\-grounded world models compose perception, simulation, and policy in restricted settings \(Barcellona et al\., [2024](https://arxiv.org/html/2605.30542#bib.bib37)\)\. Our demonstrations show how intervention queries can drive these choices in controlled settings, but autonomous orchestration remains an open research problem\.

## References

- \[1\] J\. Abou\-Chakra, K\. Rana, F\. Dayoub, and N\. Sünderhauf \(2024\) Physically embodied Gaussian splatting: a realtime correctable world model for robotics\. arXiv preprint arXiv:2406\.10788\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[2\] Artificial Analysis \(2026\) Artificial analysis image\-to\-video leaderboard \(open\-weights models\)\. Note: [https://artificialanalysis\.ai/video/leaderboard/image\-to\-video?open\-weights=true](https://artificialanalysis.ai/video/leaderboard/image-to-video?open-weights=true) Accessed: 2026\-04\-27\. Cited by: [§A\.3](https://arxiv.org/html/2605.30542#A1.SS3.p3.1)\.
- \[3\] Artificial Analysis \(2026\) Artificial analysis text\-to\-video leaderboard \(open\-weights models\)\. Note: [https://artificialanalysis\.ai/video/leaderboard/text\-to\-video?open\-weights=true](https://artificialanalysis.ai/video/leaderboard/text-to-video?open-weights=true) Accessed: 2026\-04\-27\. Cited by: [§A\.3](https://arxiv.org/html/2605.30542#A1.SS3.p3.1)\.
- \[4\] M\. Assran, A\. Bardes, D\. Fan, Q\. Garrido, R\. Howes, M\. Muckley, A\. Rizvi, C\. Roberts, K\. Sinha, A\. Zholus, et al\. \(2025\) V\-jepa 2: self\-supervised video models enable understanding, prediction and planning\. arXiv preprint arXiv:2506\.09985\. Cited by: [§A\.4](https://arxiv.org/html/2605.30542#A1.SS4.p1.1), [§A\.4](https://arxiv.org/html/2605.30542#A1.SS4.p2.1)\.
- \[5\] Genesis Authors \(2024\-12\) Genesis: a generative and universal physics engine for robotics and beyond\. External Links: [Link](https://github.com/Genesis-Embodied-AI/Genesis)\. Cited by: [§A\.1](https://arxiv.org/html/2605.30542#A1.SS1.p3.1)\.
- \[6\] L\. Barcellona, A\. Zadaianchuk, D\. Allegro, S\. Papa, S\. Ghidoni, and E\. Gavves \(2024\) Dream to manipulate: compositional world models empowering robot imitation learning with imagination\. arXiv preprint arXiv:2412\.14957\. Cited by: [§4](https://arxiv.org/html/2605.30542#S4.p2.1)\.
- \[7\] A\. Bardes, Q\. Garrido, J\. Ponce, X\. Chen, M\. Rabbat, Y\. LeCun, M\. Assran, and N\. Ballas \(2024\) Revisiting feature prediction for learning visual representations from video\. arXiv preprint arXiv:2404\.08471\. Cited by: [§A\.4](https://arxiv.org/html/2605.30542#A1.SS4.p1.1)\.
- \[8\] P\. Battaglia, R\. Pascanu, M\. Lai, D\. Jimenez Rezende, et al\. \(2016\) Interaction networks for learning about objects, relations and physics\. Advances in neural information processing systems 29\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[9\] J\. Bender and D\. Koschier \(2015\) Divergence\-free smoothed particle hydrodynamics\. In Proceedings of the 14th ACM SIGGRAPH/Eurographics symposium on computer animation, pp\. 147–155\. Cited by: [§A\.1](https://arxiv.org/html/2605.30542#A1.SS1.p3.1)\.
- \[10\] T\. Beucler, M\. Pritchard, S\. Rasp, J\. Ott, P\. Baldi, and P\. Gentine \(2021\) Enforcing analytic constraints in neural networks emulating physical systems\. Physical review letters 126\(9\), pp\. 098302\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[11\] K\. Bi, L\. Xie, H\. Zhang, X\. Chen, X\. Gu, and Q\. Tian \(2022\) Pangu\-Weather: a 3d high\-resolution model for fast and accurate global weather forecast\. arXiv preprint arXiv:2211\.02556\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[12\] M\. B\. Chang, T\. Ullman, A\. Torralba, and J\. B\. Tenenbaum \(2017\) A compositional object\-based approach to learning physical dynamics\. In International Conference on Learning Representations \(ICLR\), Note: arXiv:1612\.00341\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[13\] B\. Chen, H\. Jiang, S\. Liu, S\. Gupta, Y\. Li, H\. Zhao, and S\. Wang \(2025\) PhysGen3D: crafting a miniature interactive world from a single image\. In IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), Note: arXiv:2503\.20746\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[14\] C\. Chen, Y\. Wu, J\. Yoon, and S\. Ahn \(2022\) TransDreamer: reinforcement learning with transformer world models\. In Deep Reinforcement Learning Workshop, NeurIPS 2021, Note: arXiv:2202\.09481\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p1.1)\.
- \[15\] R\. T\. Chen, Y\. Rubanova, J\. Bettencourt, and D\. K\. Duvenaud \(2018\) Neural ordinary differential equations\. Advances in neural information processing systems 31\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[16\] W\. Chow, J\. Mao, B\. Li, D\. Seita, V\. Guizilini, and Y\. Wang \(2025\) Physbench: benchmarking and enhancing vision\-language models for physical world understanding\. arXiv preprint arXiv:2501\.16411\. Cited by: [§1](https://arxiv.org/html/2605.30542#S1.p3.1)\.
- \[17\] M\. Cranmer, S\. Greydanus, S\. Hoyer, P\. Battaglia, D\. Spergel, and S\. Ho \(2020\) Lagrangian neural networks\. In ICLR Workshop on Deep Differential Equations, Note: arXiv:2003\.04630\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[18\] M\. Cranmer, A\. Sanchez Gonzalez, P\. Battaglia, R\. Xu, K\. Cranmer, D\. Spergel, and S\. Ho \(2020\) Discovering symbolic models from deep learning with inductive biases\. Advances in neural information processing systems 33, pp\. 17429–17442\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[19\] F\. Deng, J\. Park, and S\. Ahn \(2023\) Facing off world model backbones: RNNs, transformers, and S4\. In Advances in Neural Information Processing Systems \(NeurIPS\), Note: arXiv:2307\.02064\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p1.1)\.
- \[20\] Z\. Ding, A\. Zhang, Y\. Tian, and Q\. Zheng \(2024\) Diffusion world model: future modeling beyond step\-by\-step rollout for offline reinforcement learning\. arXiv preprint arXiv:2402\.03570\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[21\] C\. D\. Freeman, E\. Frey, A\. Raichuk, S\. Girgin, I\. Mordatch, and O\. Bachem \(2021\) Brax – a differentiable physics engine for large scale rigid body simulation\. arXiv preprint arXiv:2106\.13281\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[22\] C\. R\. Garrett, T\. Lozano\-Pérez, and L\. P\. Kaelbling \(2020\) Pddlstream: integrating symbolic planners and blackbox samplers via optimistic adaptive planning\. In Proceedings of the international conference on automated planning and scheduling, Vol\. 30, pp\. 440–448\. Cited by: [§4](https://arxiv.org/html/2605.30542#S4.p2.1)\.
- \[23\] S\. Greydanus, M\. Dzamba, and J\. Yosinski \(2019\) Hamiltonian neural networks\. In Advances in Neural Information Processing Systems \(NeurIPS\), Note: arXiv:1906\.01563\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[24\] J\. Gu, X\. Liu, Y\. Zeng, A\. Nagarajan, F\. Zhu, D\. Hong, Y\. Fan, Q\. Yan, K\. Zhou, M\. Liu, et al\. \(2025\) "PhyWorldBench": a comprehensive evaluation of physical realism in text\-to\-video models\. arXiv preprint arXiv:2507\.13428\. Cited by: [§1](https://arxiv.org/html/2605.30542#S1.p3.1), [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p3.1)\.
- \[25\] X\. Guo, J\. Huo, Z\. Shi, Z\. Song, J\. Zhang, and J\. Zhao \(2025\) T2vphysbench: a first\-principles benchmark for physical consistency in text\-to\-video generation\. arXiv preprint arXiv:2505\.00337\. Cited by: [§1](https://arxiv.org/html/2605.30542#S1.p3.1), [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p3.1)\.
- \[26\] A\. Gupta, S\. Tian, Y\. Zhang, J\. Wu, R\. Martín\-Martín, and F\. Li \(2023\) MaskViT: masked visual pre\-training for video prediction\. In International Conference on Learning Representations \(ICLR\), Note: arXiv:2206\.11894\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[27\] D\. Ha and J\. Schmidhuber \(2018\) Recurrent world models facilitate policy evolution\. Advances in neural information processing systems 31\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p1.1)\.
- \[28\] D\. Ha and J\. Schmidhuber \(2018\) World models\. arXiv preprint arXiv:1803\.10122, 2\(3\), pp\. 440\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p1.1)\.
- \[29\] Y\. HaCohen, B\. Brazowski, N\. Chiprut, Y\. Bitterman, A\. Kvochko, A\. Berkowitz, D\. Shalem, D\. Lifschitz, D\. Moshe, E\. Porat, et al\. \(2026\) LTX\-2: efficient joint audio\-visual foundation model\. arXiv preprint arXiv:2601\.03233\. Cited by: [§A\.3](https://arxiv.org/html/2605.30542#A1.SS3.p1.1)\.
- \[30\] D\. Hafner, T\. Lillicrap, J\. Ba, and M\. Norouzi \(2019\) Dream to control: learning behaviors by latent imagination\. arXiv preprint arXiv:1912\.01603\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p1.1)\.
- \[31\] D\. Hafner, T\. Lillicrap, I\. Fischer, R\. Villegas, D\. Ha, H\. Lee, and J\. Davidson \(2019\) Learning latent dynamics for planning from pixels\. In International conference on machine learning, pp\. 2555–2565\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p1.1), [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[32\] D\. Hafner, T\. Lillicrap, M\. Norouzi, and J\. Ba \(2021\) Mastering Atari with discrete world models\. In International Conference on Learning Representations \(ICLR\), Note: arXiv:2010\.02193\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p1.1)\.
- \[33\] D\. Hafner, J\. Pasukonis, J\. Ba, and T\. Lillicrap \(2023\) Mastering diverse domains through world models\. arXiv preprint arXiv:2301\.04104\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p1.1)\.
- \[34\] D\. Hafner, J\. Pasukonis, J\. Ba, and T\. Lillicrap \(2025\) Mastering diverse control tasks through world models\. Nature 640\(8059\), pp\. 647–653\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p1.1), [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[35\] N\. Hansen, H\. Su, and X\. Wang \(2024\) TD\-MPC2: scalable, robust world models for continuous control\. In International Conference on Learning Representations \(ICLR\), Note: arXiv:2310\.16828\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p1.1)\.
- \[36\] Y\. Hu, L\. Anderson, T\. Li, Q\. Sun, N\. Carr, J\. Ragan\-Kelley, and F\. Durand \(2020\) DiffTaichi: differentiable programming for physical simulation\. In International Conference on Learning Representations \(ICLR\), Note: arXiv:1910\.00935\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[37\] S\. Huang, J\. Wu, Q\. Zhou, S\. Miao, and M\. Long \(2025\) Vid2World: crafting video diffusion models to interactive world models\. arXiv preprint arXiv:2505\.14357\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[38\] T\. Huang, H\. Zhang, Y\. Zeng, Z\. Zhang, H\. Li, W\. Zuo, and R\. W\. H\. Lau \(2025\) DreamPhysics: learning physics\-based 3d dynamics with video diffusion priors\. In AAAI Conference on Artificial Intelligence, Note: arXiv:2406\.01476\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[39\] M\. Janner, Y\. Du, J\. B\. Tenenbaum, and S\. Levine \(2022\) Planning with diffusion for flexible behavior synthesis\. In Proceedings of the 39th International Conference on Machine Learning \(ICML\), Proceedings of Machine Learning Research, Vol\. 162, pp\. 9902–9915\. Note: arXiv:2205\.09991\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[40\] K\. M\. Jatavallabhula, M\. Macklin, F\. Golemo, V\. Voleti, L\. Petrini, M\. Weiss, B\. Considine, J\. Parent\-Levesque, K\. Xie, K\. Erleben, L\. Paull, F\. Shkurti, D\. Nowrouzezahrai, and S\. Fidler \(2021\) gradSim: differentiable simulation for system identification and visuomotor control\. In International Conference on Learning Representations \(ICLR\), Note: arXiv:2104\.02646\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[41\] L\. P\. Kaelbling and T\. Lozano\-Pérez \(2013\) Integrated task and motion planning in belief space\. The International Journal of Robotics Research 32\(9\-10\), pp\. 1194–1227\. Cited by: [§4](https://arxiv.org/html/2605.30542#S4.p2.1)\.
- \[42\] B\. Kang, Y\. Yue, R\. Lu, Z\. Lin, Y\. Zhao, K\. Wang, G\. Huang, and J\. Feng \(2024\) How far is video generation from world model: a physical law perspective\. arXiv preprint arXiv:2411\.02385\. Cited by: [§1](https://arxiv.org/html/2605.30542#S1.p3.1), [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p3.1)\.
- \[43\] B\. Kerbl, G\. Kopanas, T\. Leimkühler, and G\. Drettakis \(2023\) 3D Gaussian Splatting for real\-time radiance field rendering\. ACM Transactions on Graphics 42\(4\)\. Note: arXiv:2308\.04079\. Cited by: [§3\.2](https://arxiv.org/html/2605.30542#S3.SS2.p9.1)\.
- \[44\] T\. Kipf, E\. van der Pol, and M\. Welling \(2020\) Contrastive learning of structured world models\. In International Conference on Learning Representations \(ICLR\), Note: arXiv:1911\.12247\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[45\] R\. Lam, A\. Sanchez\-Gonzalez, M\. Willson, P\. Wirnsberger, M\. Fortunato, F\. Alet, S\. Ravuri, T\. Ewalds, Z\. Eaton\-Rosen, W\. Hu, A\. Merose, S\. Hoyer, G\. Holland, O\. Vinyals, J\. Stott, A\. Pritzel, S\. Mohamed, and P\. Battaglia \(2022\) GraphCast: learning skillful medium\-range global weather forecasting\. arXiv preprint arXiv:2212\.12794\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[46\] Y\. Li, J\. Wu, R\. Tedrake, J\. B\. Tenenbaum, and A\. Torralba \(2019\) Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids\. In International Conference on Learning Representations \(ICLR\), Note: arXiv:1810\.01566\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[47\] Lightricks \(2026\) LTX\-2 model card\. Note: [https://huggingface\.co/Lightricks/LTX\-2](https://huggingface.co/Lightricks/LTX-2)\. Cited by: [§A\.3](https://arxiv.org/html/2605.30542#A1.SS3.p2.2)\.
- \[48\] Y\. Liu, B\. Huang, Z\. Zhu, H\. Tian, M\. Gong, Y\. Yu, and K\. Zhang \(2023\) Learning world models with identifiable factorization\. arXiv preprint arXiv:2306\.06561\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[49\] Z\. Liu, W\. Ye, Y\. Luximon, P\. Wan, and D\. Zhang \(2025\) PhysFlow: unleashing the potential of multi\-modal foundation models and video diffusion for 4d dynamic physical scene simulation\. In IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), Note: arXiv:2411\.14423\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[50\] F\. Locatello, D\. Weissenborn, T\. Unterthiner, A\. Mahendran, G\. Heigold, J\. Uszkoreit, A\. Dosovitskiy, and T\. Kipf \(2020\) Object\-centric learning with slot attention\. In Advances in Neural Information Processing Systems \(NeurIPS\), Note: arXiv:2006\.15055\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[51\] F\. Meng, J\. Liao, X\. Tan, W\. Shao, Q\. Lu, K\. Zhang, Y\. Cheng, D\. Li, Y\. Qiao, and P\. Luo \(2024\) Towards world simulator: crafting physical commonsense\-based benchmark for video generation\. arXiv preprint arXiv:2410\.05363\. Cited by: [§1](https://arxiv.org/html/2605.30542#S1.p3.1), [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p3.1)\.
- \[52\] V\. Micheli, E\. Alonso, and F\. Fleuret \(2023\) Transformers are sample\-efficient world models\. In International Conference on Learning Representations \(ICLR\), Note: arXiv:2209\.00588\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p1.1)\.
- \[53\] J\. J\. Monaghan \(2005\) Smoothed particle hydrodynamics\. Reports on progress in physics 68\(8\), pp\. 1703–1759\. Cited by: [§A\.1](https://arxiv.org/html/2605.30542#A1.SS1.p3.1)\.
- \[54\] S\. Motamed, L\. Culp, K\. Swersky, P\. Jaini, and R\. Geirhos \(2026\) Do generative video models understand physical principles?\. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp\. 948–958\. Cited by: [§1](https://arxiv.org/html/2605.30542#S1.p3.1)\.
- \[55\] F\. O’Mahony, R\. Cipolla, and A\. Tewari \(2025\) VDAWorld: world modelling via vlm\-directed abstraction and simulation\. arXiv preprint arXiv:2512\.11061\. Cited by: [§B\.2](https://arxiv.org/html/2605.30542#A2.SS2.p4.1)\.
- \[56\] J\. Pathak, S\. Subramanian, P\. Harrington, S\. Raja, A\. Chattopadhyay, M\. Mardani, T\. Kurth, D\. Hall, Z\. Li, K\. Azizzadenesheli, P\. Hassanzadeh, K\. Kashinath, and A\. Anandkumar \(2022\) FourCastNet: a global data\-driven high\-resolution weather model using adaptive fourier neural operators\. arXiv preprint arXiv:2202\.11214\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[57\] T\. Pfaff, M\. Fortunato, A\. Sanchez\-Gonzalez, and P\. W\. Battaglia \(2021\) Learning mesh\-based simulation with graph networks\. In International Conference on Learning Representations \(ICLR\), Note: arXiv:2010\.03409\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[58\] M\. Raissi, P\. Perdikaris, and G\. E\. Karniadakis \(2019\) Physics\-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations\. Journal of Computational physics 378, pp\. 686–707\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[59\] M\. Rigter, T\. Gupta, A\. Hilmkil, and C\. Ma \(2024\) AVID: adapting video diffusion models to world models\. arXiv preprint arXiv:2410\.12822\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[60\] A\. Sanchez\-Gonzalez, J\. Godwin, T\. Pfaff, R\. Ying, J\. Leskovec, and P\. W\. Battaglia \(2020\) Learning to simulate complex physics with graph networks\. In International Conference on Machine Learning \(ICML\), Note: arXiv:2002\.09405\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[61\] J\. Schrittwieser, I\. Antonoglou, T\. Hubert, K\. Simonyan, L\. Sifre, S\. Schmitt, A\. Guez, E\. Lockhart, D\. Hassabis, T\. Graepel, T\. Lillicrap, and D\. Silver \(2020\) Mastering Atari, Go, chess and shogi by planning with a learned model\. Nature 588\(7839\), pp\. 604–609\. Note: arXiv:1911\.08265\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p1.1)\.
- \[62\] The Newton Contributors \(2025\-04\) Newton: GPU\-accelerated physics simulation for robotics and simulation research\. External Links: [Link](https://github.com/newton-physics/newton)\. Cited by: [§A\.1](https://arxiv.org/html/2605.30542#A1.SS1.p2.1)\.
- \[63\] T\. Xie, Z\. Zong, Y\. Qiu, X\. Li, Y\. Feng, Y\. Yang, and C\. Jiang \(2024\) PhysGaussian: physics\-integrated 3d gaussians for generative dynamics\. In IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), Note: arXiv:2311\.12198\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[64\] C\. Zhang, D\. Cherniavskii, A\. Tragoudaras, A\. Vozikis, T\. Nijdam, D\. W\. Prinzhorn, M\. Bodracska, N\. Sebe, A\. Zadaianchuk, and E\. Gavves \(2025\) Morpheus: benchmarking physical reasoning of video generative models with real physical experiments\. arXiv preprint arXiv:2504\.02918\. Cited by: [§1](https://arxiv.org/html/2605.30542#S1.p3.1)\.
- \[65\] Q\. Zhang, P\. Jing, H\. Yu, F\. Ding, F\. Nie, W\. Wang, Y\. Du, J\. Zou, J\. Wu, and B\. Shuai \(2026\) Physion\-eval: evaluating physical realism in generated video via human reasoning\. arXiv preprint arXiv:2603\.19607\. Cited by: [§1](https://arxiv.org/html/2605.30542#S1.p3.1)\.
- \[66\] T\. Zhang, H\. Yu, R\. Wu, B\. Y\. Feng, C\. Zheng, N\. Snavely, J\. Wu, and W\. T\. Freeman \(2024\) Physdreamer: physics\-based interaction with 3d objects via video generation\. In European Conference on Computer Vision, pp\. 388–406\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.
- \[67\] W\. Zhang, B\. Terver, A\. Zholus, S\. Chitnis, H\. Sutaria, M\. Assran, R\. Balestriero, A\. Bar, A\. Bardes, Y\. LeCun, et al\. \(2026\) Hierarchical planning with latent world models\. arXiv preprint arXiv:2604\.03208\. Cited by: [§A\.4](https://arxiv.org/html/2605.30542#A1.SS4.p2.1)\.
- \[68\] Z\. Zhou and D\. Negrut \(2025\) ChronoDreamer: action\-conditioned world model as an online simulator for robotic planning\. arXiv preprint arXiv:2512\.18619\. Cited by: [§2\.2](https://arxiv.org/html/2605.30542#S2.SS2.p4.1)\.

## Appendix A Platforms for Numerical Experimentation

In this section, we describe the computational tools and platforms used in our experimental pipeline\. These include foundation models \(e\.g\., vision\-language models, video diffusion models\) as well as simulation environments for controlled evaluation\. We prioritize open\-source models with competitive performance to balance reproducibility and capability\. Where applicable, we report model variants, prompts and implementation details to facilitate replication of our results\.

### A\.1 Physics Engines for Controlled Experiments

We use simulator\-generated rollouts as controlled references for intervention queries\. In each rollout, the initial scene, material and contact parameters, solver settings, and intervention are specified explicitly, and the state is advanced by numerical time integration rather than by a learned transition model\. Rendered videos are then derived from the simulated state trajectory\. These rollouts are not necessarily identical to real\-world physics, yet they serve as controlled physical references under known modeling assumptions\.
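
As a minimal illustration of what advancing the state by numerical time integration means in a controlled rollout, the sketch below integrates a block sliding down a fixed incline with semi\-implicit Euler steps and varies only the friction coefficient\. It is a toy stand\-in written for this explanation and does not use the physics engines described in this appendix: the sliding block replaces the ball, static and kinetic friction share one assumed coefficient, and the function name and default values are hypothetical\.

```python
import math


def rollout(friction: float, angle_deg: float = 20.0, length: float = 1.0,
            dt: float = 1e-3, g: float = 9.81) -> float:
    """Return the speed (m/s) at the bottom of an incline of the given length.

    The scene (angle, length) is held fixed; only the friction coefficient,
    a latent physical parameter, is varied between rollouts.
    """
    theta = math.radians(angle_deg)
    accel = g * (math.sin(theta) - friction * math.cos(theta))
    if accel <= 0.0:
        return 0.0  # friction holds the block in place; it never starts moving
    x = 0.0
    v = 0.0
    while x < length:
        v += accel * dt   # semi-implicit Euler: update velocity first,
        x += v * dt       # then position with the new velocity
    return v


if __name__ == "__main__":
    for mu in (0.0, 0.1, 0.2, 0.5):
        print(f"friction={mu}: speed at bottom = {rollout(mu):.3f} m/s")
```

For the frictionless case the integrated speed can be checked against the closed\-form value $\sqrt{2 g L \sin\theta}$, which is the sense in which such rollouts act as controlled references under known modeling assumptions\.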

We use NVIDIA Newton \[[62](https://arxiv.org/html/2605.30542#bib.bib40)\], a Warp\-based physics engine, for rigid\-body, deformable\-body, and robot\-contact experiments\. Newton provides GPU\-accelerated simulation with XPBD\-based contact for rigid bodies and particle\-based methods for deformable objects, along with support for articulated robot interaction\. We instantiate a small set of controlled scenes, most following a ramp interaction: a ball descends an incline, transitions to a flat surface, and interacts with a target object \(e\.g\., rigid stack, container, or deformable body\)\. Scene geometry, camera, and rendering are held fixed while physical parameters, such as density, friction, and restitution, are varied through Newton’s material and contact models\. Rigid interactions are resolved with the XPBD contact solver, while deformable objects use particle\-based simulation such as the material point method \(MPM\) with self\-contact\. We additionally include a robot\-contact setting in which a Franka Panda end\-effector follows a prescribed trajectory and interacts with the environment under Newton’s articulated\-body and contact simulation\.

For fluid\-coupled experiments, we also use Genesis \[[5](https://arxiv.org/html/2605.30542#bib.bib32)\], a GPU\-accelerated physics simulation platform designed for robotics and embodied AI\. In particular, Genesis supports particle\-based fluid simulation via smoothed\-particle hydrodynamics \(SPH\) \[[53](https://arxiv.org/html/2605.30542#bib.bib41)\], including divergence\-free formulations \(DFSPH\) \[[9](https://arxiv.org/html/2605.30542#bib.bib42)\], and couples these with rigid\-body dynamics through shared collision and boundary representations\. In our setup, liquids are represented explicitly as SPH particles and evolved jointly with rigid bodies under the same simulation loop\. Rigid–fluid interaction is handled through mesh\-derived signed distance fields, enabling consistent coupling between container motion and fluid response\. We use this capability to construct a small set of controlled fluid\-interaction scenarios, including container transport and pouring, where object motion may be prescribed but fluid dynamics are fully simulated\. As in the Newton experiments, we hold scene geometry, camera, and control trajectories fixed and vary only fluid properties such as viscosity\. This isolates fluid parameters as the latent factors governing outcomes such as retention, transfer, and spillage, while all state evolution is determined by numerical simulation rather than learned dynamics\.

Across both engines, the simulator state defines the reference trajectory, and rendered videos are derived from this state\. Physical parameters are specified or calibrated independently rather than inferred end\-to\-end from visual input, separating simulation from perception and parameter estimation\.
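The role of explicit parameters and numerical time integration can be illustrated with a minimal sketch\. The example below is not Newton or Genesis code; it is a hypothetical one\-dimensional stand\-in in which a block slides on a flat surface under Coulomb friction\. The function name `slide_distance`, the friction values, and the time step are assumptions chosen for illustration\. The initial state is identical in both runs, and only the friction coefficient, which cannot be read off the first frame, changes the outcome\.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def slide_distance(v0: float, mu: float, dt: float = 1e-4) -> float:
    """Advance a sliding block until it stops and return the distance covered.

    Uses semi-implicit Euler: the velocity is updated from the friction
    force first, then the position is advanced with the new velocity.
    """
    x, v = 0.0, v0
    while v > 0.0:
        v = max(0.0, v - mu * G * dt)  # Coulomb friction decelerates the block
        x += v * dt
    return x


# Same initial state, two hypothetical friction coefficients.
for mu in (0.2, 0.6):
    numeric = slide_distance(v0=2.0, mu=mu)
    analytic = 2.0 ** 2 / (2.0 * mu * G)  # closed form v0^2 / (2 * mu * g)
    print(f"mu={mu}: numeric={numeric:.3f} m, analytic={analytic:.3f} m")
```

Both runs start from the same state, yet the stopping distances differ by roughly a factor of three because the friction coefficients do\. The closed\-form value is included only as a sanity check on the integrator\.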

### A\.2 Vision Language Model

We evaluate GPT\-5\.5 as a vision\-language baseline for direct physical outcome prediction from static scene observations\. We test whether a VLM can infer the likely physical evolution of a scene directly from an image and a natural\-language query\. We repeat each prompt five times and exclude blank responses from the qualitative analysis\. The model uses “medium” reasoning effort by default\. Filenames are hidden from the model so that labels such as cup\_water\.png do not leak physical information\.
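The querying protocol above can be sketched as follows\. This is an illustrative harness, not the evaluation code behind the reported results: `query_model` is a hypothetical callable standing in for whatever client sends an image and a prompt to the VLM, and the fake model exists only to make the example runnable\.

```python
from typing import Callable, List


def collect_responses(
    query_model: Callable[[bytes, str], str],
    image_bytes: bytes,
    prompt: str,
    n_repeats: int = 5,
) -> List[str]:
    """Query the model n_repeats times and drop blank responses.

    Only raw image bytes are forwarded, so a descriptive filename
    never reaches the model.
    """
    responses = []
    for _ in range(n_repeats):
        text = query_model(image_bytes, prompt)
        if text and text.strip():
            responses.append(text.strip())
    return responses


def make_fake_model() -> Callable[[bytes, str], str]:
    """Hypothetical stand-in model: returns a blank reply on every second call."""
    calls = {"n": 0}

    def fake(image_bytes: bytes, prompt: str) -> str:
        calls["n"] += 1
        return "" if calls["n"] % 2 == 0 else f"reply {calls['n']}"

    return fake


kept = collect_responses(make_fake_model(), b"fake image bytes", "Describe the motion.")
print(len(kept), "non-blank responses kept")
```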

### A\.3 Audio\-Visual Foundation Model

We employ LTX\-2 \[[29](https://arxiv.org/html/2605.30542#bib.bib15)\] as our primary video generation model\. LTX\-2 is an open\-source diffusion transformer \(DiT\)\-based audio\-visual foundation model designed to generate temporally consistent video with synchronized audio within a unified architecture\. In this work, we focus primarily on video fidelity and temporal accuracy\. The model supports multiple conditioning modalities, including text\-to\-video and image\-to\-video generation\. In our experiments, we primarily use text\-conditioned video generation with keyframe anchoring\. The resulting outputs are used as a diffusion\-based baseline for comparison against physics\-driven simulation results\.

For our experiments, we follow the two\-stage inference pipeline provided in the official LTX\-2 implementation \[[47](https://arxiv.org/html/2605.30542#bib.bib16)\]\. The base model consists of approximately 19 billion parameters, with additional upscaling and refinement models used in the second stage\. To ensure reproducibility, all trials are seeded\. We additionally anchor the initial frame \(frame 0\) with full strength to enforce a consistent initialization aligned with the reference video\.

As of April 2026, LTX\-2 ranks highly among open\-source video generation models\. According to Artificial Analysis, which evaluates models across multiple quality and temporal consistency metrics, LTX\-2 is ranked first on both the text\-to\-video and image\-to\-video leaderboards \[[3](https://arxiv.org/html/2605.30542#bib.bib17),[2](https://arxiv.org/html/2605.30542#bib.bib18)\]\. This performance supports its use as a strong baseline in our comparative evaluation\.

### A\.4 Vision\-based World Models

Vision\-based world models remain the most common and actively researched subcategory of world models\. In this work, we also experiment with V\-JEPA 2 \[[4](https://arxiv.org/html/2605.30542#bib.bib20)\] from Meta\. V\-JEPA 2 is a publicly available self\-supervised video representation model based on the joint\-embedding predictive architecture\. Unlike diffusion\-based video generators, V\-JEPA \[[7](https://arxiv.org/html/2605.30542#bib.bib19)\] does not generate pixels directly; instead, it learns predictive representations by modeling masked or future regions in latent space\. V\-JEPA 2 extends this framework with large\-scale video pretraining and limited robot interaction data, achieving competitive results in downstream tasks such as video understanding and robotic motion planning\.

In our experiments, V\-JEPA 2 is used as a latent predictive world\-modeling baseline\. In particular, we use the V\-JEPA 2\-AC pipeline \[[4](https://arxiv.org/html/2605.30542#bib.bib20),[67](https://arxiv.org/html/2605.30542#bib.bib21)\], an action\-conditioned dynamics model that predicts future latent states given the current state and candidate actions, enabling forward simulation without explicit physics modeling\. At runtime, control is performed via model predictive control \(MPC\): sequences of actions are sampled, rolled out through the learned dynamics, and evaluated using a task\-specific cost defined in latent space, usually the distance to a goal representation\. The process is purely vision\-based and not explicitly physically constrained; we compare the resulting action sequences against physics simulation results across different control scenarios\.
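The planning loop described above can be sketched as a random\-shooting planner\. This is a simplified illustration under stated assumptions, not the V\-JEPA 2\-AC implementation: `toy_dynamics` is a hypothetical stand\-in for the learned latent dynamics, the latent state is a short list of floats, and uniform random sampling is used for brevity in place of whatever optimizer the real pipeline uses\.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]
Dynamics = Callable[[Vector, Vector], Vector]


def rollout(dynamics: Dynamics, z0: Vector, actions: List[Vector]) -> Vector:
    """Roll an action sequence through the dynamics and return the final latent state."""
    z = z0
    for a in actions:
        z = dynamics(z, a)
    return z


def latent_cost(z: Vector, z_goal: Vector) -> float:
    """Euclidean distance to the goal representation in latent space."""
    return sum((zi - gi) ** 2 for zi, gi in zip(z, z_goal)) ** 0.5


def plan_random_shooting(
    dynamics: Dynamics,
    z0: Vector,
    z_goal: Vector,
    horizon: int = 5,
    n_samples: int = 256,
    seed: int = 0,
) -> Tuple[List[Vector], float]:
    """Sample action sequences and keep the one whose rollout ends closest to the goal."""
    rng = random.Random(seed)
    best_actions: List[Vector] = []
    best_cost = float("inf")
    for _ in range(n_samples):
        actions = [[rng.uniform(-1.0, 1.0) for _ in z0] for _ in range(horizon)]
        cost = latent_cost(rollout(dynamics, z0, actions), z_goal)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost


def toy_dynamics(z: Vector, a: Vector) -> Vector:
    """Hypothetical latent dynamics: each action nudges the latent state."""
    return [zi + 0.1 * ai for zi, ai in zip(z, a)]


actions, cost = plan_random_shooting(toy_dynamics, [0.0, 0.0], [0.3, -0.2])
print(f"best latent distance: {cost:.3f}")
```

Nothing in `latent_cost` refers to mass, friction, or force limits; the planner only sees distance in latent space\. In a receding\-horizon setting, only the first action of the selected sequence would be executed before replanning\.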

## Appendix B Additional Experimental Results

In this section, we provide additional experimental details and numerical results\.

### B\.1 VLM prediction under controlled physical variation

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/superball.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/single_jelly.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/double_jelly.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/cup_water.png)

Figure 7: Static\-image benchmark for VLM physical prediction\. Each scene shows a ball rolling down a ramp into different interaction regimes: rigid\-body collision, single deformable\-wall impact, double deformable\-wall impact, and rigid–fluid interaction with a water\-filled cup\. The benchmark tests whether VLM predictions track the underlying physical response under intervention rather than only visual similarity\.

We evaluate GPT\-5\.5 on static\-image physical prediction using four rendered ramp\-interaction scenes: rigid blocks and ball, a single deformable jelly wall, two deformable jelly walls, and a water\-filled cup, shown in Figure [7](https://arxiv.org/html/2605.30542#A2.F7)\. Each scene places a ball near the top of a wooden ramp connected to a flat platform, with the target in the ball’s downstream path\.

Each image is queried under multiple levels of context\. The no\_context prompt asks the model to describe the motion and final resting locations of the objects using only the image and a generic instruction\. The low\_context prompt names the main objects and the fixed ramp\-platform geometry but does not specify material properties or the full intervention\. The high\_context prompt specifies that the ball is made of aluminum, that it is released from rest, that it rolls down the ramp under gravity, and that it collides with the relevant target object\. For the cup scene, the high\-context prompt also states that the cup is filled with water\. This design probes whether additional textual context changes the model’s physical prediction, especially when the relevant physical variables are latent, visually ambiguous, or only partially specified\. We additionally evaluate counterfactual prompts that modify one physical property of the scene while keeping the visible configuration fixed, including material changes, altered release height, increased friction, and increased fluid viscosity\.

For the cup scene, the no\-context query asks only: “Describe the motion and final resting location of the objects in the scene\. Explain your reasoning\.” The low\-context query additionally specifies the ramp, platform, ball, and cup\. The high\-context query specifies that an aluminum ball is released from rest, rolls down the ramp under gravity, and collides with a water\-filled cup positioned in its path\. The counterfactual query modifies the setup further by asking how the outcome changes if the ball starts from half the ramp height and the liquid is honey instead of water\. The other scenes are worded similarly\.

The reference outcomes are: \(i\) the rigid blocks and red ball scatter; \(ii\) the single jelly wall topples forward because the ball strikes below its center of mass; \(iii\) the two jelly walls deform but remain upright and slide because table friction is low; and \(iv\) the water\-filled cup tips and spills\.

The responses show several recurring patterns\.

1. Low\-context prompts sometimes produce incorrect physical interpretations of the scene\. In several trials, the model predicts no collision because it states that the target is not in the ball’s path, particularly in the cup and double\-jelly scenes\. Other responses infer incorrect scene geometry or motion entirely, including predicting that objects slide off the table due to the apparent camera angle or assuming that the ball is never released\. These responses show that, under weak contextual specification, the model can construct a qualitatively incorrect physical scenario before detailed reasoning about collision, deformation, or fluid interaction occurs\.
2. High\-context prompts isolate the remaining physical prediction errors\. When release and collision are explicitly specified, many low\-context ambiguities disappear and the responses usually invoke qualitatively relevant effects including rolling motion, momentum transfer, deformation, frictional dissipation, and water sloshing\. However, the model still does not consistently or precisely predict the realized outcome\. In the single\-jelly\-wall scene, responses often predict deformation or sliding but omit the observed forward toppling caused by the low strike point\. In the two\-jelly\-wall scene, some responses correctly describe sliding deformation without overturning, while others predict collapse or toppling\. In the cup scene, the model often states that tipping and spilling may occur rather than predicting the observed spill outcome\. The remaining errors therefore persist even after the intervention itself is explicitly specified\.
3. Counterfactual prompts improve directional reasoning but still produce threshold uncertainty\. The model usually predicts the correct qualitative trend: lower release height reduces impact energy, honey damps sloshing, high friction suppresses sliding, and gelatin dissipates more energy than rigid blocks\. However, the responses still frequently describe tipping, toppling, sliding, and spilling as possible outcomes rather than resolving which event occurs\. In one response, the model also assumes that an aluminum ball has lower mass than the original ball even though the original material is unspecified\.

These results separate qualitative explanation from intervention prediction\. The VLM can identify relevant physical effects, especially when the prompt supplies the intervention\. It does not consistently predict the realized outcome when that outcome depends on thresholded contact, deformation, or rigid–fluid dynamics\. A physically viable model must represent the relevant variables, propagate them through the selected dynamics, and return the realized outcome or a justified uncertainty set\.
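One way to read “a justified uncertainty set” is sketched below\. The example is a hypothetical threshold model, not a result from the experiments above: the ball is assumed to be a solid sphere rolling without slipping, so its impact speed follows from energy conservation, and the target is assumed to tip when that speed exceeds a latent critical speed\. The critical speeds and release heights are made\-up values, and `outcome_set` and `make_simulator` are illustrative names\.

```python
import math
from typing import Callable, Iterable, Set

G = 9.81  # gravitational acceleration, m/s^2


def impact_speed(release_height: float) -> float:
    """Speed of a solid sphere after rolling without slipping down release_height.

    From m*g*h = (1/2)*m*v^2 + (1/2)*(2/5)*m*r^2*(v/r)^2 = (7/10)*m*v^2.
    """
    return math.sqrt(10.0 * G * release_height / 7.0)


def outcome_set(simulate: Callable[[float], str], candidates: Iterable[float]) -> Set[str]:
    """Propagate every plausible latent parameter and collect the distinct outcomes."""
    return {simulate(p) for p in candidates}


def make_simulator(release_height: float) -> Callable[[float], str]:
    """Hypothetical threshold dynamics: the target tips above a critical impact speed."""
    v = impact_speed(release_height)

    def simulate(critical_speed: float) -> str:
        return "tips" if v > critical_speed else "stays upright"

    return simulate


# Hypothetical plausible range for the latent critical speed, in m/s.
critical_speeds = [1.2, 1.5, 1.8]
for h in (0.30, 0.15, 0.05):
    print(h, sorted(outcome_set(make_simulator(h), critical_speeds)))
```

When the set contains a single outcome, the model can commit to it; when it contains several, the honest answer is the set itself, together with the latent parameter that would need to be measured to resolve it\.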

### B\.2 Visual Foundation Models are Not Physically Viable Simulators

While recent advances in diffusion\-based video generation have significantly improved visual fidelity and temporal coherence, these models remain fundamentally limited and unfit as physical simulators\. In particular, conditioning on key frames and using text\-based prompt guidance do not sufficiently constrain the underlying dynamics to ensure physical consistency over time\.

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/newton_sim_4x5_timestep_grid.jpg)

Figure 8: Controlled ramp\-to\-tower rigid\-body interactions under latent material variation\. Each row shows a time sequence from one condition in the simulation suite: \(a\) the density\-variation setup with a wood tower, \(b\) the density\-variation setup with a steel tower, \(c\) the restitution setup with a high\-restitution projectile, and \(d\) the restitution setup with both a high\-restitution projectile and steel cubes\.

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/cup_demo_frame_001.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/cup_demo_frame_002.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/cup_demo_frame_003.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/cup_demo_frame_004.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/cup_demo_frame_005.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/cup_demo_frame_006.jpg)

\(a\) Physical simulation plus rendering\.
![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/cup_diff_frame_001.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/cup_diff_frame_002.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/cup_diff_frame_003.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/cup_diff_frame_004.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/cup_diff_frame_005.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/cup_diff_frame_006.jpg)

\(b\) Diffusion\-based video generation with fixed initial frame and text guidance\.

Figure 9: Comparison between physical simulation and a visual diffusion model in a ball\-and\-cup collision scenario\. The simulation captures both rigid\-body interaction and fluid behavior through particle dynamics\. The diffusion model is prompted with: “Continue from this frame, ball rolling down ramp, accelerating naturally while staying in contact with the surface, transfers momentum into the cup filled with water, knocks it over, causing water to spill onto the surface\.”

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/jelly_demo_2_frame_001.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/jelly_demo_2_frame_002.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/jelly_demo_2_frame_003.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/jelly_demo_2_frame_004.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/jelly_demo_2_frame_005.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/jelly_demo_2_frame_006.jpg)

\(a\) Physical simulation plus rendering\.
![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/jelly_diff_2_frame_001.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/jelly_diff_2_frame_002.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/jelly_diff_2_frame_003.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/jelly_diff_2_frame_004.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/jelly_diff_2_frame_005.jpg)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/jelly_diff_2_frame_006.jpg)

\(b\) Diffusion\-based video generation with fixed initial frame and text guidance\.

Figure 10: Comparison between physical simulation and a visual diffusion model in a collision between a ball and two jelly blocks\. The physical simulation correctly handles interaction between plastic and elastic materials\. The diffusion model is prompted with: “Continue from this frame, ball rolling down ramp, accelerating naturally while staying in contact with the surface, and transfers momentum into the jelly blocks on impact\.”

We illustrate this limitation through two representative examples\. In each case, we provide a fixed initial state and a carefully specified prompt describing the intended physical process\. We use physics\-based simulation with rendering as a comparative baseline for both examples\. To ensure a fair comparison, we keep key hyperparameters such as the total number of frames, aspect ratio, and resolution fixed across all runs\. Additionally, following the LTX\-2 guidelines, we include negative prompts \(e\.g\., “unrealistic physics”, “extra balls”\) to further constrain the diffusion model outputs\.

We display the results in [Figure 9](https://arxiv.org/html/2605.30542#A2.F9) and [Figure 10](https://arxiv.org/html/2605.30542#A2.F10)\. Despite detailed instructions specifying the desired physical behavior, the generated videos deviate substantially from the true dynamics\. In [Figure 9](https://arxiv.org/html/2605.30542#A2.F9), contact does not yield correct motion for either the ball or the cup, and the fluid behavior is unstable and inconsistent\. Similarly, in [Figure 10](https://arxiv.org/html/2605.30542#A2.F10), we do not observe physically correct contact between objects\. We also note that, even when the instructions are explicit, probabilistic models such as text\-conditioned diffusion models often fail to follow the prompt exactly\.

Across both examples, the failure modes are systematic rather than incidental\. They stem from the absence of explicit physical constraints in the generative process\. As a result, despite improvements in visual fidelity, even state\-of\-the\-art visual foundation models cannot reliably replace even simple physics\-based simulators in settings where accurate dynamics are required\. As noted in \[[55](https://arxiv.org/html/2605.30542#bib.bib23)\], direct pixel prediction is insufficient and unreliable for the accuracy requirements of world models\.

### B\.3 Failure Modes of Vision\-Based World Models in Control

We next evaluate whether a vision\-based world model can be used as a control\-oriented simulator in contact\-rich manipulation\. Unlike the diffusion models considered in [Section B\.2](https://arxiv.org/html/2605.30542#A2.SS2), V\-JEPA 2 does not directly synthesize future image frames\. Instead, the action\-conditioned model predicts future latent representations conditioned on the current visual state, robot pose, and candidate action sequences\. We then use MPC to select the action trajectory whose predicted latent state is closest to the goal representation\. The selected trajectory is subsequently rendered through the Newton physics simulator in order to inspect the physical outcome\.

This setting is more favorable than direct pixel generation because the final visualization is still produced by a physics engine\. However, it exposes a different and more control\-specific failure mode\. The planner optimizes for visual latent similarity, not for physical feasibility\. The model is not explicitly constrained by contact mechanics, friction cones, object mass, material density, torque limits, or the force histories required to move an object\. Consequently, a trajectory that appears promising under the learned latent objective may still fail when executed as a physical interaction\.

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/High_push_gt/frame_001.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/High_push_gt/frame_002.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/High_push_gt/frame_003.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/High_push_gt/frame_004.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/High_push_gt/frame_005.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/High_push_gt/frame_006.png)

\(a\) Ground\-truth physical rollout of a wooden wall\.
![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/High_push/frame_002.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/High_push/frame_003.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/High_push/frame_004.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/High_push/frame_005.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/High_push/frame_006.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/High_push/frame_007.png)

\(b\) V\-JEPA 2 trajectory rendered in the simulation under the high\-push setting\.
![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Concrete_gt/frame_002.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Concrete_gt/frame_003.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Concrete_gt/frame_004.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Concrete_gt/frame_005.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Concrete_gt/frame_006.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Concrete_gt/frame_007.png)

\(c\) Ground\-truth physical rollout of a concrete wall\.
![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Concrete/frame_002.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Concrete/frame_003.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Concrete/frame_004.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Concrete/frame_005.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Concrete/frame_006.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Concrete/frame_007.png)

\(d\) V\-JEPA 2 trajectory rendered in the simulation with the wall treated as concrete\.

Figure 11: Comparison between the ground\-truth physical rollout and V\-JEPA 2\-generated trajectories rendered in simulation\. The high\-push and concrete cases show that the trajectory generator can produce visually plausible motion, but does not reason about material\-dependent contact dynamics or force feasibility\.

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Low_push_gt/frame_001.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Low_push_gt/frame_002.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Low_push_gt/frame_003.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Low_push_gt/frame_004.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Low_push_gt/frame_005.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Low_push_gt/frame_006.png)

\(a\) Ground\-truth physical rollout with a low pushing point\.
![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Low_push/frame_000.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Low_push/frame_002.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Low_push/frame_003.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Low_push/frame_004.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Low_push/frame_005.png)

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/VJEPA/Low_push/frame_006.png)

\(b\) V\-JEPA 2 trajectory rendered in the simulation under the low\-push setting\.

Figure 12: Comparison between the ground\-truth physical rollout and the V\-JEPA 2\-generated trajectory rendered in simulation under the low\-push setting\.

![Refer to caption](https://arxiv.org/html/2605.30542v1/figures/combined_pour_timestep_rows.jpg)

Figure 13: Viscosity\-dependent robot\-arm pouring\. Each row shows a time sequence for a different liquid condition: \(a\) water\-like reference, \(b\) honey\-like reference, \(c\) synthetic target liquid, and \(d\) the best\-match viscosity estimate obtained from candidate simulations\.

We evaluate this behavior in two wall\-pushing scenarios\. The first varies the contact height\. A high contact point produces a visually salient rotational response, while a low contact point requires more precise control to produce the desired interaction\. The second varies the wall material while keeping the visual structure of the task fixed\. This tests whether a vision\-based model can select actions that remain valid when the same apparent goal is governed by different physical parameters\.

The results are shown in [Figure 11](https://arxiv.org/html/2605.30542#A2.F11) and [Figure 12](https://arxiv.org/html/2605.30542#A2.F12)\. In the high\-push case, the trajectory planned with V\-JEPA 2 captures the coarse behavior of the intended interaction, but it does not reliably reproduce the ground\-truth physical rollout\. This is expected from the structure of the objective: the model searches for actions that reduce latent visual discrepancy, not actions that satisfy the true contact dynamics of the wall\. The rendered rollout therefore reveals whether the visually selected trajectory is dynamically meaningful, and in this case the correspondence is incomplete\.

The material variation further exposes this limitation\. When the wall is treated as concrete, the task changes physically even if the visual scene remains similar\. The same end\-effector displacement can lead to a different outcome because the wall’s response depends on material\-dependent quantities such as friction and density\. Since the V\-JEPA 2 planning objective does not explicitly infer or optimize over these physical parameters, the selected trajectory does not adapt in a reliably material\-aware way\. The model may identify a visually plausible action direction, but it has no explicit mechanism for determining whether the corresponding force interaction is feasible\. These experiments illustrate a broader limitation of vision\-based world models for control\. In contact\-rich manipulation, an action is only predictive when conditioned on the relevant physical state of the scene, including geometry, contact configuration, friction, mass distribution, and material response\. A purely visual latent model can learn correlations between observations and actions, but it does not guarantee that the learned representation preserves the physical variables required for reliable planning\.
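The dependence on contact height, friction, and density can be made concrete with a quasi\-static sketch\. The example below is a textbook idealization, not the simulation used above: the wall is assumed to be a rigid, uniform cuboid pushed horizontally and slowly, so it slides once the push exceeds the friction coefficient times its weight, and it tips about its far bottom edge once the push torque exceeds the restoring torque of gravity\. The function `push_outcome`, the dimensions, the densities, and the force limit are all hypothetical\.

```python
G = 9.81  # gravitational acceleration, m/s^2


def push_outcome(
    density: float,      # kg/m^3
    width: float,        # wall thickness along the push direction, m
    depth: float,        # wall extent across the push direction, m
    height: float,       # wall height, m
    mu: float,           # static friction coefficient at the base
    push_height: float,  # height of the contact point above the base, m
    max_force: float,    # largest horizontal force the pusher can apply, N
) -> str:
    """Classify a slow horizontal push on a rigid uniform cuboid."""
    mass = density * width * depth * height
    f_slide = mu * mass * G                       # force needed to start sliding
    f_tip = mass * G * (width / 2) / push_height  # force needed to start tipping
    if max_force < min(f_slide, f_tip):
        return "infeasible"  # the wall does not move at all
    return "slides" if f_slide < f_tip else "tips"


# Hypothetical wall geometry, friction, and force limit; only the contact
# height and the material density change between the three cases.
wall = dict(width=0.1, depth=0.5, height=0.5, mu=0.5, max_force=100.0)
print(push_outcome(density=600.0, push_height=0.40, **wall))   # light wall, high push
print(push_outcome(density=600.0, push_height=0.05, **wall))   # light wall, low push
print(push_outcome(density=2400.0, push_height=0.05, **wall))  # dense wall, low push
```

Under these assumed values the three cases give three different outcomes even though the walls have identical geometry and would look alike in an image\. An objective defined only on visual similarity has no term that distinguishes them\.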

Overall, V\-JEPA 2 provides a stronger baseline than direct visual generation because it produces action\-conditioned latent rollouts rather than hallucinated future frames, enabling a path for downstream planning and control\. Nevertheless, the failure mode is conceptually similar to the diffusion results: visual plausibility is not equivalent to physical validity\. The model can optimize toward a goal representation while ignoring whether the proposed interaction is force\-feasible, material\-aware, or robust under execution\. A glass barrier that renders a target physically inaccessible may be close to invisible visually\. With these results, we argue that vision\-based predictive world models are insufficient, by themselves, for resolving the physical dynamics required by embodied robotic decision\-making\.
