How Should World Models Be Evaluated? A Decision-Making-Centric Position

arXiv cs.LG 06/16/26, 04:00 AM Papers
world-models evaluation decision-making embodied-ai reinforcement-learning planning counterfactual-reasoning
Summary
This paper surveys evaluation methods for world models and argues for a decision-making-centric framework that prioritizes counterfactual reasoning, planning, and policy optimization over visual quality. It introduces an L0–L7 evaluation ladder and a benchmark protocol to align evaluation with claimed utility.
arXiv:2606.15032v1 Announce Type: new
# How Should World Models Be Evaluated? A Decision-Making-Centric Position
Source: [https://arxiv.org/html/2606.15032](https://arxiv.org/html/2606.15032)
Yang Yu (1,2), Shiyuan Zhang (1,2), Yifei Sheng (1,2), Haoxiang Ren (1,2), Haoxin Lin (1,2,3)

(1) National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; (2) School of Artificial Intelligence, Nanjing University, Nanjing, China; (3) Cirquar Technologies, Nanjing, China

###### Abstract

World models have rapidly become one of the central abstractions in modern AI\. Yet the term now refers to several different objects: action\-conditioned environment models, latent imagination models, future\-video predictors, interactive neural simulators, latent predictive representations, and synthetic\-data engines\. Evaluation has broadened with the term\. Recent papers measure video realism, perceptual similarity, instruction following, physical plausibility, policy ranking, executability, planning success, and downstream policy improvement\. The result is not only metric diversity but also a recurring problem of *claim/evidence mismatch*: papers frequently make a stronger claim about what their model is useful for than their evaluation can actually establish\.

This paper surveys the recent literature and argues that the central question is use\-dependent\. When a model is presented as a world model for embodied decision\-making, a more decisive issue is not whether it generates visually compelling videos, but whether it supports reliable counterfactual reasoning, policy evaluation, planning, and policy optimization under intervention, policy\-induced distribution shift, and long\-horizon rollout\. We organize the literature using an L0–L7 ladder that ranges from visual plausibility to policy optimization utility\. In our interpretation, L0–L3 are most naturally read as diagnostics of generated artifacts, L4 is often the first genuinely interventional test, and L5–L7 provide the most direct evidence of decision usefulness\. Based on this diagnosis, we propose a decision\-making\-centric evaluation framework and a benchmark protocol that foreground counterfactual action fidelity, closed\-loop rollout validity, reward/value prediction, policy\-ranking agreement, optimization lift, model exploitability, and uncertainty calibration\.

## 1 Introduction

World models have rapidly become one of the most active themes in contemporary AI\. In one line of work, inherited from model\-based reinforcement learning, a world model is a learned dynamics model used for planning, imagination, policy evaluation, or policy optimization\[[23](https://arxiv.org/html/2606.15032#bib.bib1),[27](https://arxiv.org/html/2606.15032#bib.bib2),[24](https://arxiv.org/html/2606.15032#bib.bib3),[53](https://arxiv.org/html/2606.15032#bib.bib4)\]\. In another line, recent embodied video\-generation models are described as world models because they can generate plausible future observations from text, images, videos, or actions\[[58](https://arxiv.org/html/2606.15032#bib.bib5),[10](https://arxiv.org/html/2606.15032#bib.bib10),[43](https://arxiv.org/html/2606.15032#bib.bib11),[32](https://arxiv.org/html/2606.15032#bib.bib14),[42](https://arxiv.org/html/2606.15032#bib.bib15)\]\. A third line studies latent predictive representations, where the model predicts future embeddings rather than pixels\[[1](https://arxiv.org/html/2606.15032#bib.bib64),[40](https://arxiv.org/html/2606.15032#bib.bib65),[39](https://arxiv.org/html/2606.15032#bib.bib66)\]\. A fourth line uses generative models as synthetic\-data engines or executable video planners for robot learning\[[26](https://arxiv.org/html/2606.15032#bib.bib13),[2](https://arxiv.org/html/2606.15032#bib.bib49),[16](https://arxiv.org/html/2606.15032#bib.bib51),[28](https://arxiv.org/html/2606.15032#bib.bib25)\]\.

This proliferation is scientifically productive, but it has also made evaluation ambiguous\. Some papers evaluate pixel reconstruction or distributional video quality using MSE, PSNR, SSIM, LPIPS, FID, and FVD\. Others evaluate instruction following or physical plausibility using VLM judges, physical QA, or human preference\. Others use final policy success after training inside the model\. Still others use correlation between world\-model\-estimated policy success and real or simulator success\[[44](https://arxiv.org/html/2606.15032#bib.bib26),[47](https://arxiv.org/html/2606.15032#bib.bib28),[34](https://arxiv.org/html/2606.15032#bib.bib30),[45](https://arxiv.org/html/2606.15032#bib.bib22)\]\. These evaluations are not interchangeable\. A model can look like a strong video generator while being a poor environment model for control; conversely, a latent predictive model may be useful for planning without ever producing photorealistic pixels\.
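To make the last family of evaluations concrete, the following is a minimal sketch of how agreement between model-estimated and real policy success could be scored with a Spearman rank correlation. The helper names (`ranks`, `spearman_rank_correlation`) and all success rates are hypothetical and chosen only for illustration; they are not taken from any surveyed benchmark, and the formula used assumes tie-free data.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[int]:
    """Return the rank (0 = smallest) of each value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman_rank_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho for tie-free data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical success rates for four policies (illustrative numbers only).
real_success = [0.20, 0.45, 0.60, 0.80]
model_a_estimate = [0.25, 0.40, 0.70, 0.75]  # preserves the real ordering
model_b_estimate = [0.90, 0.60, 0.50, 0.30]  # reverses the real ordering

rho_a = spearman_rank_correlation(real_success, model_a_estimate)
rho_b = spearman_rank_correlation(real_success, model_b_estimate)
print(rho_a, rho_b)
```

A score of this kind says nothing about how the rollouts look; it only measures whether the model orders policies the way the real environment does, which is why it cannot be swapped for a video-quality metric or vice versa.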

Our thesis is conditional rather than universal\. We do *not* claim that every system called a world model should be judged by policy optimization\. If the intended use is future\-video generation, then video quality and semantic plausibility are legitimate primary targets\. The problem begins when evidence appropriate for one claim is rhetorically used to support a stronger claim\. If a model is presented as a world model for embodied decision\-making, then the question we find most informative is:

> *What would happen, in task\-relevant terms, if the agent took these actions from this history?*

A skeptical reader might object that visual quality, semantic plausibility, or human preference can correlate strongly with downstream utility in some settings, or that some systems called world models are simply not intended for control\. We agree\. Our claim is therefore comparative rather than eliminative\. We do not deny the usefulness of lower\-level metrics or purely generative world\-model research\. Rather, we argue that for models whose stated aim is embodied decision\-making, action\-, outcome\-, and policy\-level evaluations usually provide stronger evidence than artifact quality alone\.

This view is reflected most directly in environment\-model work on counterfactual learning, policy\-conditioned models, full\-horizon rollout, and generalizable embodied decision\-making\[[8](https://arxiv.org/html/2606.15032#bib.bib7),[7](https://arxiv.org/html/2606.15032#bib.bib8),[37](https://arxiv.org/html/2606.15032#bib.bib9),[63](https://arxiv.org/html/2606.15032#bib.bib6)\]\. It is also increasingly reflected in recent benchmarks that explicitly separate perceptual quality from functional utility\[[45](https://arxiv.org/html/2606.15032#bib.bib22),[28](https://arxiv.org/html/2606.15032#bib.bib25)\]\.

This paper focuses on *world models claimed for embodied decision\-making*: policy evaluation, planning, policy optimization, safety testing, and related uses\. It is therefore neither anti\-video\-metric nor anti\-VLM\-judge\. Our narrower claim is that, for this particular use case, these evaluations are often better interpreted as lower\-level or auxiliary diagnostics unless the claimed use is itself purely generative\. We make four contributions in this paper:

1. We provide a paper\-by\-paper survey of the recent world\-model literature, organized by *what each paper actually evaluates*\.
2. We identify a concrete and recurring failure mode in the literature: *claim/evidence mismatch*, where lower\-level evidence is informally taken to support stronger decision\-making claims\.
3. We organize the literature using an L0–L7 world\-model evaluation ladder, ranging from visual plausibility to policy optimization utility, and argue that, for decision\-making claims, L4 often marks the first genuinely interventional test while L5–L7 usually provide the most direct evidence of decision usefulness\.
4. We propose a decision\-making\-centric evaluation framework and benchmark protocol built around counterfactual branches, policy\-induced distribution shift, full\-horizon outcome fidelity, policy\-ranking agreement, optimization lift, exploitability, and uncertainty calibration\.

## 2 Background and Symbols

### 2\.1 A brief genealogy of the term

The phrase “world model” did not emerge in a vacuum\. In the setting most relevant to this paper, its closest ancestor is the *environment model* tradition in model\-based control and reinforcement learning\. In that tradition, the core object is an action\-conditioned predictive model of dynamics, rewards, and sometimes uncertainty, used to answer questions of the form: if the agent takes action $a$ from state or history $h$, what is likely to happen next, and what consequences will this have for return or task completion? In this narrower sense, the conceptual target is already counterfactual and decision\-theoretic: the model is valuable because it supports planning, policy evaluation, or policy improvement\[[23](https://arxiv.org/html/2606.15032#bib.bib1),[27](https://arxiv.org/html/2606.15032#bib.bib2),[24](https://arxiv.org/html/2606.15032#bib.bib3),[53](https://arxiv.org/html/2606.15032#bib.bib4)\]\.

The label “world model” became especially visible through work that paired compact representation learning with latent dynamics and a controller\[[23](https://arxiv.org/html/2606.15032#bib.bib1)\]\. In that formulation, the world model did not literally mean a photorealistic simulator of everything in the environment\. Rather, it referred to an internal predictive model of environment evolution, usually in a latent state space, sufficient to support control\. Dreamer\-style methods and real\-robot extensions such as DayDreamer preserved this basic interpretation: the world model is primarily a tool for imagined rollouts, value estimation, and policy learning\[[24](https://arxiv.org/html/2606.15032#bib.bib3),[53](https://arxiv.org/html/2606.15032#bib.bib4)\]\.

The term broadened as embodied AI and large\-scale generative modeling matured\. A first shift came from *future\-observation prediction*\. In many embodied and web\-scale settings, video is the most accessible supervisory signal, while explicit action or reward labels may be scarce or heterogeneous\. As a result, models that predict future frames, videos, or multimodal continuations began to be described as world models, especially when they were used to forecast how scenes evolve under language, image, video, or action conditions\[[10](https://arxiv.org/html/2606.15032#bib.bib10),[43](https://arxiv.org/html/2606.15032#bib.bib11),[32](https://arxiv.org/html/2606.15032#bib.bib14),[42](https://arxiv.org/html/2606.15032#bib.bib15)\]\. In this broader usage, the word “world” often refers to the model’s ability to generate plausible futures, even if the model is not directly evaluated as a policy\-evaluation or policy\-optimization tool\.

A second shift came from *interactive neural simulators*\. Once action\-conditioned video models became capable of autoregressive rollout, it became natural to reuse them as surrogate environments\. Systems such as UniSim, Vid2World, IRASim, WorldGym, and WorldArena sit in this intermediate region: they are still generative models of future observations, but they are also queried as if they were interactive environments\[[58](https://arxiv.org/html/2606.15032#bib.bib5),[25](https://arxiv.org/html/2606.15032#bib.bib27),[66](https://arxiv.org/html/2606.15032#bib.bib31),[44](https://arxiv.org/html/2606.15032#bib.bib26),[45](https://arxiv.org/html/2606.15032#bib.bib22)\]\. This blurs the boundary between “video predictor” and “world simulator\.” It also explains why evaluation becomes ambiguous: the same model can be evaluated visually in one paragraph and functionally in the next\.

A third shift came from *latent predictive representations*\. In JEPA\-style and related approaches, the modeling target is not pixel reconstruction but future latent structure\. These methods often argue, implicitly or explicitly, that a useful world model may be one that predicts abstract, planning\-relevant representations rather than photo\-realistic images\[[1](https://arxiv.org/html/2606.15032#bib.bib64),[40](https://arxiv.org/html/2606.15032#bib.bib65),[39](https://arxiv.org/html/2606.15032#bib.bib66)\]\. This line is important because it breaks the assumption that the natural output of a world model is necessarily a video\. It also sharpens the distinction between *observational fidelity* and *decision\-relevant sufficiency*\.

A fourth shift came from *world models as synthetic\-data engines or executable planners*\. In this line, the model is often not used as a general\-purpose simulator in the classical sense\. Instead, it may generate robot videos that are converted into actions, augment datasets with imagined trajectories, or produce demonstrations that improve downstream learning\[[26](https://arxiv.org/html/2606.15032#bib.bib13),[2](https://arxiv.org/html/2606.15032#bib.bib49),[16](https://arxiv.org/html/2606.15032#bib.bib51),[28](https://arxiv.org/html/2606.15032#bib.bib25)\]\. Here again, the phrase “world model” refers to a model of environmental evolution, but the operational role is neither pure dynamics learning nor pure video generation\. It is instrumental: generate useful training signal, executable plans, or rich counterfactual data\.

Viewed in this way, the current literature contains not one but several partially overlapping world\-model traditions\. The resulting ambiguity is not surprising\. The same term now covers at least six research objects: action\-conditioned environment models, latent imagination models, future\-video predictors, interactive neural simulators, latent predictive representations, and synthetic\-data engines\. These objects overlap, but they are not identical; consequently, their evaluations need not be identical either\.

| Phase | Object commonly called a world model | Why this usage emerged | Representative works | Typical evaluation emphasis |
| --- | --- | --- | --- | --- |
| I | Action\-conditioned environment model | Planning, control, off\-policy evaluation, imagination\-based learning | \[[23](https://arxiv.org/html/2606.15032#bib.bib1),[27](https://arxiv.org/html/2606.15032#bib.bib2),[24](https://arxiv.org/html/2606.15032#bib.bib3),[53](https://arxiv.org/html/2606.15032#bib.bib4)\] | Policy return, sample efficiency, model\-based planning, value estimation |
| II | Latent imagination model | Need for compact long\-horizon predictive state under partial observability | \[[23](https://arxiv.org/html/2606.15032#bib.bib1),[24](https://arxiv.org/html/2606.15032#bib.bib3),[53](https://arxiv.org/html/2606.15032#bib.bib4)\] | Latent rollout quality, return prediction, downstream control |
| III | Future\-video predictor | Availability of large\-scale video data; embodied tasks naturally expressed as future visual prediction | \[[10](https://arxiv.org/html/2606.15032#bib.bib10),[43](https://arxiv.org/html/2606.15032#bib.bib11),[32](https://arxiv.org/html/2606.15032#bib.bib14),[42](https://arxiv.org/html/2606.15032#bib.bib15)\] | Video fidelity, semantics, physical plausibility |
| IV | Interactive neural simulator | Autoregressive action\-conditioned video models reused as surrogate environments | \[[58](https://arxiv.org/html/2606.15032#bib.bib5),[25](https://arxiv.org/html/2606.15032#bib.bib27),[66](https://arxiv.org/html/2606.15032#bib.bib31),[44](https://arxiv.org/html/2606.15032#bib.bib26),[45](https://arxiv.org/html/2606.15032#bib.bib22)\] | Closed\-loop rollout quality, policy ranking, planning success |
| V | Latent predictive representation | Reaction against pixel\-centric evaluation; emphasis on abstraction and planning relevance | \[[1](https://arxiv.org/html/2606.15032#bib.bib64),[40](https://arxiv.org/html/2606.15032#bib.bib65),[39](https://arxiv.org/html/2606.15032#bib.bib66)\] | Planning, probing, transfer, dense correspondence, value\-relevant features |
| VI | Synthetic\-data engine / executable planner | Generative models used instrumentally to create trajectories, demonstrations, or plans | \[[26](https://arxiv.org/html/2606.15032#bib.bib13),[2](https://arxiv.org/html/2606.15032#bib.bib49),[16](https://arxiv.org/html/2606.15032#bib.bib51),[28](https://arxiv.org/html/2606.15032#bib.bib25),[50](https://arxiv.org/html/2606.15032#bib.bib55)\] | Downstream policy lift, executability, action recovery, imitation gains |

Table 1: A brief genealogy of the term “world model\.” The same label now covers several partially overlapping objects, which helps explain why evaluation practices have diverged\.

### 2\.2 A narrow and a broad reading of “world model”

The survey suggests that it is useful to distinguish two readings of the term\.

#### Narrow reading \(decision\-theoretic\)\.

A world model is an action\-conditioned predictive model that supports counterfactual reasoning about trajectories, outcomes, and values\. In this reading, the primary questions are interventional: what would happen if the agent chose this action sequence, and how good would that be?

#### Broad reading \(predictive or generative\)\.

A world model is any model that predicts or generates future states of the world, whether in pixels, video, latent space, symbolic form, or some other representation\. In this reading, the model may or may not be directly used for planning or control\.

Neither reading is obviously illegitimate\. However, they place pressure on different evaluations\. The narrow reading naturally foregrounds policy relevance\. The broad reading naturally licenses a wider range of predictive or generative metrics\. Much of the confusion in the literature arises when evidence collected under the broad reading is discussed as if it also settled questions posed by the narrow one\.

The broadening of the term can be understood as the result of several converging pressures\.

First, *data availability* shifted\. Large, richly varied video corpora are easier to obtain than large corpora of reset\-matched counterfactual branches with reliable action and reward annotation\. This encourages observation\-centric training and evaluation\.

Second, *modeling scale* shifted\. Video diffusion and transformer\-based generative models became strong enough that predicting future observations started to look like a plausible route to modeling the world, especially in embodied settings where the world is primarily encountered through images and video\[[10](https://arxiv.org/html/2606.15032#bib.bib10),[43](https://arxiv.org/html/2606.15032#bib.bib11)\]\.

Third, *use cases diversified*\. Some researchers want a surrogate evaluator, some want a data engine, some want a planner, some want a representation learner, and some want an open\-ended video generator\. The same family of models can sometimes serve more than one of these roles, but not always equally well\.

Fourth, *interfaces blurred*\. If a model predicts future observations under actions, it can be treated as a simulator; if it predicts latent futures under actions, it can be treated as a planning substrate; if it generates useful rollouts, it can be treated as a data engine\. Once these uses became common, the older environment\-model language and newer generative\-world\-model language naturally merged\.

### 2\.3 Controlled\-process view

Let $\mathcal{E}$ denote the real environment, which may be an MDP, a POMDP, a real robot task family, or a trusted simulator used as ground truth\. A trajectory is

$$\tau=(o_{0},a_{0},r_{0},o_{1},a_{1},r_{1},\ldots,o_{H}).$$

A history at time $t$ is

$$h_{t}=(o_{0},a_{0},\ldots,o_{t}).$$

A policy $\pi$ maps histories to actions\. Its discounted return is

$$J_{\mathcal{E}}(\pi)=\mathbb{E}_{\tau\sim\mathcal{E},\pi}\left[\sum_{t=0}^{H-1}\gamma^{t}r_{t}\right].$$
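
As a concrete reading of this notation, the sketch below estimates the discounted return of a history-dependent policy by rolling it out and averaging the discounted reward sums. The chain environment `toy_env_step`, the policy `always_right`, and the function names are hypothetical placeholders introduced for illustration only; with a deterministic environment a single episode already gives the exact value.

```python
from typing import Callable, List, Tuple

History = List[int]  # (o_0, a_0, ..., o_t) stored as a flat list ending in an observation
Policy = Callable[[History], int]
EnvStep = Callable[[int, int], Tuple[int, float]]  # (o_t, a_t) -> (o_{t+1}, r_t)


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Compute sum_{t=0}^{H-1} gamma^t * r_t for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def estimate_return(env_step: EnvStep, policy: Policy, o0: int,
                    horizon: int, gamma: float, episodes: int = 1) -> float:
    """Monte Carlo estimate of the discounted return over `episodes` rollouts."""
    total = 0.0
    for _ in range(episodes):
        history: History = [o0]
        rewards: List[float] = []
        for _ in range(horizon):
            action = policy(history)  # the policy sees the full history
            next_obs, reward = env_step(history[-1], action)
            rewards.append(reward)
            history.extend([action, next_obs])
        total += discounted_return(rewards, gamma)
    return total / episodes


def toy_env_step(obs: int, action: int) -> Tuple[int, float]:
    """Hypothetical deterministic chain: move by `action`, reward 1.0 on reaching 2."""
    next_obs = obs + action
    return next_obs, (1.0 if next_obs == 2 else 0.0)


def always_right(history: History) -> int:
    """Hypothetical policy that ignores the history and always steps right."""
    return 1


j = estimate_return(toy_env_step, always_right, o0=0, horizon=3, gamma=0.5)
print(j)
```

Here the only reward arrives at step $t=1$, so the estimate equals $\gamma^{1}\cdot 1$.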
We will use $\widehat{\mathcal{E}}$ as a deliberately broad abstraction for the learned object under evaluation:

$$\widehat{\tau}_{t:t+H}\sim\widehat{\mathcal{E}}(\cdot\mid h_{t},a_{t:t+H-1},c),$$

where $c$ may include language instructions, goals, embodiment information, camera viewpoint, latent state, or policy context\. Depending on the paper, the output may be pixels, latent states, rewards, values, success probabilities, uncertainty estimates, action proposals, or some combination of these\.

This notation is intentionally more general than the classical transition model $P(s_{t+1},r_{t}\mid s_{t},a_{t})$\. Some surveyed models output only videos; some output only latent states; some output only scores or judgments\. Our notation abstracts over these cases so that differences in evaluation can be discussed in a common language\.

The key point is therefore not output modality\. The key question is what decisions the model is meant to support, and what evidence has been provided for that use\.
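
One way to picture this modality-agnostic abstraction is as a single query interface whose outputs are all optional. The sketch below is an assumption made for illustration, not an interface proposed by any surveyed work: `WorldModel`, `PredictedRollout`, `RewardOnlyModel`, and `available_outputs` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Protocol, Sequence


@dataclass
class PredictedRollout:
    """Whatever a model emits for one query; fields it does not produce stay None."""
    observations: Optional[List[Any]] = None  # e.g. decoded frames
    latents: Optional[List[Any]] = None       # e.g. predicted embeddings
    rewards: Optional[List[float]] = None
    success_probability: Optional[float] = None


class WorldModel(Protocol):
    """Hypothetical interface: query with a history, an action sequence, and context c."""

    def predict(self, history: Sequence[Any], actions: Sequence[Any],
                context: Dict[str, Any]) -> PredictedRollout:
        ...


class RewardOnlyModel:
    """Toy model that predicts one (zero) reward per action and nothing else."""

    def predict(self, history: Sequence[Any], actions: Sequence[Any],
                context: Dict[str, Any]) -> PredictedRollout:
        return PredictedRollout(rewards=[0.0 for _ in actions])


def available_outputs(rollout: PredictedRollout) -> List[str]:
    """Names of the output fields that the model actually filled in."""
    return [name for name, value in vars(rollout).items() if value is not None]


model: WorldModel = RewardOnlyModel()
rollout = model.predict(history=[0], actions=[1, 1, 1], context={"goal": "reach"})
print(available_outputs(rollout))
```

Under this framing, which fields are populated determines which evaluations can even be run, while the intended decision use determines which of them carry evidential weight.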

### 2\.4 Decision uses and evidential burdens

We will use $U$ to denote the intended *decision use* of the model:

$$U\in\{\text{prediction},\ \text{policy evaluation},\ \text{planning},\ \text{policy optimization},\ \text{synthetic data},\ \text{representation}\}.$$

Different uses place pressure on different properties\.

- If $U=\text{prediction}$, visual fidelity, semantics, and physical plausibility may be primary\.
- If $U=\text{policy evaluation}$, policy ranking and value calibration become central\.
- If $U=\text{planning}$, action sensitivity and long\-horizon outcome fidelity become important\.
- If $U=\text{policy optimization}$, exploitability, distribution shift, and uncertainty become especially important\.
- If $U=\text{synthetic data}$, the main question becomes whether generated rollouts improve downstream learning\.
- If $U=\text{representation}$, the emphasis may shift toward abstraction, probing, and planning relevance rather than decoding quality\.

This is why a single use\-independent world\-model score is difficult to defend\. Different tasks and uses genuinely make different demands\.
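
The mapping from uses to stressed properties listed above can be written down as a small lookup table, which also gives a mechanical way to flag a possible claim/evidence mismatch. This is only an illustrative sketch: `DecisionUse`, `PRIMARY_EVIDENCE`, and `missing_evidence` are hypothetical names, and the property strings simply paraphrase the bullets.

```python
from enum import Enum
from typing import Dict, List


class DecisionUse(Enum):
    PREDICTION = "prediction"
    POLICY_EVALUATION = "policy evaluation"
    PLANNING = "planning"
    POLICY_OPTIMIZATION = "policy optimization"
    SYNTHETIC_DATA = "synthetic data"
    REPRESENTATION = "representation"


# Properties each use places pressure on, paraphrased from the list above.
PRIMARY_EVIDENCE: Dict[DecisionUse, List[str]] = {
    DecisionUse.PREDICTION: ["visual fidelity", "semantics", "physical plausibility"],
    DecisionUse.POLICY_EVALUATION: ["policy ranking", "value calibration"],
    DecisionUse.PLANNING: ["action sensitivity", "long-horizon outcome fidelity"],
    DecisionUse.POLICY_OPTIMIZATION: ["exploitability", "distribution shift", "uncertainty"],
    DecisionUse.SYNTHETIC_DATA: ["downstream learning improvement"],
    DecisionUse.REPRESENTATION: ["abstraction", "probing", "planning relevance"],
}


def missing_evidence(use: DecisionUse, reported: List[str]) -> List[str]:
    """Properties stressed for `use` that the reported evidence does not cover."""
    reported_set = set(reported)
    return [prop for prop in PRIMARY_EVIDENCE[use] if prop not in reported_set]


# A hypothetical paper claims policy-evaluation utility but reports mostly visual evidence.
gaps = missing_evidence(DecisionUse.POLICY_EVALUATION, ["visual fidelity", "policy ranking"])
print(gaps)
```

Because the table is keyed by use, the same set of reported metrics can be complete for one claim and incomplete for another.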

### 2\.5 Open\-loop, closed\-loop, and intervention

Several distinctions are central for the rest of the paper\.

#### Open\-loop vs\. closed\-loop\.

In open\-loop evaluation, the model predicts future observations from a ground\-truth history\. In closed\-loop evaluation, the policy acts on model\-generated histories and the model must remain coherent under its own predictions\. Many world models look strong open\-loop and degrade sharply closed\-loop\.
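
A toy numerical sketch can make this degradation visible. Everything below is an assumption chosen for illustration: scalar states, true dynamics `s + a`, a learned model with a 10% multiplicative bias on the state, and a policy that always outputs `1.0`. Open-loop error is measured one step ahead from ground-truth states; closed-loop error feeds the model its own predictions.

```python
from typing import List


def true_step(state: float, action: float) -> float:
    """Hypothetical ground-truth dynamics."""
    return state + action


def model_step(state: float, action: float) -> float:
    """Hypothetical learned model with a 10% multiplicative bias on the state."""
    return 1.1 * state + action


def policy(observation: float) -> float:
    """Hypothetical policy; it receives whichever state it is shown."""
    return 1.0


def real_rollout(s0: float, horizon: int) -> List[float]:
    """States visited when the policy acts in the real environment."""
    states = [s0]
    for _ in range(horizon):
        states.append(true_step(states[-1], policy(states[-1])))
    return states


def open_loop_errors(real_states: List[float]) -> List[float]:
    """One-step prediction errors, always conditioning on ground-truth states."""
    errors = []
    for s_t, s_next in zip(real_states[:-1], real_states[1:]):
        prediction = model_step(s_t, policy(s_t))
        errors.append(abs(prediction - s_next))
    return errors


def closed_loop_errors(real_states: List[float]) -> List[float]:
    """Errors when the policy and the model both consume the model's own predictions."""
    errors = []
    model_state = real_states[0]
    for s_next in real_states[1:]:
        model_state = model_step(model_state, policy(model_state))
        errors.append(abs(model_state - s_next))
    return errors


states = real_rollout(s0=1.0, horizon=5)
open_err = open_loop_errors(states)
closed_err = closed_loop_errors(states)
print(open_err[-1], closed_err[-1])
```

In this toy setting the two errors coincide at the first step and then diverge, because closed-loop rollout compounds the bias. With a state-dependent policy the chosen actions would drift as well, which is the policy-induced distribution shift discussed in this paper.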

#### Observational vs\. interventional prediction\.

A model can predict likely futures under the behavior\-policy data distribution without correctly modeling the consequences of *alternative* actions\. For decision\-making, the key object is

$$\mathbb{P}_{\mathcal{E}}(o_{t+1:t+H}\mid h_{t},\mathrm{do}(a_{t:t+H-1})),$$

not merely

$$\mathbb{P}_{\mathcal{E}}(o_{t+1:t+H}\mid h_{t}).$$
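
A small worked example shows how far observational statistics can sit from interventional ones when the behavior policy is biased. The setup is entirely hypothetical: a binary factor `z` that the behavior policy could see but that is not recorded in the history, a binary action `a`, and a binary success outcome. All probabilities are invented for illustration and computed exactly rather than sampled.

```python
P_Z1 = 0.5  # hypothetical prior probability that the unrecorded factor z equals 1


def prior(z: int) -> float:
    """P(z)."""
    return P_Z1 if z == 1 else 1.0 - P_Z1


def behavior_prob(a: int, z: int) -> float:
    """P(a | z) under the behavior policy, which prefers a=1 when z=1."""
    p_a1 = 0.9 if z == 1 else 0.1
    return p_a1 if a == 1 else 1.0 - p_a1


def success_prob(z: int, a: int) -> float:
    """P(success | z, a): z matters a lot, the action only a little."""
    return 0.2 + 0.6 * z + 0.1 * a


def observational(a: int) -> float:
    """P(success | a) in logged data: z is weighted by P(z | a) via Bayes' rule."""
    joint = {z: prior(z) * behavior_prob(a, z) for z in (0, 1)}
    norm = sum(joint.values())
    return sum(joint[z] / norm * success_prob(z, a) for z in (0, 1))


def interventional(a: int) -> float:
    """P(success | do(a)): z keeps its prior because the action is set externally."""
    return sum(prior(z) * success_prob(z, a) for z in (0, 1))


obs_gap = observational(1) - observational(0)
do_gap = interventional(1) - interventional(0)
print(round(obs_gap, 2), round(do_gap, 2))
```

In this example the logged data suggest that choosing `a=1` raises success by roughly 0.58, whereas forcing the action raises it by only about 0.10. A model fit only to the logged distribution could therefore rank actions or policies far too optimistically.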

#### Outcome fidelity\.

For models intended to inform policy choice or policy improvement, preserving task\-relevant outcomes, including rewards, success predicates, progress variables, and constraints, is often more important than preserving every pixel\.
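
The gap between pixel agreement and outcome agreement can be illustrated with a deliberately tiny example. The one-dimensional "frames", the success predicate (`object index >= GOAL_INDEX`), and both predictions below are hypothetical constructions, not data from any surveyed model.

```python
from typing import List

GOAL_INDEX = 7  # hypothetical task: success means the object ends at index >= 7


def one_hot(index: int, length: int = 10) -> List[float]:
    """A 1-D 'frame' with a single bright pixel marking the object."""
    return [1.0 if i == index else 0.0 for i in range(length)]


def pixel_mse(pred: List[float], target: List[float]) -> float:
    """Mean squared pixel error between two frames."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def object_position(frame: List[float]) -> int:
    """Index of the brightest pixel, read as the object's location."""
    return max(range(len(frame)), key=lambda i: frame[i])


def success(frame: List[float]) -> bool:
    """Task-level outcome extracted from a final frame."""
    return object_position(frame) >= GOAL_INDEX


true_final = one_hot(7)        # the object just reaches the goal region
blurry_near_miss = [0.0] * 10  # mass split across indices 6 and 7, peak at 6
blurry_near_miss[6], blurry_near_miss[7] = 0.6, 0.4
sharp_but_shifted = one_hot(8)  # wrong pixel, right side of the goal boundary

for name, pred in [("blurry_near_miss", blurry_near_miss),
                   ("sharp_but_shifted", sharp_but_shifted)]:
    print(name, pixel_mse(pred, true_final), success(pred) == success(true_final))
```

Here the blurry prediction has the lower pixel error but implies the wrong task outcome, while the shifted prediction has the higher pixel error but the right outcome. The example is contrived, yet it shows why pixel-level and outcome-level scores can rank the same predictions in opposite orders.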

#### Representation sufficiency\.

A model may fail to reconstruct visual detail yet still preserve the abstract state structure needed for planning\. Conversely, a model may reconstruct visually plausible futures while failing to encode the causal distinctions that matter for action selection\[[48](https://arxiv.org/html/2606.15032#bib.bib67)\]\.

## 3 What Is Being Evaluated?

This section is intentionally inventory\-like\. The tables in this section enumerate some related recent works, grouped by what they most directly evaluate \(the levels L0–L7 are described in the next section\)\. The goal is not to assign blame\. The goal is to separate distinct evaluation cultures that are often collapsed under the single phrase “world\-model evaluation\.”

### 3\.1 Benchmark and diagnostic literature

Table 2: Benchmark and diagnostic literature\. The last column lists the main levels of the L0–L7 ladder occupied by the benchmark\.

| Work | Claimed object | What is actually evaluated | Representative metrics or outputs | Main levels |
| --- | --- | --- | --- | --- |
| EVA\-Bench\[[10](https://arxiv.org/html/2606.15032#bib.bib10)\] | embodied future\-video anticipation | Offline embodied video prediction from current visual context and language; action description, next\-step prediction, how\-to generation, finish\-thinking, and future\-video quality | BLEU, METEOR, ROUGE\-L, CIDEr, SPICE, CLIPScore, GPT\-4o score; SC, BC, MS, FVD, GCE; EVA\-Score | L1–L3 |
| WorldSimBench\[[43](https://arxiv.org/html/2606.15032#bib.bib11)\] | video generation models as world simulators | Explicit perceptual evaluation in open embodied, driving, and manipulation settings; implicit closed\-loop evaluation in Minecraft, CARLA, and CALVIN | Human Preference Evaluator scores; route completion, infractions, driving score, resource collection, CALVIN task success | L0–L3, L7 |
| EWMBench\[[61](https://arxiv.org/html/2606.15032#bib.bib12)\] | embodied robot\-video benchmark | Scene consistency, trajectory correctness, dynamics, semantics, diversity, and logical consistency in robot manipulation videos | SceneC, HSD, nDTW, DYN, Diversity, BLEU, CLIP score, logical\-error penalty | L1–L3 |
| DreamGen Bench\[[26](https://arxiv.org/html/2606.15032#bib.bib13)\] | benchmark for robot video world models used in policy learning | Instruction following and physics alignment across RoboCasa and GR1 object/behavior/environment generalization settings | GPT\-4o, Qwen2\.5\-VL, human evaluation, VideoCon\-Physics; IF, PA | L2–L3 |
| WoW\-World\-Eval\[[14](https://arxiv.org/html/2606.15032#bib.bib24)\] | embodied world\-model Turing test | Video quality, instruction understanding, physical law, planning DAG quality, replay executability, OOD generalization, and human deception | FVD/PSNR/SSIM/DINO/DreamSim; sequence match and execution quality; trajectory L2/DTW/FD; replay success; deceive\-human ratio | L0–L4, L7 |
| RBench\[[11](https://arxiv.org/html/2606.15032#bib.bib23)\] | robot\-oriented video generation benchmark | Task correctness and embodiment\-specific plausibility for common manipulation, long\-horizon planning, collaboration, spatial relations, and visual reasoning | VLM/LLM judging of PSS, TAC, RSS; programmatic motion amplitude and smoothness | L0–L3 |
| WorldArena\[[45](https://arxiv.org/html/2606.15032#bib.bib22)\] | unified benchmark for perception and functional utility | Open\-loop video quality plus world model as data engine, policy evaluator, and action planner | 16 video metrics; EWMScore; policy performance gain; correlation with simulator; planner success; human evaluation | L0–L4, L6, L7 |
| RoboWM\-Bench\[[28](https://arxiv.org/html/2606.15032#bib.bib25)\] | benchmark for world models in robotic manipulation | Whether generated human\-hand or robot videos can be converted into executable actions that complete tasks in simulation | Task\-level and step\-level success; real\-to\-sim consistency; video\-to\-action replay reliability | L4, L7 |
| WorldScore\[[13](https://arxiv.org/html/2606.15032#bib.bib16)\] | world\-generation benchmark | Static and dynamic world generation under layout and camera control | Camera control, object control, content alignment, 3D consistency, photo consistency, motion accuracy, motion smoothness, WorldScore | L0–L3 |
| WorldModelBench\[[32](https://arxiv.org/html/2606.15032#bib.bib14)\] | benchmark for judging video generators as world models | Instruction following, commonsense, and physics adherence of generated videos across domains | Human labels and trained VLM judge; instruction\-following level, framewise/temporal quality, physics sub\-scores, ELO | L2–L3 |
| WorldPrediction\[[6](https://arxiv.org/html/2606.15032#bib.bib17)\] | high\-level world modeling and procedural planning benchmark | Multiple\-choice action or action\-sequence selection from initial and final states | Single\-step world\-modeling accuracy; multi\-step procedural\-planning accuracy | L4, L7 |
| AutumnBench / WorldTest\[[52](https://arxiv.org/html/2606.15032#bib.bib21)\] | environment\-level query benchmark after exploration | Masked frame prediction, change detection, and planning in text\-based grid\-world POMDPs | MFP accuracy, change\-detection score, planning success, aggregate score | L1, L4, L7 |
| PBench\[[42](https://arxiv.org/html/2606.15032#bib.bib15)\] | Physical\-AI image\-to\-video benchmark | Domain\-specific physical and commonsense QA plus generic video quality | Domain score via Qwen2\.5\-VL QA; VBench\-style quality score; overall score | L0–L3 |
| MVP\[[31](https://arxiv.org/html/2606.15032#bib.bib18)\] | shortcut\-resistant physical video QA | Minimal\-pair physical understanding across human\-object, robot\-object, intuitive\-physics, and temporal\-reasoning settings | Paired minimal\-pair accuracy | L3 |
| IntPhys 2\[[4](https://arxiv.org/html/2606.15032#bib.bib19)\] | intuitive\-physics benchmark | Possible vs\. impossible events in complex synthetic scenes | Overall and difficulty\-split accuracy; pairwise and single\-video evaluation | L3 |
| CausalVQA\[[15](https://arxiv.org/html/2606.15032#bib.bib20)\] | causal reasoning benchmark for video models | Counterfactual, hypothetical, anticipation, planning, and descriptive reasoning on real egocentric videos | Paired accuracy, unpaired accuracy, reasoning accuracy, difficulty splits, human baseline | L3 |

The benchmark literature has clearly moved beyond pure FID/FVD\. Newer suites ask about instruction following, physical plausibility, trajectory correctness, executability, and even policy\-centric use\[[43](https://arxiv.org/html/2606.15032#bib.bib11),[45](https://arxiv.org/html/2606.15032#bib.bib22),[28](https://arxiv.org/html/2606.15032#bib.bib25)\]\. Nevertheless, many benchmark suites still place most of their evaluative weight on *artifact quality*: the generated video, description, or QA answer, rather than on whether the model supports reliable policy evaluation or policy improvement\.

### 3\.2 Environment\-model and policy\-optimization literature

Table 3: Environment\-model and policy\-optimization literature\. These papers are closer to the original decision\-making use of world models, but they vary substantially in how directly they evaluate that use\.

| Work | Primary use claim | What is actually evaluated | Representative metrics or outputs | Main levels |
| --- | --- | --- | --- | --- |
| WHALE\[[63](https://arxiv.org/html/2606.15032#bib.bib6)\] | generalizable embodied decision world model | Value estimation, video fidelity, uncertainty estimation, and downstream offline policy optimization under generalization shift | Value\-estimation quality, video fidelity, uncertainty quality, policy optimization gains | L4–L7 |
| ACEM / GALILEO\[[8](https://arxiv.org/html/2606.15032#bib.bib7)\] | counterfactual environment\-model learning | Counterfactual prediction, off\-policy evaluation, offline RL, and online decision making under behavior\-policy bias | Counterfactual prediction accuracy, OPE quality, policy\-improvement performance | L4, L6, L7 |
| ADM\-v2\[[37](https://arxiv.org/html/2606.15032#bib.bib9)\] | full\-horizon dynamics model for offline learning | Full\-horizon rollout quality via off\-policy evaluation and offline RL | OPE reliability, full\-horizon rollout performance, offline RL returns | L4, L6, L7 |
| PCM\[[7](https://arxiv.org/html/2606.15032#bib.bib8)\] | policy\-conditioned environment model | Value estimation, policy selection, and MPC under policy\-distribution shift | Value\-gap reduction, policy\-evaluation quality, MPC performance | L4, L6, L7 |
| DayDreamer\[[53](https://arxiv.org/html/2606.15032#bib.bib4)\] | world model for real\-robot RL | End\-to\-end policy learning on real robots; no standalone world\-model benchmark | Policy success, sample efficiency, robot learning time | L7 |
| UniSim\[[58](https://arxiv.org/html/2606.15032#bib.bib5)\] | interactive real\-world simulator | Video\-generation quality plus downstream policy learning and synthetic\-video\-based captioning | FID, FVD, IS, CLIP; Language Table success; CIDEr | L0–L1, L7 |
| DiWA\[[5](https://arxiv.org/html/2606.15032#bib.bib35)\] | diffusion\-policy adaptation with world models | Reward classifier quality, imagined PPO updates, and downstream policy adaptation | Precision/Recall for reward; policy success; sample efficiency | L5, L7 |
| World4RL\[[29](https://arxiv.org/html/2606.15032#bib.bib36)\] | diffusion world model for policy refinement | Open\-loop video quality plus manipulation policy refinement in simulation and real robots | FID, FVD, LPIPS; success rate; interaction cost | L0–L1, L7 |
| VLA\-RFT\[[33](https://arxiv.org/html/2606.15032#bib.bib37)\] | world\-simulator RL fine\-tuning for VLAs | Image prediction quality and downstream success/robustness on LIBERO | MSE, PSNR, SSIM, LPIPS; success rate; perturbation success | L1, L7 |
| ProphRL\[[62](https://arxiv.org/html/2606.15032#bib.bib38)\] | future\-video world model with VLM reward | Video prediction, optical\-flow correctness, reward precision/recall, and policy success | PSNR, SSIM, tSSIM, flow EPE/cosine; RM precision/recall/FPR; success | L1, L5, L7 |
| WMPO\[[67](https://arxiv.org/html/2606.15032#bib.bib39)\] | world\-model\-based policy optimization for VLA models | Policy optimization outcome, successful\-trajectory length, and continual\-learning performance | Success rate, successful trajectory length, lifelong\-learning success | L7 |
| RehearseVLA\[[54](https://arxiv.org/html/2606.15032#bib.bib40)\] | simulated post\-training with physically consistent world model | Video quality, reward\-model quality, and policy success in simulation and real robots | FID, FVD, PSNR, SSIM, LPIPS; RM accuracy/precision/recall/F1; success | L1, L5, L7 |
| World\-Gymnast\[[46](https://arxiv.org/html/2606.15032#bib.bib41)\] | VLA RL inside a video world model | Mainly downstream policy performance under imagined RL | Success rate on tabletop and real\-robot tasks | L7 |
| RISE\[[56](https://arxiv.org/html/2606.15032#bib.bib42)\] | self\-improving robot policy with compositional world model | Multi\-view world\-model quality, progress\-value modeling, and real\-robot policy success | PSNR, SSIM, LPIPS, FVD, EPE; success and sub\-step scores | L1, L5, L7 |
| VLAW\[[21](https://arxiv.org/html/2606.15032#bib.bib43)\] | iterative co\-improvement of VLA and world model | Video quality, interaction\-event correctness, reward quality, and policy success | PSNR, SSIM, LPIPS, FID, FVD; TP/FN/TN/FP for interaction and reward; success | L1, L4, L5, L7 |
| GigaBrain\-0\.5M\[[19](https://arxiv.org/html/2606.15032#bib.bib44)\] | VLA trained with world\-model\-based RL | Process reward/value prediction quality and downstream real\-robot success | Inference time, MAE, MSE, RMSE, Kendall’s tau; success rate | L5, L7 |
| WoVR\[[30](https://arxiv.org/html/2606.15032#bib.bib45)\] | reliable simulator for VLA post\-training | Long\-horizon video quality, speed, and downstream policy success | LPIPS, FID, FVD, FloLPIPS, FPS; success rate | L1, L7 |
| World\-VLA\-Loop\[[38](https://arxiv.org/html/2606.15032#bib.bib46)\] | closed\-loop co\-training of world model and VLA policy | World\-model image quality, reward accuracy, and final success | SSIM, PSNR, LPIPS, MSE; reward accuracy; success rate | L1, L5, L7 |
| PlayWorld\[[60](https://arxiv.org/html/2606.15032#bib.bib47)\] | world model learned from autonomous play | Video quality, progress\-reward modeling, failure\-mode alignment, and policy success | LPIPS, SSIM, PSNR, MSE; RM accuracy; failure\-mode alignment; success | L1, L5, L7 |
| VLA\-MBPO\[[64](https://arxiv.org/html/2606.15032#bib.bib48)\] | practical model\-based RL for VLA models | Image prediction, reward prediction, inference time, rollout\-length ablation, and downstream success | LPIPS, PSNR, SSIM; reward ACC/F1; inference time; success | L1, L5, L7 |

This family lies closer to the decision\-making end of the spectrum because it evaluates what policies achieve when the model is used for imagination, fine\-tuning, or planning\. However, many papers in this family combine lower\-level video diagnostics with final policy success, leaving the contribution of the world model itself only partially isolated\. The clearest exceptions are the papers that make counterfactual generalization, full\-horizon OPE, policy shift, or uncertainty an explicit part of the evaluation, most notably ACEM, PCM, ADM\-v2, and WHALE\[[8](https://arxiv.org/html/2606.15032#bib.bib7),[7](https://arxiv.org/html/2606.15032#bib.bib8),[37](https://arxiv.org/html/2606.15032#bib.bib9),[63](https://arxiv.org/html/2606.15032#bib.bib6)\]\.

### 3\.3 Policy evaluation, executability, and synthetic\-data literature

Table 4: Policy\-evaluation, executability, and synthetic\-data literature\. These works are especially important because they move from artifact quality toward decision utility, though often through pipeline\-level metrics\.

| Work | Primary use claim | What is actually evaluated | Representative metrics or outputs | Main levels |
| --- | --- | --- | --- | --- |
| WorldGym\[[44](https://arxiv.org/html/2606.15032#bib.bib26)\] | world model as environment for policy evaluation | Qualitative rollout fidelity, qualitative action controllability, and correlation between model\-evaluated and real policy success | Pearson correlation; relative policy ranking | L4, L6 |
| Vid2World\[[25](https://arxiv.org/html/2606.15032#bib.bib27)\] | interactive world model from video diffusion | Video prediction quality and policy evaluation on robotic manipulation | FVD, FID, SSIM, PSNR, LPIPS, DreamSim; simulated vs\. real success | L1, L6 |
| Scalable Policy Evaluation\[[47](https://arxiv.org/html/2606.15032#bib.bib28)\] | video world models as policy evaluators | Video quality plus policy\-value correlation and ranking fidelity | PSNR, SSIM, FVD, latent L2; Pearson correlation; MMRV | L1, L6 |
| Gemini/Veo Simulator\[[18](https://arxiv.org/html/2606.15032#bib.bib29)\] | world simulator for policy evaluation and safety | Nominal evaluation, OOD ranking, and safety red\-teaming in a video simulator | Pearson correlation, MMRV, qualitative safety findings | L6 |
| dWorldEval\[[34](https://arxiv.org/html/2606.15032#bib.bib30)\] | robotic policy evaluation via diffusion world model | Action controllability, round\-trip consistency, and policy\-ranking fidelity | Δ\-LPIPS, round\-trip LPIPS, Pearson correlation, MMRV | L4, L6 |
| IRASim\[[66](https://arxiv.org/html/2606.15032#bib.bib31)\] | fine\-grained world model for manipulation | Video prediction, qualitative flexible controllability, policy evaluation, and model\-based planning | PSNR, SSIM, latent L2, FID, FVD; Pearson correlation; planning success | L1, L4, L6, L7 |
| DreamDojo\[[17](https://arxiv.org/html/2606.15032#bib.bib32)\] | generalist robot world model | Video prediction, human judgment of physics/action following, policy evaluation, and planning | PSNR, SSIM, LPIPS; human physics/action\-following; success, Pearson, MMRV | L1, L3, L6, L7 |
| Kinema4D\[[55](https://arxiv.org/html/2606.15032#bib.bib33)\] | 4D kinematic world model | RGB rollout quality, geometry quality, and policy\-evaluation quality | PSNR, SSIM, latent L2, FID, FVD, LPIPS; Chamfer, F\-score, temporal F\-score; success\-gap to ground truth | L1, L3, L6 |
| Persistent Robot World Models\[[3](https://arxiv.org/html/2606.15032#bib.bib34)\] | stabilized multi\-step rollouts for policy evaluation | Per\-camera rollout quality, masked task\-relevant metrics, human preference, and policy ranking | SSIM, PSNR, LPIPS; temporal curves; masked metrics; 2AFC/ELO; Pearson, MMRV | L1, L6 |
| DreMa\[[2](https://arxiv.org/html/2606.15032#bib.bib49)\] | synthetic imagination for imitation learning | Object final\-position accuracy and downstream imitation\-learning gains | Object final\-position error; policy success | L4, L7 |
| DreamGen\[[26](https://arxiv.org/html/2606.15032#bib.bib13)\] | synthetic robot\-video data engine | VLM/human instruction following and physics alignment plus downstream policy learning and generalization | IF, PA; downstream policy success | L2, L3, L7 |
| Ctrl\-World\[[22](https://arxiv.org/html/2606.15032#bib.bib50)\] | controllable generative world model | Video quality, qualitative action controllability, policy evaluation correlation, and policy improvement with synthetic successful trajectories | PSNR, SSIM, LPIPS, FID, FVD; evaluation correlation; success rate | L1, L4, L6, L7 |
| RoboMaster\[[16](https://arxiv.org/html/2606.15032#bib.bib51)\] | embodied action planning from generated videos | Bridge video quality, trajectory fidelity, user preference, action\-planning success, and IDM action quality | FVD, PSNR, SSIM, TrajError, user preference, planning success | L1, L4, L7 |
| GigaWorld\-0\[[20](https://arxiv.org/html/2606.15032#bib.bib52)\] | world\-model\-based data engine | Physical\-AI/world\-generation quality, filtering score, and action recovery from generated data | PBench, DreamGen Bench, quality filtering, IDM action recovery | L0–L4 |
| RoboVIP\[[49](https://arxiv.org/html/2606.15032#bib.bib53)\] | multi\-view video augmentation for manipulation | Multi\-view generation quality and downstream policy gains | FID, FVD, LPIPS, MV\-Mat; success rate | L1, L7 |
| Interactive World Simulator\[[51](https://arxiv.org/html/2606.15032#bib.bib54)\] | interactive simulator for policy training and evaluation | Video prediction, speed/stability, simulator\-collected training data, and evaluation correlation | MSE, LPIPS, FID, PSNR, SSIM, UIQI, FVD; FPS/stability; success; correlation | L1, L6, L7 |
| VLP\[[12](https://arxiv.org/html/2606.15032#bib.bib56)\] | video language planning | Human judgment of long\-horizon video\-plan completion and downstream execution quality | Human plan\-completion rate; task reward/completion | L2, L7 |
| Dreamitate\[[35](https://arxiv.org/html/2606.15032#bib.bib57)\] | video generation for visuomotor policy learning | Real\-robot policy success using generated videos as supervision or guidance | Success rate | L7 |
| RoboDreamer\[[65](https://arxiv.org/html/2606.15032#bib.bib58)\] | compositional world model for robot imagination | Video generation quality, human task completion judgment, and execution success | FVD; human completion judgment; RLBench success | L0, L2, L7 |
| RoboEnvision\[[57](https://arxiv.org/html/2606.15032#bib.bib59)\] | long\-horizon robot video generation | Long\-horizon video quality and downstream policy\-model task success | LPIPS, SSIM, PSNR, FVD, CLIP score; success rate | L1, L2, L7 |
| Genie Envisioner\[[36](https://arxiv.org/html/2606.15032#bib.bib60)\] | unified platform for manipulation | Real\-robot action\-model success and EWMBench\-style scene/motion/semantic world\-model evaluation | Success rate; SceneC, HSD, nDTW, DYN, Diversity, BLEU, CLIP, Logics | L1, L2, L3, L7 |
| EVA \(model\)\[[50](https://arxiv.org/html/2606.15032#bib.bib55)\] | executable video world model via IDM rewards | Human judgment of kinematic plausibility, interaction plausibility, instruction adherence, and execution success in sim and real robots | Human ratings; simulator success; real\-robot success; IDM task success | L2, L3, L7 |

This family is central for a decision\-making\-centric paper because it contains some of the strongest current attempts to evaluate world models as *functional* objects rather than as mere generators\. The key move is away from asking only “Does the video look plausible?” and toward asking “Does the model rank policies correctly?”, “Can it support planning?”, or “Can generated futures be executed?” At the same time, many of these are still *pipeline metrics*: executability depends on inverse dynamics or retargeting; synthetic\-data value depends on the downstream learner; policy\-evaluation correlation depends on reward checkers and the policy set being ranked\.
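To make the policy\-ranking question concrete, the following is a minimal sketch of how agreement between model\-estimated and real policy success might be scored\. The function names and the toy success rates are illustrative assumptions, not taken from any surveyed paper\. The `mmrv` function follows our understanding of the mean maximum rank violation reported by several of the policy\-evaluation works; treat its exact form as an assumption and check it against the cited definitions\.

```python
import numpy as np

def pearson(model_scores, real_scores):
    """Pearson correlation between model-estimated and real policy success."""
    m = np.asarray(model_scores, dtype=float)
    r = np.asarray(real_scores, dtype=float)
    return float(np.corrcoef(m, r)[0, 1])

def pairwise_ranking_accuracy(model_scores, real_scores):
    """Fraction of policy pairs (with distinct real scores) that the model orders correctly."""
    m = np.asarray(model_scores, dtype=float)
    r = np.asarray(real_scores, dtype=float)
    agree, total = 0, 0
    for i in range(len(r)):
        for j in range(i + 1, len(r)):
            if r[i] == r[j]:
                continue  # a real-world tie carries no ranking information
            total += 1
            agree += int(np.sign(m[i] - m[j]) == np.sign(r[i] - r[j]))
    return agree / total if total else float("nan")

def mmrv(model_scores, real_scores):
    """Mean maximum rank violation (assumed definition): for each policy, the largest
    real-performance gap to any policy that the model orders differently, averaged."""
    m = np.asarray(model_scores, dtype=float)
    r = np.asarray(real_scores, dtype=float)
    worst = np.zeros(len(r))
    for i in range(len(r)):
        for j in range(len(r)):
            if (m[i] < m[j]) != (r[i] < r[j]):
                worst[i] = max(worst[i], abs(r[i] - r[j]))
    return float(worst.mean())

# Hypothetical success rates for four policies.
real = [0.2, 0.5, 0.8, 0.9]
model = [0.3, 0.4, 0.9, 0.7]   # the model swaps the two best policies

print(pearson(model, real))
print(pairwise_ranking_accuracy(model, real))  # 5 of 6 pairs agree
print(mmrv(model, real))                       # only the 0.8 vs 0.9 pair is violated
```

All three numbers depend on which policies are in the set: adding many clearly bad policies raises the correlation without making the model any better at separating the good ones, which is one reason these remain pipeline\-level metrics\.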

### 3\.4 Latent and foundation\-model literature

Table 5: Latent and foundation\-model literature\. “repr\.” denotes cross\-cutting state\-abstraction or representation diagnostics that do not map cleanly to a single rung of the ladder but matter chiefly because they support higher\-level decision claims\.

| Work | Primary use claim | What is actually evaluated | Representative metrics or outputs | Main levels |
| --- | --- | --- | --- | --- |
| Cosmos Predict 2\.5\[[41](https://arxiv.org/html/2606.15032#bib.bib61)\] | diffusion\-based world foundation model | Physical\-AI domain competence, generic video quality, multi\-view robot geometry, action\-conditioned robot prediction, and DreamGen\-Bench performance | Domain score, quality score, TransErr, RotErr, Sampson error, PSNR, SSIM, latent L2, FVD | L0–L4 |
| ABot\-PhysWorld\[[9](https://arxiv.org/html/2606.15032#bib.bib62)\] | interactive world foundation model for manipulation | PBench/EZS\-Bench world\-generation quality and action\-to\-video trajectory consistency | Domain/Robot score, VBench\-style quality metrics, trajectory consistency | L0–L4 |
| EA\-WM\[[59](https://arxiv.org/html/2606.15032#bib.bib63)\] | event\-aware generative world model | Interaction quality, trajectory accuracy, depth accuracy, perspectivity, instruction following, semantic alignment, action following, and action recoverability | VLM interaction and perspective scores; trajectory NDTW; depth accuracy; KVAF translation/rotation/gripper error | L2–L4 |
| V\-JEPA 2\[[1](https://arxiv.org/html/2606.15032#bib.bib64)\] | latent predictive world model for understanding, prediction, and planning | Qualitative decoded rollouts, robot planning, action anticipation, classification, and video QA | Goal distance, success rate, planning time, Top\-1 accuracy, Recall@5, VQA accuracy | L7 \+ repr\. |
| V\-JEPA 2\.1\[[40](https://arxiv.org/html/2606.15032#bib.bib65)\] | dense\-feature latent predictive model | Robot planning, navigation, dense prediction, action anticipation, classification, and video QA | Success rate, ATE/RTE, RMSE, mIoU, J&F, Recall@5, Top\-1 accuracy | L7 \+ repr\. |
| LeWorldModel\[[39](https://arxiv.org/html/2606.15032#bib.bib66)\] | latent JEPA\-style world model from pixels | Latent\-space MPC planning, latent physical probing, and detection of physically implausible latent trajectories | Planning success, MSE, Pearson correlation, anomaly separation | L5, L7 \+ repr\. |
| Implicit World Model Evaluation\[[48](https://arxiv.org/html/2606.15032#bib.bib67)\] | formal evaluation of the state\-transition structure learned by generative models | Whether the model merges histories that correspond to the same true state and separates histories with different future possibilities | Sequence compression and sequence distinction | repr\. |

These papers make two important points\. First, a world model need not be a pixel generator: V\-JEPA 2, V\-JEPA 2\.1, and LeWorldModel are evaluated mainly through planning, probing, and downstream utility\[[1](https://arxiv.org/html/2606.15032#bib.bib64),[40](https://arxiv.org/html/2606.15032#bib.bib65),[39](https://arxiv.org/html/2606.15032#bib.bib66)\]\. Second, strong next\-step or surface\-level prediction can be misleading about state abstraction: Vafa et al\. show that a model may appear strong under standard probes while still failing to recover coherent transition structure\[[48](https://arxiv.org/html/2606.15032#bib.bib67)\]\. This matters because a decision world model may ultimately depend more on the right *state\-transition abstraction* than on the right pixels\.

### 3\.5 What is actually being evaluated?

Taken together, Tables 2–5 suggest that the literature is not evaluating a single object\.

#### First, there are multiple evaluation cultures\.

Benchmark papers such as EVA\-Bench, EWMBench, WorldModelBench, PBench, RBench, and WorldScore mainly evaluate the *generated artifact*: realism, semantics, or physics\[[10](https://arxiv.org/html/2606.15032#bib.bib10),[61](https://arxiv.org/html/2606.15032#bib.bib12),[32](https://arxiv.org/html/2606.15032#bib.bib14),[42](https://arxiv.org/html/2606.15032#bib.bib15),[11](https://arxiv.org/html/2606.15032#bib.bib23),[13](https://arxiv.org/html/2606.15032#bib.bib16)\]\. Policy\-evaluation papers such as WorldGym, Scalable Policy Evaluation, dWorldEval, and DreamDojo evaluate the model as a *surrogate evaluator of policies*\[[44](https://arxiv.org/html/2606.15032#bib.bib26),[47](https://arxiv.org/html/2606.15032#bib.bib28),[34](https://arxiv.org/html/2606.15032#bib.bib30),[17](https://arxiv.org/html/2606.15032#bib.bib32)\]\. Optimization papers evaluate the model as part of an *end\-to\-end control\-improvement pipeline*\[[53](https://arxiv.org/html/2606.15032#bib.bib4),[29](https://arxiv.org/html/2606.15032#bib.bib36),[33](https://arxiv.org/html/2606.15032#bib.bib37),[67](https://arxiv.org/html/2606.15032#bib.bib39)\]\. Latent papers evaluate the model as a *planning\-relevant representation*\[[1](https://arxiv.org/html/2606.15032#bib.bib64),[40](https://arxiv.org/html/2606.15032#bib.bib65),[39](https://arxiv.org/html/2606.15032#bib.bib66)\]\. Synthetic\-data and executability papers evaluate the model as a *data engine or executable planner*\[[26](https://arxiv.org/html/2606.15032#bib.bib13),[28](https://arxiv.org/html/2606.15032#bib.bib25),[16](https://arxiv.org/html/2606.15032#bib.bib51),[50](https://arxiv.org/html/2606.15032#bib.bib55)\]\.

#### Second, the literature is strongest when the claim and the evaluation match\.

DayDreamer, ACEM, PCM, ADM\-v2, WHALE, WorldGym, dWorldEval, and WorldArena are conceptually clearer because their evaluations directly target the use they claim: control, OPE, policy shift, full\-horizon rollout, policy evaluation, or functional utility\[[53](https://arxiv.org/html/2606.15032#bib.bib4),[8](https://arxiv.org/html/2606.15032#bib.bib7),[7](https://arxiv.org/html/2606.15032#bib.bib8),[37](https://arxiv.org/html/2606.15032#bib.bib9),[63](https://arxiv.org/html/2606.15032#bib.bib6),[44](https://arxiv.org/html/2606.15032#bib.bib26),[34](https://arxiv.org/html/2606.15032#bib.bib30),[45](https://arxiv.org/html/2606.15032#bib.bib22)\]\. By contrast, many works quite reasonably evaluate lower\-level artifact properties, but that evidence is sometimes discussed as if it also established a stronger decision\-making claim\.

#### Third, the main fault line is not simply “video vs\. RL\.”

The deeper distinction is between evaluating *the generated artifact* and evaluating *the decisions enabled by the model*\. Many VLA/RL papers report policy success but diagnose the world model mainly through L1\-style reconstruction metrics\. Many benchmark papers add semantics and physics but remain artifact\-centered\. Policy\-evaluation and executability papers move closer to decision utility, but often through black\-box or pipeline\-level metrics\.

#### Fourth, pixels are neither necessary nor sufficient\.

V\-JEPA 2, V\-JEPA 2\.1, and LeWorldModel show that a model can be useful for planning without photorealistic reconstruction\[[1](https://arxiv.org/html/2606.15032#bib.bib64),[40](https://arxiv.org/html/2606.15032#bib.bib65),[39](https://arxiv.org/html/2606.15032#bib.bib66)\]\. Conversely, WorldArena and RoboWM\-Bench make explicit that visual quality can diverge from functional embodied utility\[[45](https://arxiv.org/html/2606.15032#bib.bib22),[28](https://arxiv.org/html/2606.15032#bib.bib25)\]\.

### 3\.6 The world\-model claim

After reading the literature, we find that different papers use the same term, “world model”, while implicitly making different claims:

1. Future\-video claim: the model predicts plausible or realistic future observations\.
2. Policy\-evaluation claim: the model can estimate or rank the performance of candidate policies\.
3. Policy\-optimization claim: using the model inside a planner, optimizer, or RL loop improves policies in the real environment\.
4. Planning/executability claim: the model can help choose or recover action sequences that succeed in the real environment\.
5. Synthetic\-data claim: model\-generated samples improve downstream learning\.
6. Representation claim: the model’s latent state preserves task\-relevant transition structure sufficient for prediction and control\.

These claims are not interchangeable\. Evidence for one of them is not automatically evidence for another\.

| Claim type | What stronger supporting evidence would typically include | Common weaker evidence sometimes substituted | Why the substitution can be misleading |
| --- | --- | --- | --- |
| Future\-video / world\-generation claim | L0–L3 evidence: visual plausibility, logged\-future fidelity, semantic alignment, physical plausibility | None; these may be appropriate if this is the actual claim | The problem starts only when this evidence is later treated as if it also established policy evaluation or optimization utility\. |
| Policy\-evaluation claim | Closed\-loop rollouts of fixed policies with value/ranking agreement, ideally supported by L4–L5 diagnostics | FVD/PSNR, human preference, or instruction\-following scores alone | A model can generate plausible videos while misranking policies or misestimating success\. |
| Policy\-optimization claim | Real or trusted\-sim policy lift under fixed budgets, plus exploitability tests and L4–L6 diagnostics | Open\-loop image/video metrics, or a single downstream success number without decomposition | Final success entangles the world model with reward models, filters, optimizers, and data curation; open\-loop metrics do not test counterfactual usefulness\. |
| Planning / executability claim | Task success under model\-based planning or action recovery, ideally with action\-counterfactual diagnostics | Instruction\-following videos or human preference for plausible plans | A plausible\-looking plan can still be dynamically wrong or non\-executable\. |
| Synthetic\-data claim | Controlled downstream learning gains under matched training budgets and learners | Video aesthetics, FVD, or VLM preference over generated data | Beautiful data need not contain the right task\-relevant variation for learning\. |
| Representation claim | Planning utility, physical probes, or state\-abstraction diagnostics such as sequence compression/distinction | Decoder quality or linear probes alone | A model can reconstruct well or support shallow probing while still lacking the correct transition structure\. |

Table 6: A recurring issue in the literature is claim/evidence mismatch: lower\-level evidence is sometimes taken to support stronger world\-model claims than it can comfortably justify\.

This observation can be made concrete in the recent literature\. Benchmarks such as PBench, WorldModelBench, WorldScore, and RBench are informative if the claim is embodied world generation or physical\-video quality, but they do not by themselves establish policy\-evaluation or policy\-optimization utility\[[42](https://arxiv.org/html/2606.15032#bib.bib15),[32](https://arxiv.org/html/2606.15032#bib.bib14),[13](https://arxiv.org/html/2606.15032#bib.bib16),[11](https://arxiv.org/html/2606.15032#bib.bib23)\]\. Many RL/VLA papers such as World4RL, VLA\-RFT, RehearseVLA, RISE, WoVR, and VLA\-MBPO evaluate final policy success, which is important, but often diagnose the world model itself mainly through L1\-style image or video metrics\[[29](https://arxiv.org/html/2606.15032#bib.bib36),[33](https://arxiv.org/html/2606.15032#bib.bib37),[54](https://arxiv.org/html/2606.15032#bib.bib40),[56](https://arxiv.org/html/2606.15032#bib.bib42),[30](https://arxiv.org/html/2606.15032#bib.bib45),[64](https://arxiv.org/html/2606.15032#bib.bib48)\]\. Policy\-evaluation papers such as WorldGym, Scalable Policy Evaluation, Vid2World, dWorldEval, DreamDojo, and Persistent Robot World Models are closer to the decision\-making use, yet they mostly evaluate fixed\-policy ranking rather than optimizer interaction or exploitability\[[44](https://arxiv.org/html/2606.15032#bib.bib26),[47](https://arxiv.org/html/2606.15032#bib.bib28),[25](https://arxiv.org/html/2606.15032#bib.bib27),[34](https://arxiv.org/html/2606.15032#bib.bib30),[17](https://arxiv.org/html/2606.15032#bib.bib32),[3](https://arxiv.org/html/2606.15032#bib.bib34)\]\. Executability papers such as RoboWM\-Bench, RoboMaster, and EVA test a stronger bridge from video to control, but their performance also depends on inverse dynamics, retargeting, simulators, and success checkers\[[28](https://arxiv.org/html/2606.15032#bib.bib25),[16](https://arxiv.org/html/2606.15032#bib.bib51),[50](https://arxiv.org/html/2606.15032#bib.bib55)\]\. Finally, latent papers show the converse point: planning utility can exist without high\-quality pixels\[[1](https://arxiv.org/html/2606.15032#bib.bib64),[39](https://arxiv.org/html/2606.15032#bib.bib66),[48](https://arxiv.org/html/2606.15032#bib.bib67)\]\.
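As a rough illustration of how a claim/evidence check could be mechanized, the sketch below encodes one possible reading of the claim types above as sets of level numbers\. The mapping, the dictionary name `CLAIM_EVIDENCE`, and the example inputs are assumptions made for illustration; the representation claim is left out because “repr\.” diagnostics do not sit on a single level\.

```python
# Hypothetical encoding: "direct" levels can support the claim on their own;
# "diagnostic" levels are the supporting checks one would ideally also report.
CLAIM_EVIDENCE = {
    "future_video":        {"direct": {0, 1, 2, 3}, "diagnostic": set()},
    "policy_evaluation":   {"direct": {6},          "diagnostic": {4, 5}},
    "policy_optimization": {"direct": {7},          "diagnostic": {4, 5, 6}},
    "planning":            {"direct": {7},          "diagnostic": {4}},
    "synthetic_data":      {"direct": {7},          "diagnostic": set()},
}

def check_claim(claim, reported_levels):
    """Return (has_direct_evidence, missing_diagnostic_levels) for a claim."""
    spec = CLAIM_EVIDENCE[claim]
    reported = set(reported_levels)
    has_direct = bool(spec["direct"] & reported)
    missing = sorted(spec["diagnostic"] - reported)
    return has_direct, missing

# A hypothetical paper that claims policy optimization and reports L0, L1, and L7:
print(check_claim("policy_optimization", {0, 1, 7}))  # (True, [4, 5, 6])

# A hypothetical paper that claims policy evaluation but reports only L0-L2:
print(check_claim("policy_evaluation", {0, 1, 2}))    # (False, [4, 5])
```

The first case mirrors the pattern described above: final success is reported, but the intermediate diagnostics that would isolate the world model's contribution are absent\.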

The conclusion is not that the literature has been “doing it wrong\.” Rather, the field has been evaluating several different things under one name\. The next section formalizes this with an evaluation ladder\.

## 4 World\-Model Evaluation Ladder

The survey suggests an L0–L7 ladder of evaluation targets\. The ladder is *descriptive* because it summarizes what the literature already measures\. It is also, to some extent, *normative*, because the levels answer increasingly strong questions about whether a model supports embodied decision\-making\. Lower levels are not useless; they generally answer weaker questions about decision use\.

| Level | Name | Core question | Typical metrics | Representative works |
| --- | --- | --- | --- | --- |
| L0 | Visual plausibility | Does the output look like a realistic image or video? | FID, FVD, aesthetics, image quality, human preference, VBench\-style scores | UniSim, WorldScore, PBench, Cosmos Predict 2\.5, ABot\-PhysWorld |
| L1 | Logged\-future prediction | Does the predicted future match held\-out trajectories from the behavior distribution? | MSE, PSNR, SSIM, LPIPS, DreamSim, latent L2, tSSIM, optical\-flow error | World4RL, VLA\-RFT, ProphRL, IRASim, Kinema4D, Persistent |
| L2 | Semantic alignment | Does the rollout match the instruction, task, and scene semantics? | CLIPScore, caption metrics, VLM/LLM judges, task\-completion labels | EVA\-Bench, DreamGen Bench, RBench, WorldArena, EA\-WM |
| L3 | Physical plausibility | Does the rollout obey intuitive physical and geometric constraints? | Physics QA, VLM physics scores, object permanence, depth, contact, trajectory, 3D consistency | WorldModelBench, PBench, MVP, IntPhys 2, CausalVQA, WorldArena |
| L4 | Action controllability / counterfactual fidelity | Does changing actions produce the correct task\-relevant changes? | Action\-effect tests, trajectory error, Δ\-LPIPS, round\-trip consistency, counterfactual OPE diagnostics | ACEM, PCM, WHALE, dWorldEval, WorldGym, IRASim, RoboMaster |
| L5 | Reward, value, and outcome fidelity | Does the model predict success, reward, progress, or value accurately enough for decision making? | Reward accuracy, precision/recall/F1, calibration, value error, Kendall’s tau, progress ranking | DiWA, ProphRL, RISE, VLAW, GigaBrain, PlayWorld, VLA\-MBPO |
| L6 | Policy evaluation and ranking | Does model\-based evaluation agree with real/simulator policy performance? | Pearson/Spearman correlation, pairwise ranking accuracy, MMRV, calibration | WorldGym, Vid2World, Scalable Policy Evaluation, dWorldEval, DreamDojo |
| L7 | Policy optimization / planning utility | Does using the model improve decisions? | Policy lift, optimization regret, sample efficiency, planning success, safe improvement, exploitability gap | DayDreamer, ADM\-v2, WMPO, DreamGen, VLP, EVA, V\-JEPA 2 |

Table 7: The L0–L7 ladder\. Levels are ordered by the strength of the question they answer about world\-model usefulness for embodied decision making\.

The levels are not mutually exclusive, and they are not strictly monotone in practice\. A latent model might score modestly on L0 or L1 and still be useful at L7; a video model might score highly at L0–L3 and still be weak at L6\. The ladder is therefore best read as a hierarchy of increasingly direct evidence for decision\-making claims, not as a simple linear scorecard\.

L0 \(Visual plausibility\) asks whether the output *looks* like a plausible image or video\. This is the dominant level in video\-generation evaluations, where metrics such as FID, FVD, aesthetics, image quality, and human preference are natural first checks\[[58](https://arxiv.org/html/2606.15032#bib.bib5),[13](https://arxiv.org/html/2606.15032#bib.bib16),[42](https://arxiv.org/html/2606.15032#bib.bib15),[41](https://arxiv.org/html/2606.15032#bib.bib61),[9](https://arxiv.org/html/2606.15032#bib.bib62)\]\. L0 is useful because obvious visual failures often indicate model collapse, temporal artifacts, or poor conditioning\. It also matters for settings where humans inspect rollouts or where generated data must pass a basic realism filter\.

But L0 is only indirect evidence of decision usefulness\. A model can attain good L0 by producing smooth videos, copying backgrounds, or generating semantically plausible but action\-insensitive futures\. L0 therefore diagnoses surface quality, not whether the model gets the *consequences of chosen actions* right\.

L1 \(Logged\-future prediction\) asks whether a predicted future matches a held\-out future from the behavior distribution\. This is the standard open\-loop world\-model setting of MSE, PSNR, SSIM, LPIPS, DreamSim, latent L2, temporal SSIM, or optical\-flow accuracy\[[29](https://arxiv.org/html/2606.15032#bib.bib36),[33](https://arxiv.org/html/2606.15032#bib.bib37),[62](https://arxiv.org/html/2606.15032#bib.bib38),[66](https://arxiv.org/html/2606.15032#bib.bib31),[55](https://arxiv.org/html/2606.15032#bib.bib33),[3](https://arxiv.org/html/2606.15032#bib.bib34)\]\. L1 is a valuable sanity check: if a model cannot predict held\-out futures from the data distribution at all, it is less likely to support higher\-level use\.

However, L1 remains observational\. It measures whether the model matches *what happened*, not whether it predicts *what would happen under different actions or policies*\. In stochastic settings, L1 can also punish plausible alternative futures\. A model can therefore be mediocre at L1 but still useful for planning if it preserves reward\-relevant structure; conversely, it can be strong at L1 yet fail under policy shift\.
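The point about stochastic settings can be checked with a toy calculation\. The sketch below assumes a hypothetical one\-dimensional outcome that is \-1 or \+1 with equal probability and compares the expected squared error of two predictions: the mean of the two futures, which never actually occurs, and one of the genuinely possible futures\.

```python
import numpy as np

# Hypothetical bimodal future: the object ends up at -1 or +1 with equal probability.
futures = np.array([-1.0, 1.0])
probs = np.array([0.5, 0.5])

def expected_mse(prediction):
    """Expected squared error of a single point prediction over the possible futures."""
    return float(np.sum(probs * (futures - prediction) ** 2))

mean_prediction = float(np.sum(probs * futures))  # 0.0, an outcome that never happens
plausible_prediction = 1.0                        # one of the two real outcomes

print(expected_mse(mean_prediction))       # 1.0
print(expected_mse(plausible_prediction))  # 2.0
```

Under this metric the averaged, physically impossible prediction scores better than a sample that commits to a real outcome, which is why a low L1 error should not be read as evidence that the model captures the structure a planner needs\.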

L2 \(Semantic alignment\) evaluates whether the rollout matches the instruction, task, objects, and scene semantics\. This level appears prominently in EVA\-Bench, DreamGen Bench, RBench, WorldArena, WoW\-World\-Eval, and EA\-WM\[[10](https://arxiv.org/html/2606.15032#bib.bib10),[26](https://arxiv.org/html/2606.15032#bib.bib13),[11](https://arxiv.org/html/2606.15032#bib.bib23),[45](https://arxiv.org/html/2606.15032#bib.bib22),[14](https://arxiv.org/html/2606.15032#bib.bib24),[59](https://arxiv.org/html/2606.15032#bib.bib63)\]\. Metrics include CLIPScore, caption overlap, VLM/LLM judges, and task\-completion labels\.

L2 is often important for language\-conditioned systems because a model that misunderstands the task is unlikely to support meaningful downstream use\. Still, L2 remains artifact\-centered\. VLM judges and semantic similarity scores can reward plausible narratives rather than faithful dynamics\. A video may look as if it follows the instruction while still encoding the wrong object contact, wrong force direction, or wrong success state\.

L3 \(Physical plausibility\) asks whether the rollout obeys intuitive physics and geometric consistency: object permanence, continuity, gravity, non\-penetration, contact, depth, or trajectory coherence\. This level is prominent in WorldModelBench, PBench, WorldArena, MVP, IntPhys 2, CausalVQA, and EA\-WM \[[32](https://arxiv.org/html/2606.15032#bib.bib14),[42](https://arxiv.org/html/2606.15032#bib.bib15),[45](https://arxiv.org/html/2606.15032#bib.bib22),[31](https://arxiv.org/html/2606.15032#bib.bib18),[4](https://arxiv.org/html/2606.15032#bib.bib19),[15](https://arxiv.org/html/2606.15032#bib.bib20),[59](https://arxiv.org/html/2606.15032#bib.bib63)\]\. Relative to L0 and L2, L3 is a meaningful advance because it penalizes physically impossible or causally incoherent artifacts\.

Yet L3 is still not enough for decision use unless it is tied to interventions\. Many physical benchmarks ask whether a video looks physically plausible or whether a model answers a physical question correctly\. Those are valuable diagnostics, but they do not directly test whether the model predicts the consequences of *the agent’s* chosen actions in the regions of state\-action space that matter for policy optimization\.

L4 \(Action controllability and counterfactual fidelity\) asks whether changing the action causes the *correct task\-relevant change*\. This is explicit in ACEM, PCM, WHALE, dWorldEval, WorldGym, IRASim, and RoboMaster \[[8](https://arxiv.org/html/2606.15032#bib.bib7),[7](https://arxiv.org/html/2606.15032#bib.bib8),[63](https://arxiv.org/html/2606.15032#bib.bib6),[34](https://arxiv.org/html/2606.15032#bib.bib30),[44](https://arxiv.org/html/2606.15032#bib.bib26),[66](https://arxiv.org/html/2606.15032#bib.bib31),[16](https://arxiv.org/html/2606.15032#bib.bib51)\]\. Typical evidence includes action\-effect tests, end\-effector or object trajectory accuracy, $\Delta$\-LPIPS, round\-trip consistency, or counterfactual OPE diagnostics\. L4 is the first level that clearly distinguishes a decision world model from a mere future\-video prior\.

For embodied decision\-making, L4 is, in our view, a strong candidate for the minimal genuinely interventional requirement\. If a model appears insensitive to actions or produces the wrong action\-dependent changes, then strong L0–L3 scores provide limited reassurance about decision usefulness\.

L5 \(Reward, value, and outcome fidelity\) asks whether the model predicts success, reward, progress, constraint violation, or value accurately enough for decision making\. This is explicit in DiWA, ProphRL, RISE, VLAW, GigaBrain, PlayWorld, and VLA\-MBPO \[[5](https://arxiv.org/html/2606.15032#bib.bib35),[62](https://arxiv.org/html/2606.15032#bib.bib38),[56](https://arxiv.org/html/2606.15032#bib.bib42),[21](https://arxiv.org/html/2606.15032#bib.bib43),[19](https://arxiv.org/html/2606.15032#bib.bib44),[60](https://arxiv.org/html/2606.15032#bib.bib47),[64](https://arxiv.org/html/2606.15032#bib.bib48)\]\. Metrics include reward accuracy, precision/recall/F1, success\-probability calibration, value error, or Kendall’s tau for progress ranking\.

L5 matters because a visually imperfect model can still be useful if it preserves the variables that determine reward\. Conversely, a photorealistic simulator that hallucinates success can be risky for policy optimization\. This is one reason to treat reward and value as first\-class evaluation targets rather than as downstream afterthoughts\.

L6 \(Policy evaluation and ranking\) asks whether model\-based evaluation agrees with real or simulator policy performance\. This is the focus of WorldGym, Vid2World, Scalable Policy Evaluation, dWorldEval, DreamDojo, Gemini/Veo, and Persistent Robot World Models \[[44](https://arxiv.org/html/2606.15032#bib.bib26),[25](https://arxiv.org/html/2606.15032#bib.bib27),[47](https://arxiv.org/html/2606.15032#bib.bib28),[34](https://arxiv.org/html/2606.15032#bib.bib30),[17](https://arxiv.org/html/2606.15032#bib.bib32),[18](https://arxiv.org/html/2606.15032#bib.bib29),[3](https://arxiv.org/html/2606.15032#bib.bib34)\]\. Typical metrics are Pearson or Spearman correlation, pairwise ranking accuracy, and MMRV\.

This is one of the strongest *fixed\-policy* criteria in the current literature\. It directly tests whether the model preserves the ordering of policies, which is often more decision\-relevant than image similarity\. But L6 is still weaker than L7, because an optimizer can drive the policy into parts of the space that were never tested by the fixed policy set\.

L7 \(Policy optimization and planning utility\) asks the ultimate pragmatic question: does using the model improve decisions? This includes model\-based planning, model\-based RL, executable video planning, and synthetic\-data\-driven gains\. DayDreamer, ADM\-v2, WMPO, DreamGen, VLP, EVA, and V\-JEPA 2 are representative examples \[[53](https://arxiv.org/html/2606.15032#bib.bib4),[37](https://arxiv.org/html/2606.15032#bib.bib9),[67](https://arxiv.org/html/2606.15032#bib.bib39),[26](https://arxiv.org/html/2606.15032#bib.bib13),[12](https://arxiv.org/html/2606.15032#bib.bib56),[50](https://arxiv.org/html/2606.15032#bib.bib55),[1](https://arxiv.org/html/2606.15032#bib.bib64)\]\.

L7 provides the most direct evidence for world models claimed for embodied decision\-making\. If using the model does not help policy evaluation, planning, or optimization, then the case for it as a decision world model is correspondingly limited\. At the same time, L7 is *entangled*: it depends not only on the world model, but also on the reward model, the optimizer, rollout horizon, uncertainty handling, and the surrounding data pipeline\. This is why L7 is usually easier to interpret when paired with L4–L6 decompositions rather than treated as a standalone number\.

The ladder suggests a useful distinction\.

- L0–L3 are best interpreted as diagnostic levels\. They evaluate the generated artifact: realism, fidelity, semantics, and physical plausibility\. They are useful and often necessary, especially for video\-based interfaces\.
- L4 can be viewed as an interventional threshold\. It asks whether the model actually responds correctly to actions\.
- L5–L7 provide the most direct evidence of decision use\. They test whether the model preserves outcomes, ranks policies, and improves decisions\.

State\-abstraction and latent\-representation diagnostics such as \[[48](https://arxiv.org/html/2606.15032#bib.bib67)\] cut across the ladder rather than forming an extra rung\. They matter chiefly because they help explain success or failure at L4–L7\.

The ladder is therefore better viewed as an evidential hierarchy than as a single score scale\. For decision\-making claims, lower\-level success is not a reliable substitute for higher\-level evidence, even though lower\-level metrics may still be very useful in practice and may correlate with higher\-level performance in some domains\.

## 5 A Decision\-Making\-Centric Evaluation Framework

The ladder clarifies what kinds of evidence exist\. We now turn from description to a more explicit proposal\. The points below are best read as *recommendations for models whose stated purpose is embodied decision\-making*, not as universal axioms that every world\-model paper must satisfy\.

#### A declared decision contract clarifies evaluation\.

There is no single use\-independent world\-model score\. In our view, evaluation becomes much easier to interpret once a *decision contract* is declared:

> *This model is a world model for task family $\mathcal{T}$, policy class $\Pi$, action interface $\mathcal{A}$, horizon $H$, and decision use $U$\.*

Without $\mathcal{T}$, $\Pi$, $H$, and $U$, it is difficult to know whether FVD, VLM\-based physics QA, policy ranking, or final task success is the most relevant primary metric\.

This observation explains a large part of the survey\. Many benchmark papers are perfectly reasonable once interpreted as L0–L3 contracts\. The problem arises when those results are generalized to stronger L6–L7 claims without additional evidence\.

#### Counterfactual action fidelity is often the first distinguishing requirement\.

A passive video predictor estimates likely futures under the data distribution\. A decision world model is more compelling when it can answer interventional queries\. It needs to distinguish

$$\mathbb{P}_{\mathcal{E}}(o_{t+1:t+H}\mid h_{t})$$

from

$$\mathbb{P}_{\mathcal{E}}(o_{t+1:t+H}\mid h_{t},do(a_{t:t+H-1})).$$

The second object is the one needed for planning and policy optimization\.

For a task\-relevant feature map $\Phi$, define the interventional prediction error

$$\mathrm{IPE}_{H,\Phi}(\widehat{\mathcal{E}})=\mathbb{E}_{(h,a_{0:H-1})\sim\mathcal{Q}}\left[d_{\Phi}\left(\Phi(\widehat{\tau}_{1:H}),\Phi(\tau_{1:H})\right)\right],$$

where $\widehat{\tau}\sim\widehat{\mathcal{E}}(\cdot\mid h,a_{0:H-1})$ and $\tau\sim\mathcal{E}(\cdot\mid h,do(a_{0:H-1}))$\. The feature map $\Phi$ may include object poses, end\-effector states, contact events, success predicates, safety constraints, or latent task variables, depending on the domain\.
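As a minimal sketch, assume the task features of the model rollout and of the real rollout have already been extracted as flat numeric vectors for each query in the intervention set. IPE can then be estimated as a sample mean over queries. The function names and the Euclidean choice of feature distance are illustrative assumptions, not something the definition above prescribes.

```python
from typing import Callable, Sequence, Tuple

Features = Sequence[float]


def l2_distance(pred: Features, real: Features) -> float:
    """Euclidean distance; one possible (assumed) choice for d_Phi."""
    if len(pred) != len(real):
        raise ValueError("feature vectors must have the same length")
    return sum((p - r) ** 2 for p, r in zip(pred, real)) ** 0.5


def interventional_prediction_error(
    pairs: Sequence[Tuple[Features, Features]],
    distance: Callable[[Features, Features], float] = l2_distance,
) -> float:
    """Sample-mean estimate of IPE.

    Each pair holds (Phi(tau_hat), Phi(tau)) for the same history and the
    same intervened action sequence drawn from the intervention set Q.
    """
    if not pairs:
        raise ValueError("need at least one intervention query")
    return sum(distance(pred, real) for pred, real in pairs) / len(pairs)
```

In a stochastic environment each query would typically be averaged over several sampled rollouts before entering this mean; the sketch leaves that to the caller.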

A complementary action\-effect metric compares two actions from the same history:

$$\Delta_{\mathcal{E}}=d_{\Phi}\left(\Phi(\tau^{a}),\Phi(\tau^{a^{\prime}})\right),\qquad\Delta_{\widehat{\mathcal{E}}}=d_{\Phi}\left(\Phi(\widehat{\tau}^{a}),\Phi(\widehat{\tau}^{a^{\prime}})\right).$$

A model that preserves the magnitude and ordering of action\-induced differences offers stronger evidence of decision relevance than one that merely predicts plausible futures\.
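One hedged way to summarize the action\-effect comparison is sketched below. It assumes the real and model action\-effect distances have already been computed for a set of matched action branches, and it reports two illustrative statistics: the mean absolute gap in magnitude and the share of branch pairs whose ordering is preserved. Both statistics and the function name are assumptions chosen for illustration.

```python
from typing import Sequence, Tuple


def action_effect_report(deltas: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Summarize (real_delta, model_delta) pairs, one per matched action branch.

    Returns (mean absolute magnitude gap, ordering agreement). Ordering
    agreement is the share of branch pairs whose real and model deltas are
    ordered the same way, among pairs whose real deltas differ.
    """
    if not deltas:
        raise ValueError("need at least one branch pair")
    mean_gap = sum(abs(model - real) for real, model in deltas) / len(deltas)
    concordant = 0
    comparable = 0
    for i in range(len(deltas)):
        for j in range(i + 1, len(deltas)):
            real_diff = deltas[i][0] - deltas[j][0]
            model_diff = deltas[i][1] - deltas[j][1]
            if real_diff != 0:
                comparable += 1
                if real_diff * model_diff > 0:
                    concordant += 1
    # With no comparable pairs the ordering statistic is undefined.
    agreement = concordant / comparable if comparable else float("nan")
    return mean_gap, agreement
```

A model that is insensitive to actions would show model deltas near zero regardless of the real deltas, which appears here as a large magnitude gap and low ordering agreement.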

#### Policy\-induced distribution shift is usually worth including\.

Behavior\-policy data and target\-policy rollouts generally come from different state\-action distributions\. Let $d_{\mathcal{E}}^{\pi}(s,a)$ denote the discounted occupancy measure of policy $\pi$ in the real environment, and $d_{\widehat{\mathcal{E}}}^{\pi}$ the analogous occupancy under the model\. A model may have low error under $d_{\mathcal{E}}^{\mu}$, where $\mu$ is the behavior policy, but high error under $d_{\mathcal{E}}^{\pi}$, where $\pi$ is the target or optimized policy\.

This is why policy\-conditioned models, counterfactual environment\-model learning, and full\-horizon dynamics models are especially relevant\[[8](https://arxiv.org/html/2606.15032#bib.bib7),[7](https://arxiv.org/html/2606.15032#bib.bib8),[37](https://arxiv.org/html/2606.15032#bib.bib9)\]\. A decision\-centric benchmark is often more informative when it includes target policies that differ from the data\-collection policies, ideally including policies produced by the model\-based optimizer itself\.

#### For many decision uses, full\-horizon outcome fidelity is more informative than short\-horizon reconstruction\.

One\-step or short\-horizon accuracy can be misleading\. Small local errors may compound, while visually large errors may be irrelevant to reward\. For many decision uses, it is therefore more informative to evaluate the world model in terms of full\-horizon task\-relevant outcomes:

$$\widehat{J}_{\widehat{\mathcal{E}}}(\pi)=\mathbb{E}_{\widehat{\tau}\sim\widehat{\mathcal{E}},\pi}\left[\sum_{t=0}^{H-1}\gamma^{t}\widehat{r}_{t}\right].$$

For a policy set $\Pi_{\mathrm{eval}}$, define full\-horizon value error

$$\mathrm{FVE}(\widehat{\mathcal{E}},\Pi_{\mathrm{eval}})=\frac{1}{|\Pi_{\mathrm{eval}}|}\sum_{\pi\in\Pi_{\mathrm{eval}}}\left|\widehat{J}_{\widehat{\mathcal{E}}}(\pi)-J_{\mathcal{E}}(\pi)\right|.$$

For sparse\-success tasks, success\-probability calibration is equally useful:

$$\widehat{p}_{\widehat{\mathcal{E}}}(\pi)=\mathbb{P}_{\widehat{\tau}\sim\widehat{\mathcal{E}},\pi}\left[\widehat{\tau}\ \mathrm{succeeds}\right],\qquad p_{\mathcal{E}}(\pi)=\mathbb{P}_{\tau\sim\mathcal{E},\pi}\left[\tau\ \mathrm{succeeds}\right].$$

The central point is simple: when the model is used to compare or optimize policies, it is often more important that it preserve what the policy is trying to optimize than that it preserve every visual detail\.
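The quantities above can be sketched in a few lines, assuming per\-step rewards have been logged for each rollout and that policies are identified by string names (both are illustrative assumptions). The helper names are hypothetical.

```python
from typing import Mapping, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma^t * r_t over one rollout of length H."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def estimate_value(rollouts: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of J(pi) from reward sequences of several rollouts."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    return sum(discounted_return(r, gamma) for r in rollouts) / len(rollouts)


def full_horizon_value_error(
    model_values: Mapping[str, float],
    real_values: Mapping[str, float],
) -> float:
    """Mean absolute gap between model-estimated and real values over a policy set."""
    if not real_values or set(model_values) != set(real_values):
        raise ValueError("both mappings must cover the same non-empty policy set")
    gaps = (abs(model_values[name] - real_values[name]) for name in real_values)
    return sum(gaps) / len(real_values)
```

The same `full_horizon_value_error` helper can be reused for success probabilities by passing per\-policy success rates instead of values, which gives a simple per\-policy calibration gap.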

#### Closed\-loop rollout and occupancy fidelity can be more informative than teacher\-forced prediction\.

A world model used by a policy is often more meaningfully evaluated *closed\-loop*: the policy acts on model\-generated histories, not only on teacher\-forced ground\-truth prefixes\. One useful target is occupancy mismatch:

$$D_{\mathrm{occ}}(\widehat{\mathcal{E}},\pi)=\sum_{t=0}^{H}\gamma^{t}D\!\left(d_{\mathcal{E},t}^{\pi},d_{\widehat{\mathcal{E}},t}^{\pi}\right),$$

where $D$ may be MMD, Wasserstein distance, KL divergence, total variation, or a task\-specific discrepancy in feature space\. In many real\-robot settings exact densities are unavailable, so practical approximations can use embedded trajectory distributions, object\-pose histograms, or learned task\-state features\.

This is a key distinction from many open\-loop video evaluations\. A world model that remains stable only while conditioned on ground\-truth context may still be useful for some applications, but it offers weaker evidence as a closed\-loop simulator\.
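A minimal sketch of the occupancy\-mismatch sum follows, using total variation between empirical histograms as the per\-step discrepancy. It assumes task states have been discretized into hashable labels (for example, binned object poses) and that samples are grouped by time step; the discretization and the function names are assumptions for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def total_variation(samples_p: Sequence[Hashable], samples_q: Sequence[Hashable]) -> float:
    """Total variation distance between two empirical distributions."""
    if not samples_p or not samples_q:
        raise ValueError("both sample sets must be non-empty")
    counts_p, counts_q = Counter(samples_p), Counter(samples_q)
    support = set(counts_p) | set(counts_q)
    return 0.5 * sum(
        abs(counts_p[x] / len(samples_p) - counts_q[x] / len(samples_q))
        for x in support
    )


def occupancy_mismatch(
    real_by_step: Sequence[Sequence[Hashable]],
    model_by_step: Sequence[Sequence[Hashable]],
    gamma: float,
) -> float:
    """Discounted sum over time steps of the per-step discrepancy.

    real_by_step[t] and model_by_step[t] hold the discretized states visited
    at step t across closed-loop rollouts of the same policy.
    """
    if len(real_by_step) != len(model_by_step):
        raise ValueError("real and model rollouts must cover the same steps")
    return sum(
        (gamma ** t) * total_variation(real, model)
        for t, (real, model) in enumerate(zip(real_by_step, model_by_step))
    )
```

Histogram total variation is only sensible for low\-dimensional discretized features; in richer feature spaces one of the other discrepancies listed above would be the more natural substitute.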

#### Policy ranking can often be measured directly\.

For policy evaluation, exact value may be less important than choosing the better policy\. Let $\Pi_{\mathrm{eval}}=\{\pi_{1},\ldots,\pi_{n}\}$\. Pairwise ranking accuracy is

$$\mathrm{PRA}=\frac{1}{n(n-1)}\sum_{i\neq j}\mathbf{1}\left[\left(\widehat{J}_{\widehat{\mathcal{E}}}(\pi_{i})-\widehat{J}_{\widehat{\mathcal{E}}}(\pi_{j})\right)\left(J_{\mathcal{E}}(\pi_{i})-J_{\mathcal{E}}(\pi_{j})\right)>0\right].$$

Mean maximum rank violation is

$$\mathrm{MMRV}=\frac{1}{n}\sum_{i=1}^{n}\max_{j:J_{\mathcal{E}}(\pi_{i})>J_{\mathcal{E}}(\pi_{j})}\left[\operatorname{rank}_{\widehat{\mathcal{E}}}(\pi_{i})-\operatorname{rank}_{\widehat{\mathcal{E}}}(\pi_{j})\right]_{+}.$$

These are exactly the kinds of quantities emphasized in the emerging policy\-evaluation literature \[[44](https://arxiv.org/html/2606.15032#bib.bib26),[47](https://arxiv.org/html/2606.15032#bib.bib28),[34](https://arxiv.org/html/2606.15032#bib.bib30)\]\.
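Both ranking metrics can be sketched directly from their definitions. The sketch below makes two assumptions that the formulas leave implicit: model rank 1 is given to the policy with the highest model\-estimated value (ties broken by list order), and the inner maximum is taken to be 0 when no policy is truly worse than policy $i$. Function names are hypothetical.

```python
from typing import List, Sequence


def pairwise_ranking_accuracy(model_values: Sequence[float], real_values: Sequence[float]) -> float:
    """Share of ordered pairs (i, j), i != j, that model and reality order the same way."""
    n = len(real_values)
    if n < 2 or len(model_values) != n:
        raise ValueError("need matching value lists for at least two policies")
    agree = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                model_diff = model_values[i] - model_values[j]
                real_diff = real_values[i] - real_values[j]
                if model_diff * real_diff > 0:
                    agree += 1
    return agree / (n * (n - 1))


def model_ranks(model_values: Sequence[float]) -> List[int]:
    """Rank 1 = highest model-estimated value (assumed convention)."""
    order = sorted(range(len(model_values)), key=lambda i: -model_values[i])
    ranks = [0] * len(model_values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def mean_max_rank_violation(model_values: Sequence[float], real_values: Sequence[float]) -> float:
    """Average, over policies i, of the worst rank inversion against truly worse policies j."""
    n = len(real_values)
    if n == 0 or len(model_values) != n:
        raise ValueError("need matching non-empty value lists")
    ranks = model_ranks(model_values)
    total = 0
    for i in range(n):
        worst = 0  # positive part; also the value when no j qualifies
        for j in range(n):
            if real_values[i] > real_values[j]:
                worst = max(worst, ranks[i] - ranks[j])
        total += worst
    return total / n
```

Under these conventions a model that reproduces the true ordering scores a PRA of 1 and an MMRV of 0, and larger MMRV values indicate that some truly better policy is ranked far below a truly worse one.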

#### Optimization utility is easier to interpret alongside exploitability tests\.

If a fixed optimizer $\mathcal{A}$ uses the world model to produce

$$\widehat{\pi}_{\widehat{\mathcal{E}}}=\mathcal{A}(\widehat{\mathcal{E}},\mathcal{D},B),$$

then the primary system\-level quantity is policy lift:

$$\mathrm{Lift}=J_{\mathcal{E}}(\widehat{\pi}_{\widehat{\mathcal{E}}})-J_{\mathcal{E}}(\pi_{\mathrm{base}}).$$

When an oracle or strong reference $\pi^{\star}$ is available, define optimization regret:

$$\mathrm{OptRegret}=J_{\mathcal{E}}(\pi^{\star})-J_{\mathcal{E}}(\widehat{\pi}_{\widehat{\mathcal{E}}}).$$

A complementary exploitability metric is

$$\mathrm{XGap}=\widehat{J}_{\widehat{\mathcal{E}}}(\widehat{\pi}_{\widehat{\mathcal{E}}})-J_{\mathcal{E}}(\widehat{\pi}_{\widehat{\mathcal{E}}}).$$

A large positive exploitability gap indicates that the optimizer found trajectories that look good to the model but fail in the environment\. This is precisely the kind of failure mode that purely perceptual or fixed\-policy metrics can miss\.
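The three quantities reduce to simple differences once the relevant values have been estimated. The sketch below assumes those estimates are already available as floats; the dataclass and argument names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OptimizationReport:
    lift: float                  # real value of optimized policy minus baseline
    xgap: float                  # model-predicted minus real value of optimized policy
    opt_regret: Optional[float]  # reference minus optimized, if a reference exists


def optimization_report(
    real_value_optimized: float,
    real_value_baseline: float,
    model_value_optimized: float,
    real_value_reference: Optional[float] = None,
) -> OptimizationReport:
    """Compute Lift, XGap, and (when a reference policy is available) OptRegret."""
    regret = None
    if real_value_reference is not None:
        regret = real_value_reference - real_value_optimized
    return OptimizationReport(
        lift=real_value_optimized - real_value_baseline,
        xgap=model_value_optimized - real_value_optimized,
        opt_regret=regret,
    )
```

A report with negative lift and a large positive exploitability gap is the signature of an optimizer that has exploited model error rather than improved the policy.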

#### Uncertainty and abstention can materially affect evaluation\.

A world model used for optimization is easier to trust if it can indicate when it is unreliable\. Let $u_{\widehat{\mathcal{E}}}(h,a_{0:H-1})$ denote an uncertainty score\. A useful uncertainty measure is ideally calibrated with respect to outcome or value error:

$$\mathbb{P}\left(\left|\widehat{J}_{\widehat{\mathcal{E}}}(\pi)-J_{\mathcal{E}}(\pi)\right|\leq\epsilon\;\middle|\;u_{\widehat{\mathcal{E}}}(\pi)\leq\alpha\right)\approx 1-\delta.$$

In practice, this can be reported through risk\-coverage curves, error\-vs\.\-uncertainty correlation, abstention performance, pessimistic\-planning performance, Brier score, and ECE\.

This matters because offline or model\-based optimizers may actively seek overconfident errors\. WHALE’s emphasis on generalization and uncertainty is therefore not merely a side issue; it is highly relevant for decision\-centric evaluation\[[63](https://arxiv.org/html/2606.15032#bib.bib6)\]\.
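As a hedged illustration, the sketch below computes a risk\-coverage curve and an empirical estimate of the conditional probability in the calibration condition. It assumes one scalar uncertainty score and one absolute value error per evaluated rollout or policy; the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def risk_coverage_curve(
    uncertainties: Sequence[float], errors: Sequence[float]
) -> List[Tuple[float, float]]:
    """(coverage, mean error) points, keeping the most confident items first."""
    n = len(errors)
    if n == 0 or len(uncertainties) != n:
        raise ValueError("need matching non-empty sequences")
    order = sorted(range(n), key=lambda i: uncertainties[i])
    curve = []
    running_error = 0.0
    for kept, index in enumerate(order, start=1):
        running_error += errors[index]
        curve.append((kept / n, running_error / kept))
    return curve


def empirical_coverage(
    uncertainties: Sequence[float],
    errors: Sequence[float],
    alpha: float,
    epsilon: float,
) -> Optional[float]:
    """Share of items with uncertainty <= alpha whose error is <= epsilon.

    Returns None when no item passes the uncertainty threshold.
    """
    kept = [e for u, e in zip(uncertainties, errors) if u <= alpha]
    if not kept:
        return None
    return sum(1 for e in kept if e <= epsilon) / len(kept)
```

If the uncertainty score is informative, mean error on the curve should rise as coverage increases; a flat or decreasing curve suggests the score is not tracking value error.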

#### We recommend not letting lower levels compensate for higher\-level failure\.

For decision\-making claims, we recommend caution when aggregating lower\- and higher\-level metrics into a single score\. A model should not rank highly as a decision world model merely because it has good FVD or VLM\-based physical plausibility if it performs poorly at action controllability, reward fidelity, or policy ranking\. This recommendation is motivated not by hostility to lower\-level metrics, but by the observation that they can otherwise dominate the score while answering a different question\.

## 6 A Benchmark Protocol

The framework above says *what* may be worth measuring\. This section turns it into a more operational benchmark protocol\. The protocol is best read as a *modular template*, not as a claim that every benchmark release in every domain needs every component at full scale\.

### Step 0: Declare the world\-model contract

We recommend that each submission declare the decision contract in Table [8](https://arxiv.org/html/2606.15032#S6.T8)\. This makes explicit what the model claims to support\.

| Field | Suggested declaration |
| --- | --- |
| Task family | Environments, embodiments, observation modalities, task definitions, reward/success specification\. |
| Policy class | BC, VLA, diffusion policy, MPC planner, RL policy, human teleoperator, or other deployment policy\. |
| Action interface | Joint control, end\-effector commands, action chunks, trajectories, language actions, latent actions\. |
| Decision use | Prediction, policy evaluation, planning, policy optimization, safety testing, synthetic data, or representation\. |
| Task\-relevant feature map $\Phi$ | The variables used to judge action effects and outcomes: object poses, contacts, safety predicates, progress variables, latent task state, etc\. |
| Horizon | One\-step, short\-horizon, full\-episode, or planning horizon\. |
| Deployment regime | Offline, online, simulator, real robot, real\-to\-sim, or mixed\. |
| Allowed supervision | Pixels, states, rewards, language, videos, actions, demonstrations, human labels, VLM labels\. |
| Uncertainty interface | Ensemble variance, confidence, likelihood, pessimistic bound, or none\. |

Table 8: A world\-model claim becomes much easier to interpret once a decision contract is declared\.

### Step 1: Construct policy and intervention splits

A decision\-centric benchmark is often more informative if it does not evaluate only behavior\-policy rollouts\. One useful decomposition is into four policy sets:

1. $\Pi_{\mathrm{beh}}$: the behavior policies that generated the training data;
2. $\Pi_{\mathrm{anchor}}$: fixed anchor policies spanning weak, medium, and strong performance;
3. $\Pi_{\mathrm{shift}}$: target policies that induce distribution shift relative to the data;
4. $\Pi_{\mathrm{opt}}$: policies produced by optimizing inside the world model\.

The benchmark may also include an intervention set $\mathcal{Q}$ of matched histories and action branches\. In simulation, this can be done exactly by resetting to the same state\. On real robots, one may approximate this with controlled tabletop resets, real\-to\-sim reconstruction, matched trajectory segments, or carefully designed branchable tasks\.
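For the simulated case, building the intervention set amounts to rolling out several action branches from the same start state. The sketch below assumes a deterministic simulator exposed as a pure `step(state, action)` function, so that "resetting" is just reusing the start state; this interface and the function name are illustrative assumptions, and real simulators usually need an explicit state save/restore call instead.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")
Branch = Tuple[State, Tuple[Action, ...], Tuple[State, ...]]


def build_intervention_set(
    step: Callable[[State, Action], State],
    start_states: Sequence[State],
    action_branches: Sequence[Sequence[Action]],
) -> List[Branch]:
    """Roll out every action branch from every start state.

    Each entry is (start state, action sequence, resulting state trajectory),
    so branches that share a start state are exactly matched.
    """
    branches: List[Branch] = []
    for start in start_states:
        for actions in action_branches:
            state = start
            trajectory = []
            for action in actions:
                state = step(state, action)
                trajectory.append(state)
            branches.append((start, tuple(actions), tuple(trajectory)))
    return branches
```

The resulting real trajectories can then be paired with model rollouts conditioned on the same start state and actions to compute interventional error.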

### Step 2: Report L0–L3 diagnostics as diagnostics

Open\-loop diagnostic reporting remains useful, especially for video\-based models:

- L0: visual realism, aesthetics, image quality, human preference;
- L1: MSE, PSNR, SSIM, LPIPS, DreamSim, latent L2, temporal metrics;
- L2: instruction\-following, caption similarity, semantic alignment, VLM task completion;
- L3: object permanence, non\-penetration, contact, depth, 3D consistency, physical QA\.

These metrics are scientifically useful for debugging failure modes and for understanding interfaces\. But for a decision benchmark they are often best reported as auxiliary diagnostics rather than as the primary score\.

### Step 3: Evaluate interventional action fidelity

For each $(h,a_{0:H-1})\in\mathcal{Q}$, one can evaluate interventional prediction error $\mathrm{IPE}_{H,\Phi}$ and action\-effect agreement\. Useful components include:

- matched\-action branching from the same history;
- single\-dimension action sweeps where meaningful;
- round\-trip or reversibility tests when applicable;
- robot/object trajectory fidelity for action\-conditioned models;
- contact/event prediction for manipulation tasks\.

This step is the operational realization of L4\. If a model performs poorly here, we would hesitate to regard it as strong evidence of a decision world model, regardless of its L0–L3 scores\.

### Step 4: Evaluate closed\-loop fixed\-policy rollouts

For policies in $\Pi_{\mathrm{anchor}}\cup\Pi_{\mathrm{shift}}$, one can roll them out both in the real environment and inside the world model\. Useful outputs include:

- closed\-loop rollout fidelity or occupancy mismatch;
- full\-horizon value error;
- success\-probability calibration;
- reward/progress prediction;
- Pearson/Spearman correlation, pairwise ranking accuracy, and MMRV\.

This step operationalizes L5 and L6\. In our view, it is especially informative when the rollouts are *closed\-loop*; teacher\-forced prefix conditioning alone is often insufficient\.

### Step 5: Evaluate policy optimization under a fixed budget

A benchmark for decision world models may also include a model\-based optimization challenge\. Fix an optimizer $\mathcal{A}$, a data regime, a compute budget, and if applicable an interaction budget\. Depending on the declared use, $\mathcal{A}$ may be MPC, CEM, imagined RL, synthetic\-data filtering, or another planner/optimizer\.

The benchmark can then evaluate the optimized policy in the real environment or a trusted simulator, reporting:

- policy lift relative to a fixed baseline;
- optimization regret relative to a strong reference when available;
- sample efficiency and compute cost;
- safe\-improvement probability and constraint\-violation rate\.

This is the operational realization of L7\. It provides some of the strongest evidence that the world model actually supports better decisions, though it remains an end\-to\-end system metric\.

### Step 6: Adversarial exploitability and uncertainty

A stronger decision benchmark may also test the model under *adversarial use*\. Search for action sequences or policies that maximize predicted value under the model, subject to action constraints and safety filters\. Execute a safe subset in the real environment or a trusted simulator\. Report the exploitability gap and failure rate\.

If the model provides uncertainty, one can also evaluate whether abstaining on high\-uncertainty rollouts improves calibration and optimization safety\. A model with calibrated abstention may be more useful than one with superficially better raw prediction but miscalibrated confidence\.

### Step 7: Hidden tasks, held\-out policies, and statistical reporting

To reduce benchmark overfitting, it is useful to:

- keep some tasks or objects hidden until final evaluation;
- evaluate on held\-out policy families, not just small perturbations of the same policy class;
- report confidence intervals over tasks, seeds, initial states, and policy sets;
- publish per\-task metrics, not only averages\.
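For the confidence\-interval item, one common option is a percentile bootstrap over per\-task or per\-seed scores. The sketch below is one such option under stated assumptions (independent scores, a mean statistic, a fixed seed for reproducibility); the protocol itself does not prescribe a particular interval method, and the function name is hypothetical.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_mean_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("need at least one value")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = min(n_resamples - 1, int((1 + confidence) / 2 * n_resamples))
    return means[lower_index], means[upper_index]
```

When scores are grouped (for example, several seeds per task), resampling at the task level rather than over individual scores is usually the safer choice, since scores within a task are rarely independent.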

### Step 8: Report a decision\-utility profile, not only a scalar

We recommend reporting a profile

$$(S_{0},S_{1},S_{2},S_{3},S_{4},S_{5},S_{6},S_{7})$$

rather than only a single averaged number\. If a leaderboard requires a scalar, one option is to use gates:

$$S_{\mathrm{DC}}=G_{4}G_{5}G_{6}\left(w_{4}S_{4}+w_{5}S_{5}+w_{6}S_{6}+w_{7}S_{7}\right),$$

where $G_{4},G_{5},G_{6}\in\{0,1\}$ are pass/fail gates for action controllability, outcome fidelity, and policy\-evaluation validity\. For policy\-optimization claims, an additional gate at L7 may also be appropriate\. L0–L3 can still be reported, but we would avoid letting them compensate for failure at L4–L7\.
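A small sketch of the gated scalar follows. The formula only says that the gates are pass/fail; implementing each gate as "score at or above a minimum threshold" is an assumption made here for illustration, as are the dictionary\-based interface and the function name.

```python
from typing import Mapping

GATED_LEVELS = (4, 5, 6)
SCORED_LEVELS = (4, 5, 6, 7)


def gated_decision_score(
    scores: Mapping[int, float],
    weights: Mapping[int, float],
    gate_thresholds: Mapping[int, float],
) -> float:
    """Weighted sum of the L4-L7 scores, zeroed if any of the L4-L6 gates fails.

    A gate passes when the level's score meets its threshold (assumed rule).
    """
    all_gates_pass = all(scores[k] >= gate_thresholds[k] for k in GATED_LEVELS)
    if not all_gates_pass:
        return 0.0
    return sum(weights[k] * scores[k] for k in SCORED_LEVELS)
```

Because L0–L3 scores never enter the sum, strong perceptual metrics cannot offset a failed gate, which is the behavior the gating is meant to enforce.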

### The world\-model evaluation card

We recommend that every paper claiming a decision world model include an evaluation card such as Table [9](https://arxiv.org/html/2606.15032#S6.T9)\.

| Question | What to report |
| --- | --- |
| What is the claimed use? | Prediction, policy evaluation, planning, policy optimization, safety testing, synthetic data, or representation\. |
| What ladder levels are actually evaluated? | Explicitly state whether the paper provides evidence at L0, L1, …, L7\. |
| What is the action interface? | Low\-level control, joint action, end\-effector command, action chunk, trajectory, language action, or latent action\. |
| What policy class is being evaluated or improved? | BC, VLA, diffusion policy, MPC planner, RL policy, human teleoperator, or optimized policy\. |
| What distribution shift is tested? | Behavior\-to\-target, task, object, environment, embodiment, visual, or optimization\-induced shift\. |
| What horizon is evaluated? | One\-step, short\-horizon, full\-episode, or planning horizon, including degradation curves when applicable\. |
| What counterfactual data is used? | Reset\-matched branches, simulator interventions, action sweeps, round\-trip tests, adversarial policies, or none\. |
| Are rewards and outcomes evaluated directly? | Reward accuracy, value error, success calibration, progress ranking, safety\-constraint prediction\. |
| Does model\-based policy evaluation match reality? | Pearson/Spearman correlation, pairwise ranking accuracy, MMRV, calibration, confidence intervals\. |
| Does model\-based optimization improve reality? | Policy lift, optimization regret, sample efficiency, exploitability gap, safe\-improvement probability\. |
| Can the model be exploited? | Predicted\-vs\.\-real gap for policies or action sequences optimized against the model\. |
| Is uncertainty calibrated? | Risk\-coverage curves, error\-vs\.\-uncertainty correlation, abstention performance, pessimistic\-planning results\. |

Table 9: A suggested reporting card for decision\-making\-centric world\-model evaluation\.

## 7 Discussion and Conclusion

### 7\.1 Why this distinction matters

If the community ranks world models primarily by L0–L3, it may optimize for the wrong target\. We may get increasingly beautiful videos that remain strategically unreliable\. In robotics, autonomous driving, and embodied agents, this is not a cosmetic problem\. A misleading world model can contribute to unsafe policy updates, misrank candidate policies, or create false confidence about robustness under distribution shift\.

We do not argue that video metrics are useless\. They are useful for debugging, visualization, synthetic\-data filtering, and human interpretability\. They are often necessary for video\-based interfaces\. The concern is treating them as if they were final evidence of decision usefulness\.

The same is true of VLM judges\. They are becoming central because they provide scalable semantic and physical evaluation \[[43](https://arxiv.org/html/2606.15032#bib.bib11),[32](https://arxiv.org/html/2606.15032#bib.bib14),[42](https://arxiv.org/html/2606.15032#bib.bib15),[45](https://arxiv.org/html/2606.15032#bib.bib22)\]\. But they should not be treated as ground truth for policy utility\. A VLM may reward plausible\-looking outcomes while missing subtle action\-relevant errors\. VLM\-based judgments are therefore easier to interpret when validated against executable outcomes, policy rankings, and real or trusted\-sim performance\.

### 7\.2 Common objections and scope conditions

#### Objection 1: not every world model is for control\.

We agree\. Some systems are best understood as future\-video predictors, world generators, or representation learners\. Our argument is conditional: when the stated use is embodied decision\-making, action\-, outcome\-, and policy\-level evidence becomes especially informative\.

#### Objection 2: lower\-level metrics may correlate with higher\-level utility\.

This can absolutely happen\. In some domains, better L0–L3 performance may be a very good proxy for L6–L7 performance\. Our objection is not to using proxies when they are empirically validated; it is to assuming that such correlations hold a priori\.

#### Objection 3: end\-to\-end policy success may already be enough\.

For some engineering purposes, that may be true\. If the question is simply which system performs better under a fixed stack, end\-to\-end success is compelling\. Our interest here is more scientific: when a paper makes a claim about a *world model*, it is helpful to know whether the gain comes from better counterfactual modeling, better reward modeling, better filtering, or some other component\.

#### Objection 4: full interventional evaluation is impractical\.

This is also true, especially on real robots\. Our protocol is therefore best read as modular and aspirational\. Exact reset\-matched branches may be available only in simulation; partial approximations may still be useful in real\-world domains\.

### 7\.3 Limitations and open problems

#### Resettable counterfactual data is hard\.

Exact interventional evaluation requires resetting the environment to the same state and executing different actions\. This is easy in simulators, difficult on real robots, and sometimes impossible in open\-world settings\. Real\-to\-sim reconstruction, branchable tabletop tasks, and matched trajectory segments are useful approximations but not perfect\.

#### Task\-relevant feature maps are domain dependent\.

The feature map $\Phi$ used in interventional error cannot be universal\. Manipulation needs object poses and contacts; driving needs lane position and safety constraints; navigation needs map and goal progress; games need latent state variables\. This is not a flaw of the framework\. It is a consequence of taking intended use seriously\.

#### Optimization benchmarks can themselves be gamed\.

If the optimization challenge is public and static, methods may overfit it\. Hidden tasks, hidden policies, held\-out embodiments, and adversarial exploitability tests are therefore valuable\.

#### Latent models need special treatment\.

Latent predictive models should not be penalized for lacking photorealistic decoders\. They are often better judged by planning, probing, outcome prediction, and state\-abstraction diagnostics in their own representational space\.

#### Safety requires worst\-case evaluation\.

Average policy lift is not enough for safety\-critical domains, since a policy can improve on average while failing badly in rare episodes\. Constraint\-violation rates, uncertainty calibration, adversarial stress tests, and worst\-case failures are likely to matter\.
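A short sketch of why average lift and tail behavior can disagree\. The per\-episode numbers are made up, and the use of conditional value at risk \(the mean of the worst $\alpha$ fraction of returns\) as the tail statistic is an assumption for illustration, not a metric prescribed by this paper\.

```python
import math
from typing import Sequence


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def cvar(returns: Sequence[float], alpha: float) -> float:
    # Mean of the worst alpha-fraction of episode returns (lower tail).
    k = max(1, math.ceil(alpha * len(returns)))
    return mean(sorted(returns)[:k])


def violation_rate(violations: Sequence[bool]) -> float:
    # Fraction of episodes in which a safety constraint was violated.
    return sum(violations) / len(violations)


# Hypothetical per-episode results; the numbers are invented for illustration.
baseline_returns = [1.0] * 10
optimized_returns = [2.0] * 9 + [-5.0]
baseline_violations = [False] * 10
optimized_violations = [False] * 9 + [True]

mean_lift = mean(optimized_returns) - mean(baseline_returns)
cvar_base = cvar(baseline_returns, alpha=0.2)
cvar_opt = cvar(optimized_returns, alpha=0.2)
viol_base = violation_rate(baseline_violations)
viol_opt = violation_rate(optimized_violations)

print(f"mean lift      : {mean_lift:+.2f}")
print(f"CVaR (20%)     : {cvar_base:.2f} -> {cvar_opt:.2f}")
print(f"violation rate : {viol_base:.2f} -> {viol_opt:.2f}")
```

Here the optimized policy gains about 0\.3 in average return, yet the mean of its worst 20% of episodes falls from 1\.0 to \-1\.5 and its violation rate rises from 0 to 0\.1, so reporting the average alone would hide the regression\.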

### 7\.4 Conclusion

The central question is not simply “Can the model generate a realistic future video?” but “In what sense does the model support better decisions?” For embodied decision\-making, we argue that the most informative evidence comes from whether the model preserves the counterfactual, long\-horizon, reward\-relevant structure needed for policy evaluation, planning, and policy optimization\. Visual fidelity, semantic alignment, and physical plausibility remain valuable diagnostics, but they do not by themselves settle that stronger claim\.

The survey in this paper suggests that the field has been evaluating several different objects under one name\. The L0–L7 ladder helps separate these objects\. The proposed decision\-making\-centric framework and benchmark protocol then make explicit what kinds of evidence are most directly relevant to stronger decision\-making claims\. Our position can therefore be stated as follows:

> *For models whose stated purpose is embodied decision\-making, the strongest evidence for the label “world model” is that they enable reliable counterfactual evaluation and, in favorable cases, improvement of policies under intervention and distribution shift\. Other evaluations remain useful, but they play a more auxiliary role for that particular claim\.*

## References

- \[1\] M\. Assran, A\. Bardes, D\. Fan, Q\. Garrido, R\. Howes, M\. Komeili, M\. Muckley, A\. Rizvi, C\. Roberts, K\. Sinha, A\. Zholus, S\. Arnaud, A\. Gejji, A\. Martin, F\. R\. Hogan, D\. Dugas, P\. Bojanowski, V\. Khalidov, P\. Labatut, F\. Massa, M\. Szafraniec, K\. Krishnakumar, Y\. Li, X\. Ma, S\. Chandar, F\. Meier, Y\. LeCun, M\. Rabbat, and N\. Ballas \(2025\) V\-JEPA 2: self\-supervised video models enable understanding, prediction and planning\. arXiv:2506\.09985\. [Link](https://arxiv.org/abs/2506.09985)
- \[2\] L\. Barcellona, A\. Zadaianchuk, D\. Allegro, S\. Papa, S\. Ghidoni, and E\. Gavves \(2025\) Dream to manipulate: compositional world models empowering robot imitation learning with imagination\. In International Conference on Learning Representations\. [Link](https://arxiv.org/abs/2412.14957)
- \[3\] J\. Bardhan, P\. Drozdik, J\. Sivic, and V\. Petrik \(2026\) Persistent robot world models: stabilizing multi\-step rollouts via reinforcement learning\. arXiv:2603\.25685\. [Link](https://arxiv.org/abs/2603.25685)
- \[4\] F\. Bordes, Q\. Garrido, J\. T\. Kao, A\. Williams, M\. Rabbat, and E\. Dupoux \(2025\) IntPhys 2: benchmarking intuitive physics understanding in complex synthetic environments\. arXiv:2506\.09849\. [Link](https://arxiv.org/abs/2506.09849)
- \[5\] A\. L\. Chandra, I\. Nematollahi, C\. Huang, T\. Welschehold, W\. Burgard, and A\. Valada \(2025\) DiWA: diffusion policy adaptation with world models\. arXiv:2508\.03645\. [Link](https://arxiv.org/abs/2508.03645)
- \[6\] D\. Chen, W\. Chung, Y\. Bang, Z\. Ji, and P\. Fung \(2025\) WorldPrediction: a benchmark for high\-level world modeling and long\-horizon procedural planning\. In ICML World Models Workshop\. [Link](https://openreview.net/forum?id=3GuGN0bacr)
- \[7\] R\. Chen, X\. Chen, Y\. Sun, S\. Xiao, M\. Li, and Y\. Yu \(2024\) Policy\-conditioned environment models are more generalizable\. In International Conference on Machine Learning\. [Link](https://openreview.net/forum?id=g9mYBdooPA)
- \[8\] X\. Chen, Y\. Yu, Z\. Zhu, Z\. Yu, Z\. Chen, C\. Wang, Y\. Wu, H\. Wu, R\. Qin, R\. Ding, and F\. Huang \(2023\) Adversarial counterfactual environment model learning\. In Advances in Neural Information Processing Systems\. [Link](https://openreview.net/forum?id=rHAX0LRwk8)
- \[9\] Y\. Chen, R\. Chen, D\. Huo, Y\. Yang, D\. Qi, H\. Liu, T\. Lin, S\. Zeng, J\. Xiao, X\. Chang, F\. Xiong, X\. Wei, Z\. Ma, and M\. Xu \(2026\) ABot\-PhysWorld: interactive world foundation model for robotic manipulation with physics alignment\. arXiv:2603\.23376\. [Link](https://arxiv.org/abs/2603.23376)
- \[10\] X\. Chi, H\. Zhang, C\. Fan, X\. Qi, R\. Zhang, A\. Chen, C\. Chan, W\. Xue, W\. Luo, S\. Zhang, and Y\. Guo \(2024\) EVA: an embodied world model for future video anticipation\. arXiv:2410\.15461\. [Link](https://arxiv.org/abs/2410.15461)
- \[11\] Y\. Deng, Z\. Pan, H\. Zhang, X\. Li, R\. Hu, Y\. Ding, Y\. Zou, Y\. Zeng, and D\. Zhou \(2026\) Rethinking video generation model for the embodied world\. arXiv:2601\.15282\. [Link](https://arxiv.org/abs/2601.15282)
- \[12\] Y\. Du, M\. Yang, P\. Florence, F\. Xia, A\. Wahid, B\. Ichter, P\. Sermanet, T\. Yu, P\. Abbeel, J\. B\. Tenenbaum, L\. Kaelbling, A\. Zeng, and J\. Tompson \(2024\) Video language planning\. In International Conference on Learning Representations\. [Link](https://arxiv.org/abs/2310.10625)
- \[13\] H\. Duan, H\. Yu, S\. Chen, F\. Li, and J\. Wu \(2025\) WorldScore: a unified evaluation benchmark for world generation\. In International Conference on Computer Vision\. [Link](https://arxiv.org/abs/2504.00983)
- \[14\] C\. Fan, X\. Chi, X\. Ju, H\. Li, Y\. Bao, Y\. Wang, L\. Chen, Z\. Jiang, K\. Ge, Y\. Li, W\. Mi, Q\. Wuwu, P\. Jia, Y\. Luo, K\. Zhang, Z\. Qin, Y\. Dai, S\. Han, Y\. Guo, S\. Zhang, and J\. Tang \(2026\) Wow, wo, val\! a comprehensive embodied world model evaluation turing test\. arXiv:2601\.04137\. [Link](https://arxiv.org/abs/2601.04137)
- \[15\] A\. Foss, C\. Evans, S\. Mitts, K\. Sinha, A\. Rizvi, and J\. T\. Kao \(2025\) CausalVQA: a physically grounded causal reasoning benchmark for video models\. arXiv:2506\.09943\. [Link](https://arxiv.org/abs/2506.09943)
- \[16\] X\. Fu, X\. Wang, X\. Liu, J\. Bai, R\. Xu, P\. Wan, D\. Zhang, and D\. Lin \(2026\) Learning video generation for robotic manipulation with collaborative trajectory control\. In International Conference on Learning Representations\. [Link](https://openreview.net/forum?id=OeDwYtp8n1)
- \[17\] S\. Gao, W\. Liang, K\. Zheng, A\. Malik, S\. Ye, S\. Yu, W\. Tseng, Y\. Dong, K\. Mo, C\. Lin, Q\. Ma, S\. Nah, L\. Magne, J\. Xiang, Y\. Xie, R\. Zheng, D\. Niu, Y\. L\. Tan, K\. R\. Zentner, G\. Kurian, S\. Indupuru, P\. Jannaty, J\. Gu, J\. Zhang, J\. Malik, P\. Abbeel, M\. Liu, Y\. Zhu, J\. Jang, and L\. Fan \(2026\) DreamDojo: a generalist robot world model from large\-scale human videos\. arXiv:2602\.06949\. [Link](https://arxiv.org/abs/2602.06949)
- \[18\] Gemini Robotics Team, K\. Choromanski, C\. Devin, Y\. Du, D\. Dwibedi, R\. Gao, A\. Jindal, T\. Kipf, S\. Kirmani, I\. Leal, F\. Liu, A\. Majumdar, A\. Marmon, C\. Parada, Y\. Rubanova, D\. Shah, V\. Sindhwani, J\. Tan, F\. Xia, T\. Xiao, S\. Yang, W\. Yu, and A\. Zhou \(2025\) Evaluating gemini robotics policies in a veo world simulator\. arXiv:2512\.10675\. [Link](https://arxiv.org/abs/2512.10675)
- \[19\] GigaBrain Team, B\. Wang, C\. Ni, G\. Huang, G\. Zhao, H\. Li, J\. Li, J\. Lv, J\. Liu, L\. Feng, M\. Yu, P\. Li, Q\. Deng, T\. Liu, X\. Zhou, X\. Chen, X\. Wang, Y\. Wang, Y\. Li, Y\. Nie, Y\. Li, Y\. Zhou, Y\. Ye, Z\. Liu, and Z\. Zhu \(2026\) GigaBrain\-0\.5M\*: a VLA that learns from world model\-based reinforcement learning\. arXiv:2602\.12099\. [Link](https://arxiv.org/abs/2602.12099)
- \[20\] GigaWorld Team, A\. Ye, B\. Wang, C\. Ni, G\. Huang, G\. Zhao, H\. Li, J\. Zhu, K\. Li, M\. Xu, Q\. Deng, S\. Wang, W\. Qin, X\. Chen, X\. Wang, Y\. Wang, Y\. Cao, Y\. Chang, Y\. Xu, Y\. Ye, Y\. Wang, Y\. Zhou, Z\. Zhang, Z\. Dong, and Z\. Zhu \(2025\) GigaWorld\-0: world models as data engine to empower embodied AI\. arXiv:2511\.19861\. [Link](https://arxiv.org/abs/2511.19861)
- \[21\] Y\. Guo, T\. Lee, L\. X\. Shi, J\. Chen, P\. Liang, and C\. Finn \(2026\) VLAW: iterative co\-improvement of vision\-language\-action policy and world model\. arXiv:2602\.12063\. [Link](https://arxiv.org/abs/2602.12063)
- \[22\] Y\. Guo, L\. X\. Shi, J\. Chen, and C\. Finn \(2026\) Ctrl\-World: a controllable generative world model for robot manipulation\. In International Conference on Learning Representations\. [Link](https://arxiv.org/abs/2510.10125)
- \[23\] D\. Ha and J\. Schmidhuber \(2018\) World models\. arXiv:1803\.10122\. [Link](https://arxiv.org/abs/1803.10122)
- \[24\] D\. Hafner, J\. Pasukonis, J\. Ba, and T\. Lillicrap \(2023\) Mastering diverse domains through world models\. arXiv:2301\.04104\. [Link](https://arxiv.org/abs/2301.04104)
- \[25\] S\. Huang, J\. Wu, Q\. Zhou, S\. Miao, and M\. Long \(2026\) Vid2World: crafting video diffusion models to interactive world models\. In International Conference on Learning Representations\. [Link](https://openreview.net/forum?id=pFyzqbUiF9)
- \[26\] J\. Jang, S\. Ye, Z\. Lin, J\. Xiang, J\. Bjorck, Y\. Fang, F\. Hu, S\. Huang, K\. Kundalia, Y\. Lin, L\. Magne, A\. Mandlekar, A\. Narayan, Y\. L\. Tan, G\. Wang, J\. Wang, Q\. Wang, Y\. Xu, X\. Zeng, K\. Zheng, R\. Zheng, M\. Liu, L\. Zettlemoyer, D\. Fox, J\. Kautz, S\. Reed, Y\. Zhu, and L\. Fan \(2025\) DreamGen: unlocking generalization in robot learning through neural trajectories\. In Conference on Robot Learning\. [Link](https://arxiv.org/abs/2505.12705)
- \[27\] M\. Janner, J\. Fu, M\. Zhang, and S\. Levine \(2019\) When to trust your model: model\-based policy optimization\. In Advances in Neural Information Processing Systems\. [Link](https://arxiv.org/abs/1906.08253)
- \[28\] F\. Jiang, Y\. Chen, K\. Xu, Y\. Liu, H\. Wang, Z\. Shen, J\. Lu, S\. Huang, Y\. Wang, C\. Xie, and R\. Wu \(2026\) RoboWM\-Bench: a benchmark for evaluating world models in robotic manipulation\. arXiv:2604\.19092\. [Link](https://arxiv.org/abs/2604.19092)
- \[29\] Z\. Jiang, K\. Liu, Y\. Qin, S\. Tian, Y\. Zheng, M\. Zhou, C\. Yu, H\. Li, and D\. Zhao \(2025\) World4RL: diffusion world models for policy refinement with reinforcement learning for robotic manipulation\. arXiv:2509\.19080\. [Link](https://arxiv.org/abs/2509.19080)
- \[30\] Z\. Jiang, S\. Zhou, Y\. Jiang, Z\. Huang, M\. Wei, Y\. Chen, T\. Zhou, Z\. Guo, H\. Lin, Q\. Zhang, Y\. Wang, H\. Li, C\. Yu, and D\. Zhao \(2026\) WoVR: world models as reliable simulators for post\-training VLA policies with RL\. arXiv:2602\.13977\. [Link](https://arxiv.org/abs/2602.13977)
- \[31\] B\. Krojer, M\. Komeili, C\. Ross, Q\. Garrido, K\. Sinha, N\. Ballas, and M\. Assran \(2025\) A shortcut\-aware video\-QA benchmark for physical understanding via minimal video pairs\. Transactions on Machine Learning Research\. [Link](https://arxiv.org/abs/2506.09987)
- \[32\] D\. Li, Y\. Fang, Y\. Chen, S\. Yang, S\. Cao, J\. Wong, M\. Luo, X\. Wang, H\. Yin, J\. E\. Gonzalez, I\. Stoica, S\. Han, and Y\. Lu \(2025\) WorldModelBench: judging video generation models as world models\. arXiv:2502\.20694\. [Link](https://arxiv.org/abs/2502.20694)
- \[33\] H\. Li, P\. Ding, R\. Suo, Y\. Wang, Z\. Ge, D\. Zang, K\. Yu, M\. Sun, H\. Zhang, D\. Wang, and W\. Su \(2025\) VLA\-RFT: vision\-language\-action reinforcement fine\-tuning with verified rewards in world simulators\. arXiv:2510\.00406\. [Link](https://arxiv.org/abs/2510.00406)
- \[34\] Y\. Li, Z\. Zhou, Y\. Chen, Y\. Xue, and Y\. Zhu \(2026\) dWorldEval: scalable robotic policy evaluation via discrete diffusion world model\. arXiv:2604\.22152\. [Link](https://arxiv.org/abs/2604.22152)
- \[35\] J\. Liang, R\. Liu, E\. Ozguroglu, S\. Sudhakar, A\. Dave, P\. Tokmakov, S\. Song, and C\. Vondrick \(2024\) Dreamitate: real\-world visuomotor policy learning via video generation\. In Conference on Robot Learning\. [Link](https://arxiv.org/abs/2406.16862)
- \[36\] Y\. Liao, P\. Zhou, S\. Huang, D\. Yang, S\. Chen, Y\. Jiang, Y\. Hu, S\. Liu, J\. Luo, L\. Chen, S\. Yan, M\. Yao, and G\. Ren \(2026\) Genie envisioner: a unified world foundation platform for robotic manipulation\. In International Conference on Learning Representations\. [Link](https://openreview.net/forum?id=fHLtSxDFKC)
- \[37\] H\. Lin, S\. Xiao, Y\. Li, Z\. Zhang, Y\. Sun, C\. Jia, and Y\. Yu \(2026\) ADM\-v2: pursuing full\-horizon roll\-out in dynamics models for offline policy learning and evaluation\. In International Conference on Learning Representations\. [Link](https://openreview.net/forum?id=ICbXEwqpga)
- \[38\] X\. Liu, Z\. Bai, H\. Ci, K\. Y\. Ma, and M\. Z\. Shou \(2026\) World\-VLA\-Loop: closed\-loop learning of video world model and VLA policy\. arXiv:2602\.06508\. [Link](https://arxiv.org/abs/2602.06508)
- \[39\] L\. Maes, Q\. L\. Lidec, D\. Scieur, Y\. LeCun, and R\. Balestriero \(2026\) LeWorldModel: stable end\-to\-end joint\-embedding predictive architecture from pixels\. arXiv:2603\.19312\. [Link](https://arxiv.org/abs/2603.19312)
- \[40\] L\. Mur\-Labadia, M\. Muckley, A\. Bar, M\. Assran, K\. Sinha, M\. Rabbat, Y\. LeCun, N\. Ballas, and A\. Bardes \(2026\) V\-JEPA 2\.1: unlocking dense features in video self\-supervised learning\. arXiv:2603\.14482\. [Link](https://arxiv.org/abs/2603.14482)
- \[41\] NVIDIA, A\. Ali, J\. Bai, M\. Bala, Y\. Balaji, A\. Blakeman, T\. Cai, J\. Cao, T\. Cao, E\. Cha, Y\. Chao, P\. Chattopadhyay, M\. Chen, Y\. Chen, Y\. Chen, S\. Cheng, Y\. Cui, J\. Diamond, Y\. Ding, J\. Fan, L\. Fan, L\. Feng, F\. Ferroni, S\. Fidler, X\. Fu, R\. Gao, Y\. Ge, J\. Gu, A\. Gupta, S\. Gururani, I\. E\. Hanafi, A\. Hassani, Z\. Hao, J\. Huffman, J\. Jang, P\. Jannaty, J\. Kautz, G\. Lam, X\. Li, Z\. Li, M\. Liao, C\. Lin, T\. Lin, Y\. Lin, H\. Ling, M\. Liu, X\. Liu, Y\. Lu, A\. Luo, Q\. Ma, H\. Mao, K\. Mo, S\. Nah, Y\. Narang, A\. Panaskar, L\. Pavao, T\. Pham, M\. Ramezanali, F\. Reda, S\. Reed, X\. Ren, H\. Shao, Y\. Shen, S\. Shi, S\. Song, B\. Stefaniak, S\. Sun, S\. Tang, S\. Tasmeen, L\. Tchapmi, W\. Tseng, J\. Varghese, A\. Z\. Wang, H\. Wang, H\. Wang, H\. Wang, T\. Wang, F\. Wei, J\. Xu, D\. Yang, X\. Yang, H\. Ye, S\. Ye, X\. Zeng, J\. Zhang, Q\. Zhang, K\. Zheng, A\. Zhu, and Y\. Zhu \(2025\) World simulation with video foundation models for physical AI\. Technical report, NVIDIA; project page branded as Cosmos\-Predict2\.5\. [Link](https://research.nvidia.com/labs/dir/cosmos-predict2.5/)
- \[42\] NVIDIA \(2025\) PBench: a physical AI benchmark for world models\. [Link](https://research.nvidia.com/labs/cosmos-lab/pbench/)
- \[43\] Y\. Qin, Z\. Shi, J\. Yu, X\. Wang, E\. Zhou, L\. Li, Z\. Yin, X\. Liu, L\. Sheng, J\. Shao, L\. Bai, W\. Ouyang, and R\. Zhang \(2024\) WorldSimBench: towards video generation models as world simulators\. arXiv:2410\.18072\. [Link](https://arxiv.org/abs/2410.18072)
- \[44\] J\. Quevedo, A\. K\. Sharma, Y\. Sun, V\. Suryavanshi, P\. Liang, and S\. Yang \(2025\) WorldGym: world model as an environment for policy evaluation\. arXiv:2506\.00613\. [Link](https://arxiv.org/abs/2506.00613)
- \[45\] Y\. Shang, Z\. Li, Y\. Ma, W\. Su, X\. Jin, Z\. Wang, L\. Jin, X\. Zhang, Y\. Tang, H\. Su, C\. Gao, W\. Wu, X\. Liu, D\. Shah, Z\. Zhang, Z\. Chen, J\. Zhu, Y\. Tian, T\. Chua, W\. Zhu, and Y\. Li \(2026\) WorldArena: a unified benchmark for evaluating perception and functional utility of embodied world models\. arXiv:2602\.08971\. [Link](https://arxiv.org/abs/2602.08971)
- \[46\] A\. K\. Sharma, Y\. Sun, N\. Lu, Y\. Zhang, J\. Liu, and S\. Yang \(2026\) World\-Gymnast: training robots with reinforcement learning in a world model\. arXiv:2602\.02454\. [Link](https://arxiv.org/abs/2602.02454)
- \[47\] W\. Tseng, J\. Gu, Q\. Zhang, H\. Mao, M\. Liu, F\. Shkurti, and L\. Yen\-Chen \(2025\) Scalable policy evaluation with video world models\. arXiv:2511\.11520\. [Link](https://arxiv.org/abs/2511.11520)
- \[48\] K\. Vafa, J\. Y\. Chen, A\. Rambachan, J\. Kleinberg, and S\. Mullainathan \(2024\) Evaluating the world model implicit in a generative model\. arXiv:2406\.03689\. [Link](https://arxiv.org/abs/2406.03689)
- \[49\] B\. Wang, H\. Zhang, S\. Zhang, J\. Hao, M\. Jia, Q\. Lv, Y\. Mao, Z\. Lyu, J\. Zeng, X\. Xu, and J\. Pang \(2026\) RoboVIP: multi\-view video generation with visual identity prompting augments robot manipulation\. arXiv:2601\.05241\. [Link](https://arxiv.org/abs/2601.05241)
- \[50\] R\. Wang, Q\. Liu, Y\. Deng, G\. Liu, Z\. Liu, and K\. Jia \(2026\) EVA: aligning video world models with executable robot actions via inverse dynamics rewards\. arXiv:2603\.17808\. [Link](https://arxiv.org/abs/2603.17808)
- \[51\] Y\. Wang, R\. Syed, F\. Wu, M\. Zhang, A\. Onol, J\. Barreiros, H\. Nayyeri, T\. Dear, H\. Zhang, and Y\. Li \(2026\) Interactive world simulator for robot policy training and evaluation\. arXiv:2603\.08546\. [Link](https://arxiv.org/abs/2603.08546)
- \[52\] A\. Warrier, D\. Nguyen, M\. Naim, M\. Jain, Y\. Liang, K\. Schroeder, C\. Yang, J\. B\. Tenenbaum, S\. Vollmer, K\. Ellis, and Z\. Tavares \(2025\) Benchmarking world\-model learning with environment\-level queries\. arXiv:2510\.19788\. [Link](https://arxiv.org/abs/2510.19788)
- \[53\] P\. Wu, A\. Escontrela, D\. Hafner, K\. Goldberg, and P\. Abbeel \(2023\) DayDreamer: world models for physical robot learning\. In Conference on Robot Learning\. [Link](https://arxiv.org/abs/2206.14176)
- \[54\] J\. Xiao, Y\. Yang, X\. Chang, R\. Chen, F\. Xiong, M\. Xu, W\. Zheng, and Q\. Zhang \(2026\) World\-Env: leveraging world model as a virtual environment for VLA post\-training\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition\. [Link](https://arxiv.org/abs/2509.24948)
- \[55\] M\. Xu, T\. Zhang, T\. Liu, Z\. Chen, X\. Han, and Z\. Liu \(2026\) Kinema4D: kinematic 4d world modeling for spatiotemporal embodied simulation\. arXiv:2603\.16669\. [Link](https://arxiv.org/abs/2603.16669)
- \[56\]J\. Yang, K\. Lin, J\. Li, W\. Zhang, T\. Lin, L\. Wu, Z\. Su, H\. Zhao, Y\. Zhang, L\. Chen, P\. Luo, X\. Yue, and H\. Li\(2026\)RISE: self\-improving robot policy with compositional world model\.Note:arXiv:2602\.11075External Links:[Link](https://arxiv.org/abs/2602.11075)Cited by:[§3\.6](https://arxiv.org/html/2606.15032#S3.SS6.p4.1),[Table 3](https://arxiv.org/html/2606.15032#S3.T3.1.15.14.1.1.1),[§4](https://arxiv.org/html/2606.15032#S4.p13.1)\.
- \[57\]L\. Yang, Y\. Bai, G\. Eskandar, F\. Shen, M\. Altillawi, D\. Chen, S\. Majumder, Z\. Liu, G\. Kutyniok, and A\. Valada\(2025\)RoboEnvision: a long\-horizon video generation model for multi\-task robot manipulation\.InIEEE/RSJ International Conference on Intelligent Robots and Systems,External Links:[Link](https://arxiv.org/abs/2506.22007)Cited by:[Table 4](https://arxiv.org/html/2606.15032#S3.T4.1.21.19.1.1.1)\.
- \[58\]M\. Yang, Y\. Du, K\. Ghasemipour, J\. Tompson, D\. Schuurmans, and P\. Abbeel\(2024\)Learning interactive real\-world simulators\.InInternational Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2310.06114)Cited by:[§1](https://arxiv.org/html/2606.15032#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.15032#S2.SS1.p4.1),[Table 1](https://arxiv.org/html/2606.15032#S2.T1.1.5.4.1.1),[Table 3](https://arxiv.org/html/2606.15032#S3.T3.1.7.6.1.1.1),[§4](https://arxiv.org/html/2606.15032#S4.p3.1)\.
- \[59\]Z\. Yang, Y\. Jin, L\. Qi, C\. Huang, and K\. Chen\(2026\)EA\-WM: event\-aware generative world model with structured kinematic\-to\-visual action fields\.Note:arXiv:2605\.06192External Links:[Link](https://arxiv.org/abs/2605.06192)Cited by:[Table 5](https://arxiv.org/html/2606.15032#S3.T5.1.4.3.1.1.1),[§4](https://arxiv.org/html/2606.15032#S4.p7.1),[§4](https://arxiv.org/html/2606.15032#S4.p9.1)\.
- \[60\]T\. Yin, Z\. Mei, Z\. Zheng, M\. Yamane, D\. Wang, J\. Sceats, S\. M\. Bateman, L\. Zha, A\. Badithela, O\. Shorinwa, and A\. Majumdar\(2026\)PlayWorld: learning robot world models from autonomous play\.Note:arXiv:2603\.09030External Links:[Link](https://arxiv.org/abs/2603.09030)Cited by:[Table 3](https://arxiv.org/html/2606.15032#S3.T3.1.20.19.1.1.1),[§4](https://arxiv.org/html/2606.15032#S4.p13.1)\.
- \[61\]H\. Yue, S\. Huang, Y\. Liao, S\. Chen, P\. Zhou, L\. Chen, M\. Yao, and G\. Ren\(2025\)EWMBench: evaluating scene, motion, and semantic quality in embodied world models\.Note:arXiv:2505\.09694External Links:[Link](https://arxiv.org/abs/2505.09694)Cited by:[§3\.5](https://arxiv.org/html/2606.15032#S3.SS5.SSS0.Px1.p1.1),[Table 2](https://arxiv.org/html/2606.15032#S3.T2.1.4.3.1.1.1)\.
- \[62\]J\. Zhang, Z\. Huang, C\. Gu, Z\. Ma, and L\. Zhang\(2025\)ProphRL: reinforcing action policies by prophesying\.Note:arXiv:2511\.20633External Links:[Link](https://arxiv.org/abs/2511.20633)Cited by:[Table 3](https://arxiv.org/html/2606.15032#S3.T3.1.11.10.1.1.1),[§4](https://arxiv.org/html/2606.15032#S4.p13.1),[§4](https://arxiv.org/html/2606.15032#S4.p5.1)\.
- \[63\]Z\. Zhang, R\. Chen, J\. Ye, Y\. Sun, P\. Wang, J\. Pang, K\. Li, T\. Liu, H\. Lin, Y\. Yu, and Z\. Zhou\(2024\)WHALE: towards generalizable and scalable world models for embodied decision\-making\.Note:arXiv:2411\.05619External Links:[Link](https://arxiv.org/abs/2411.05619)Cited by:[§1](https://arxiv.org/html/2606.15032#S1.p6.1),[§3\.2](https://arxiv.org/html/2606.15032#S3.SS2.p1.1),[§3\.5](https://arxiv.org/html/2606.15032#S3.SS5.SSS0.Px2.p1.1),[Table 3](https://arxiv.org/html/2606.15032#S3.T3.1.2.1.1.1.1),[§4](https://arxiv.org/html/2606.15032#S4.p11.1),[§5](https://arxiv.org/html/2606.15032#S5.SS0.SSS0.Px8.p2.1)\.
- \[64\]Z\. Zhang, H\. Ren, Y\. Sun, Y\. Sheng, H\. Wang, H\. Lin, Z\. Wu, P\. Bacon, and Y\. Yu\(2026\)Towards practical world model\-based reinforcement learning for vision\-language\-action models\.Note:ICLR 2026 World Models WorkshopExternal Links:[Link](https://openreview.net/forum?id=gB1yFEd106)Cited by:[§3\.6](https://arxiv.org/html/2606.15032#S3.SS6.p4.1),[Table 3](https://arxiv.org/html/2606.15032#S3.T3.1.21.20.1.1.1),[§4](https://arxiv.org/html/2606.15032#S4.p13.1)\.
- \[65\]S\. Zhou, Y\. Du, J\. Chen, Y\. Li, D\. Yeung, and C\. Gan\(2024\)RoboDreamer: learning compositional world models for robot imagination\.InInternational Conference on Machine Learning,External Links:[Link](https://arxiv.org/abs/2404.12377)Cited by:[Table 4](https://arxiv.org/html/2606.15032#S3.T4.1.20.18.1.1.1)\.
- \[66\]F\. Zhu, H\. Wu, S\. Guo, Y\. Liu, C\. Cheang, and T\. Kong\(2025\)IRASim: a fine\-grained world model for robot manipulation\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 9834–9844\.External Links:[Link](https://openaccess.thecvf.com/content/ICCV2025/html/Zhu_IRASim_A_Fine-Grained_World_Model_for_Robot_Manipulation_ICCV_2025_paper.html)Cited by:[§2\.1](https://arxiv.org/html/2606.15032#S2.SS1.p4.1),[Table 1](https://arxiv.org/html/2606.15032#S2.T1.1.5.4.1.1),[Table 4](https://arxiv.org/html/2606.15032#S3.T4.1.7.5.1.1.1),[§4](https://arxiv.org/html/2606.15032#S4.p11.1),[§4](https://arxiv.org/html/2606.15032#S4.p5.1)\.
- \[67\]F\. Zhu, Z\. Yan, Z\. Hong, Q\. Shou, X\. Ma, and S\. Guo\(2026\)WMPO: world model\-based policy optimization for vision\-language\-action models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=qE2FyvRvuF)Cited by:[§3\.5](https://arxiv.org/html/2606.15032#S3.SS5.SSS0.Px1.p1.1),[Table 3](https://arxiv.org/html/2606.15032#S3.T3.1.12.11.1.1.1),[§4](https://arxiv.org/html/2606.15032#S4.p17.1)\.