Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?
Summary
Dream.exe proposes an evaluation framework that uses robotic manipulation tasks to assess video generation models' understanding of physical reality, finding that visual quality does not predict executable motion accuracy.
Cached at: 06/05/26, 02:08 PM
Source: https://huggingface.co/papers/2606.04811
Abstract
Video generation models were evaluated through robotic manipulation tasks to assess their ability to reflect physical reality, revealing that visual quality does not predict executable motion accuracy.
Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.
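The three stages named in the abstract (synthesize a video, convert it to a robot trajectory, execute it in a simulator) can be pictured as a simple evaluation loop. The sketch below is illustrative only: the paper's code is not yet released, so every name here (`Task`, `Result`, `evaluate`, `success_rate`, and the three stage callables) is a hypothetical placeholder and not the Dream.exe API. The stages are passed in as plain functions so the loop stays independent of any particular video model or simulator.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stand-in types; real frames and poses would be arrays.
Frame = List[float]
Waypoint = List[float]


@dataclass
class Task:
    scene_image: str   # path or identifier of the scene image
    description: str   # natural-language task description


@dataclass
class Result:
    task: Task
    success: bool      # whether the executed trajectory completed the task


def evaluate(
    tasks: Sequence[Task],
    generate_video: Callable[[Task], List[Frame]],
    video_to_trajectory: Callable[[List[Frame]], List[Waypoint]],
    execute_in_sim: Callable[[Task, List[Waypoint]], bool],
) -> List[Result]:
    """Run each task through the three assumed stages and record success."""
    results = []
    for task in tasks:
        frames = generate_video(task)              # stage 1: synthesize video
        trajectory = video_to_trajectory(frames)   # stage 2: extract motion
        success = execute_in_sim(task, trajectory) # stage 3: simulate
        results.append(Result(task, success))
    return results


def success_rate(results: Sequence[Result]) -> float:
    """Fraction of tasks whose execution succeeded (0.0 for no tasks)."""
    if not results:
        return 0.0
    return sum(r.success for r in results) / len(results)
```

In this framing, execution success is computed only from the simulator outcome, which is why it can diverge from a visual quality score computed on the frames alone.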
Similar Articles
StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement
StressDream enhances video world models by steering diffusion-based imaginations toward high-impact yet plausible outcomes through optimized noise initialization with semantic and plausibility objectives, enabling robust policy evaluation and improvement.
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI introduces a framework that turns text-conditioned synthetic videos into physically plausible dexterous robot control via a hybrid 3D-2D tracking reward, enabling zero-shot generalization to unseen objects.
MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
MotiMotion introduces a reasoning-then-generation framework for motion-controlled video generation that uses vision-language reasoning to refine trajectories and a confidence-aware control scheme to improve plausibility, outperforming existing approaches on a new benchmark.
Video Models Can Reason with Verifiable Rewards
VideoRLVR optimizes video diffusion models for verifiable reasoning tasks using reinforcement learning with rule-based rewards, achieving better performance than supervised methods in constraint-satisfying video generation.
'Touch dreaming' helps humanoid robots handle five tricky tasks with 90.9% higher success
Researchers from CMU and Bosch Center for AI introduced the Humanoid Transformer with Touch Dreaming (HTD) model, which uses tactile signal prediction to improve humanoid robot manipulation, achieving a 90.9% higher average success rate over the ACT baseline across five real-world tasks.