Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Hugging Face Daily Papers 06/15/26, 12:00 AM Papers

embodied-ai world-model video-generation language-conditioned diffusion-transformer robotics technical-report

Summary

Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple embodied domains, using a double-stream diffusion transformer trained on an 8.6M video-text corpus. It unifies embodied world modeling for robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer, and ranks first overall on EWMBench and DreamGen Bench.

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation opens three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
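
To make "layer-wise joint attention" in a double-stream design more concrete, here is a minimal NumPy sketch of the general idea: each stream keeps its own projection weights, but attention runs over the concatenated text and video tokens, so every layer lets the two streams exchange information. This is an illustration under loud assumptions, not the report's implementation. The names DoubleStreamBlock and run_stack, the single attention head, the parameter-free layer norm, the random weights, and the toy token counts are all hypothetical, and timestep conditioning and MLP sublayers are omitted. Random arrays stand in for the Qwen2.5-VL features and the video-VAE latents.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance (no learned scale)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


class DoubleStreamBlock:
    """Hypothetical block: per-stream weights, one attention over both streams."""

    def __init__(self, dim, rng):
        self.dim = dim
        # Separate Q/K/V/output projections for the text and video streams.
        self.weights = {
            stream: {
                name: rng.normal(0.0, dim ** -0.5, size=(dim, dim))
                for name in ("q", "k", "v", "out")
            }
            for stream in ("text", "video")
        }

    def __call__(self, text, video):
        n_text = text.shape[0]
        streams = {"text": layer_norm(text), "video": layer_norm(video)}
        # Project each stream with its own weights, then concatenate along tokens.
        q, k, v = (
            np.concatenate(
                [streams[s] @ self.weights[s][name] for s in ("text", "video")],
                axis=0,
            )
            for name in ("q", "k", "v")
        )
        # Joint attention: every token attends to text and video tokens alike.
        attn = softmax(q @ k.T / np.sqrt(self.dim))
        mixed = attn @ v
        # Split back into streams and apply per-stream output projections.
        text_out = text + mixed[:n_text] @ self.weights["text"]["out"]
        video_out = video + mixed[n_text:] @ self.weights["video"]["out"]
        return text_out, video_out


def run_stack(text, video, num_layers=60, seed=0):
    """Apply num_layers blocks in sequence; 60 is the depth stated in the abstract."""
    rng = np.random.default_rng(seed)
    for _ in range(num_layers):
        text, video = DoubleStreamBlock(text.shape[1], rng)(text, video)
    return text, video


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    text_tokens = rng.normal(size=(5, 16))    # stand-in for language features
    video_tokens = rng.normal(size=(12, 16))  # stand-in for video latents
    t, v = run_stack(text_tokens, video_tokens, num_layers=4)
    print(t.shape, v.shape)
```

Because attention is computed over the concatenated sequence, changing only the text tokens changes the video-stream output, which is the property that lets a language instruction steer the predicted frames in a design of this kind.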

Original Article

Cached at: 06/16/26, 11:34 AM

Source: https://huggingface.co/papers/2606.17030


Similar Articles

Qwen's Embodied World Modeling (28 minute read)

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qwen-Robot Suite: A Foundation Model Suite for Physical World Intelligence

Qwen-AgentWorld: Language World Models for General Agents

Qwen-Image-2.0 Technical Report
