Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
Summary
Presents Qwen-RobotManip, a Vision-Language-Action foundation model for robotic manipulation that achieves generalization through unified alignment across representation, motion, and behavior dimensions, enabling large-scale training on diverse data sources. It outperforms prior state-of-the-art models across multiple out-of-distribution benchmarks and demonstrates emergent capabilities like zero-shot instruction following and cross-embodiment transfer.
Cached at: 06/29/26, 10:05 PM
Source: https://huggingface.co/papers/2606.17846
Abstract
A Vision-Language-Action foundation model for robotic manipulation achieves generalization through unified alignment across representation, motion, and behavior dimensions, enabling large-scale training on diverse data sources.
Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including π0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.
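The abstract reports a "20% relative improvement" on RoboChallenge without giving the underlying scores. As a reading aid, here is a minimal sketch of how a relative improvement is conventionally computed, as opposed to an absolute gain in points. The function name `relative_improvement` and both scores are hypothetical and are not taken from the report.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the gain of `new` over `baseline` as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Hypothetical scores chosen for illustration only; not from the report.
baseline_score = 50.0
new_score = 60.0

gain = relative_improvement(baseline_score, new_score)
absolute_gain = new_score - baseline_score
print(f"relative: {gain:.0%}, absolute: {absolute_gain:.1f} points")
```

With these made-up numbers, a 10-point absolute gain corresponds to a 20% relative gain, which is why the two kinds of figure should not be compared directly.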
Similar Articles
Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation
Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple robotic domains using a double-stream diffusion transformer and an 8.6M video-text corpus. It unifies embodied world modeling for robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer, achieving top benchmarks on EWMBench and DreamGen Bench.
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Qwen-VLA is a unified vision-language-action model for embodied decision-making, integrating manipulation, navigation, and trajectory prediction across different robot platforms. It uses a DiT-based action decoder and embodiment-aware prompt conditioning, achieving strong performance and out-of-distribution generalization.
Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System
Qwen-RobotNav is a scalable navigation model with a parameterized interface enabling dynamic task modes and observation parameters, achieving state-of-the-art performance through multi-task training and zero-shot generalization to real-world robotics.
Qwen's Embodied World Modeling (28 minute read)
The Qwen-RobotWorld technical report presents a unified language-conditioned video world model for embodied intelligence, enabling future video prediction from current observations across various domains like robotics, autonomous driving, and navigation, with applications in synthetic data generation, policy evaluation, and planning.
Qwen-Robot Suite: A Foundation Model Suite for Physical World Intelligence
Qwen-Robot Suite is a foundation model suite designed for physical world intelligence, enabling robots to understand and interact with the real world effectively.