Tag
Presents Qwen-RobotManip, a Vision-Language-Action (VLA) foundation model for robotic manipulation. It achieves generalization through unified alignment across three dimensions (representation, motion, and behavior), which enables large-scale training on diverse data sources. It outperforms prior state-of-the-art models on multiple out-of-distribution benchmarks and demonstrates emergent capabilities such as zero-shot instruction following and cross-embodiment transfer.