Tag
AHA-WAM is an asynchronous world-action model that uses dual Diffusion Transformers to decouple world prediction from action execution, enabling efficient long-horizon planning alongside real-time control. It reports state-of-the-art performance on robotic manipulation tasks, with up to 92.8% success on RoboTwin and 78.3% on real-world tasks, while running closed-loop control at 24.17 Hz.
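To make the idea of decoupling a slow predictor from a fast controller concrete, here is a minimal, deterministic sketch. It is not AHA-WAM's implementation: the function `run_loop`, the parameter `plan_latency`, and the tick-based timing are all assumptions made for illustration, and no Diffusion Transformer is involved. It only shows a controller that acts on every tick with the newest finished plan instead of waiting for the planner.

```python
def run_loop(num_ticks: int, plan_latency: int) -> list[tuple[int, int]]:
    """Simulate a fast control loop fed by a slower planner (hypothetical setup).

    The planner needs `plan_latency` ticks to finish one plan. The controller
    never blocks on it: each tick it acts using the latest completed plan.
    Returns (tick, plan_version_used) pairs.
    """
    version = 0       # plan 0 is assumed to be available at start
    started_at = 0    # tick at which the planner began its current job
    log = []
    for tick in range(num_ticks):
        # The planner's job completes once plan_latency ticks have elapsed.
        if tick - started_at >= plan_latency:
            version += 1
            started_at = tick  # the planner immediately starts the next plan
        # The controller acts every tick with whatever plan is newest.
        log.append((tick, version))
    return log


log = run_loop(num_ticks=8, plan_latency=3)
for tick, version in log:
    print(f"tick {tick}: acting on plan v{version}")
```

In this toy setup the control rate is set by the tick loop alone, while the plan refresh rate is set by `plan_latency`; the two can be tuned independently, which is the property the summary attributes to the asynchronous design.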
WALL-WM advances video-action learning by using semantic events as learning units instead of fixed action chunks, enabling more flexible and scalable vision-language-action training and inference.
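The contrast between fixed action chunks and semantic events can be sketched as follows. This is an illustrative assumption, not WALL-WM's method: it presumes each timestep already carries an event label, and the names `fixed_chunks`, `event_segments`, and the labels `reach`, `grasp`, and `lift` are hypothetical.

```python
from itertools import groupby


def fixed_chunks(steps, size):
    """Split a trajectory into chunks of a fixed length, ignoring semantics."""
    return [steps[i:i + size] for i in range(0, len(steps), size)]


def event_segments(steps):
    """Group consecutive steps that share an event label into one unit."""
    return [
        (label, [action for _, action in group])
        for label, group in groupby(steps, key=lambda step: step[0])
    ]


# Hypothetical trajectory: (event_label, action) per timestep.
steps = [
    ("reach", "a0"), ("reach", "a1"), ("reach", "a2"),
    ("grasp", "a3"),
    ("lift", "a4"), ("lift", "a5"),
]

chunks = fixed_chunks(steps, 4)
segments = event_segments(steps)
print([len(c) for c in chunks])
print([(label, len(actions)) for label, actions in segments])
```

With a chunk size of 4, the first fixed chunk mixes `reach` and `grasp` steps, whereas the event-based units vary in length and each stays within a single event.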
Two projects collect work related to robot world models: Awesome-WAM (OpenMOSS), which includes papers such as World Action Models and DreamDojo, and awesome-physical-ai, which curates papers on VLA models, world models, and embodied foundation models (including NVIDIA Cosmos Predict2.5).
Jim Fan, NVIDIA's head of robotics, gave a public talk arguing that robotics should directly replicate the path that made large language models successful. He proposed directions including the World Action Model (WAM), a data revolution based on human first-person video, and neural simulation, and put a 95% probability on reaching the endgame of general-purpose physical robots by 2040.