MolmoAct2: Action Reasoning Models for Real-world Deployment
Summary
Allen AI releases MolmoAct2, an open-weight Vision-Language-Action model designed for real-world robotic deployment, featuring new datasets, an open action tokenizer, and adaptive reasoning to reduce latency.
Cached at: 05/08/26, 08:52 AM
Source: https://huggingface.co/papers/2605.02881
Published on May 4
Submitted by Duan (https://huggingface.co/Jiafei1224) on May 5
#1 Paper of the day
Abstract
MolmoAct2 presents an open action reasoning model for robotics that improves on previous systems through a specialized vision-language-model backbone, new datasets, an open-weight action tokenizer, an architectural redesign for continuous-action prediction, and adaptive reasoning for reduced latency.
Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today’s systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
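The abstract does not say how MolmoThink decides which scene regions have changed, so the following is only a minimal sketch of the general idea of re-predicting depth tokens for changed regions and reusing cached tokens elsewhere. It assumes a simple per-patch mean pixel difference as the change test, which may differ from the paper's method. The names `changed_patches`, `update_depth_tokens`, and the `predict` callback are hypothetical and are not part of any MolmoAct2 API.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Frame = Sequence[Sequence[float]]  # grayscale image as rows of pixel values
PatchIndex = Tuple[int, int]       # (patch_row, patch_col)


def changed_patches(prev: Frame, curr: Frame, patch: int,
                    threshold: float) -> List[PatchIndex]:
    """Return patch indices whose mean absolute pixel change exceeds threshold."""
    height, width = len(curr), len(curr[0])
    changed = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            total, count = 0.0, 0
            for y in range(top, min(top + patch, height)):
                for x in range(left, min(left + patch, width)):
                    total += abs(curr[y][x] - prev[y][x])
                    count += 1
            if total / count > threshold:
                changed.append((top // patch, left // patch))
    return changed


def update_depth_tokens(cache: Dict[PatchIndex, int], prev: Frame, curr: Frame,
                        predict: Callable[[Frame, PatchIndex], int],
                        patch: int = 2, threshold: float = 0.1):
    """Copy cached depth tokens and re-predict only those for changed patches.

    `predict` is a hypothetical stand-in for the expensive depth-token model.
    """
    updated = dict(cache)  # leave the caller's cache untouched
    recomputed = changed_patches(prev, curr, patch, threshold)
    for idx in recomputed:
        updated[idx] = predict(curr, idx)
    return updated, recomputed


if __name__ == "__main__":
    prev = [[0.0] * 4 for _ in range(4)]
    curr = [row[:] for row in prev]
    curr[0][3] = 1.0  # one pixel changes, inside patch (0, 1)
    cache = {(r, c): 0 for r in range(2) for c in range(2)}
    tokens, redone = update_depth_tokens(cache, prev, curr, lambda f, i: 7)
    print(redone, tokens)
```

In this toy setup only one of the four patches is passed to `predict`, which is where the latency saving would come from if the predictor were a large model; the patch size and threshold are arbitrary illustrative values.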
Get this paper in your agent:
hf papers read 2605.02881
Don’t have the latest CLI? Install it with:
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 10
#### allenai/Molmo2-ER Image-Text-to-Text • 5B • Updated about 7 hours ago • 1.05k • 5
#### allenai/MolmoAct2 Robotics • 5B • Updated 3 days ago • 33 • 4
#### allenai/MolmoAct2-Pretrain Robotics • 5B • Updated 3 days ago • 20 • 3
#### allenai/MolmoAct2-SO100_101 Robotics • 5B • Updated 3 days ago • 48 • 3
Datasets citing this paper: 737
#### allenai/24112025-yam-01 Viewer • Updated 3 days ago • 102k • 1.52k
#### allenai/05012026-plate-cleaning-09 Viewer • Updated 3 days ago • 71.7k • 575
#### allenai/28112025-yam-02 Viewer • Updated 3 days ago • 61.8k • 550
#### allenai/25112025-yam-04 Viewer • Updated 3 days ago • 68.3k • 547
Spaces citing this paper: 2
Collections including this paper: 12
Similar Articles
MolmoAct 2
MolmoAct 2 is an open robotics model that reasons in 3D space before taking actions, developed by the Allen Institute for Artificial Intelligence.
AllenAI has been iterating on their MolmoAct2 models for robotics
AllenAI has released open-source MolmoAct2 models for robot control, with multiple fine-tuned versions for different tasks, including full datasets and training code.
AllenAI releases MolmoMotion vision models for predicting future motion based on short frame history
AllenAI releases MolmoMotion, a vision model designed to predict future motion based on a short history of frames.
MolmoMotion: Language-guided 3D motion forecasting
MolmoMotion is a new language-guided 3D motion forecasting model that predicts future 3D point trajectories from video frames and action descriptions, achieving stronger performance than existing methods. Alongside the model, a large dataset (MolmoMotion-1M) and a benchmark (PointMotionBench) are released.
Reason-Imagine-Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving
Proposes Reason-Imagine-Act (RIA), a closed-loop framework coupling an LLM reasoner with an action-conditioned world model for online safety verification in autonomous driving, achieving 80.05% route completion and 0.20% collision rate in CARLA simulations.