MolmoAct2: Action Reasoning Models for Real-world Deployment

Papers with Code Trending Models

Summary

Allen AI releases MolmoAct2, an open-weight Vision-Language-Action model designed for real-world robotic deployment, featuring new datasets, an open action tokenizer, and adaptive reasoning to reduce latency.

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
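The adaptive-depth idea behind MolmoThink (re-predicting depth tokens only for scene regions that change between timesteps) can be illustrated with a small sketch. The abstract does not say how changed regions are detected or how depth tokens are produced, so everything here is an assumption for illustration: the patch representation, the mean-absolute-difference test, the threshold, and the names `changed_patches`, `update_depth_tokens`, and `toy_depth_predictor` are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Patch = Sequence[float]  # flattened pixel intensities for one image patch (assumed layout)


def changed_patches(prev: Sequence[Patch], curr: Sequence[Patch],
                    threshold: float = 0.05) -> List[int]:
    """Return indices of patches whose mean absolute difference exceeds threshold."""
    changed = []
    for i, (p, c) in enumerate(zip(prev, curr)):
        diff = sum(abs(a - b) for a, b in zip(p, c)) / len(p)
        if diff > threshold:
            changed.append(i)
    return changed


def update_depth_tokens(cached_tokens: List[int],
                        prev: Sequence[Patch], curr: Sequence[Patch],
                        predict_depth_token: Callable[[Patch], int],
                        threshold: float = 0.05) -> Tuple[List[int], List[int]]:
    """Re-predict depth tokens only for changed patches; reuse the cache elsewhere."""
    stale = changed_patches(prev, curr, threshold)
    tokens = list(cached_tokens)
    for i in stale:
        tokens[i] = predict_depth_token(curr[i])
    return tokens, stale


def toy_depth_predictor(patch: Patch) -> int:
    """Hypothetical stand-in for a depth head: bucket mean intensity into 0..9."""
    mean = sum(patch) / len(patch)
    return min(9, int(mean * 10))


prev_frame = [[0.125, 0.125], [0.5, 0.5], [0.875, 0.875]]
curr_frame = [[0.125, 0.125], [0.25, 0.25], [0.875, 0.875]]  # only the middle patch changed
cached = [toy_depth_predictor(p) for p in prev_frame]
tokens, recomputed = update_depth_tokens(cached, prev_frame, curr_frame, toy_depth_predictor)
print(tokens, recomputed)
```

In this toy setup the predictor runs once per changed patch instead of once per patch, which is the source of the latency saving the abstract describes; the real model presumably operates on learned tokens rather than raw intensities.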

Cached at: 05/08/26, 08:52 AM


Source: https://huggingface.co/papers/2605.02881

Published on May 4

Submitted by Duan (https://huggingface.co/Jiafei1224) on May 5

#1 Paper of the day

Abstract

MolmoAct2 is an open action reasoning model for robotics that improves on its predecessor through a specialized vision-language model backbone, new datasets, an open-weight action tokenizer, an architecture redesigned for continuous-action prediction, and adaptive reasoning for reduced latency.

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today’s systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
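For readers unfamiliar with flow matching, the following is a minimal sketch of how a flow-matching action head typically produces a continuous action at inference time: start from Gaussian noise and integrate a velocity field from t=0 to t=1. It is not MolmoAct2's implementation. The abstract gives no sampler details, the per-layer KV-cache conditioning on the VLM is omitted entirely, and the learned network is replaced by a closed-form toy field. The names `euler_sample` and `make_toy_velocity`, the step count, and the 7-dimensional action are assumptions.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a learned network: the straight-line field
    v(x, t) = (target - x) / (1 - t), which carries any start point to `target`."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(7)]        # assumed 7-dimensional action
target_action = [0.1, -0.2, 0.3, 0.0, 0.5, -0.5, 1.0]  # made-up values for the demo
action = euler_sample(make_toy_velocity(target_action), noise, num_steps=10)
print([round(a, 6) for a in action])
```

In a real policy the velocity function would be a neural network conditioned on the observation and instruction, and the number of integration steps trades action quality against latency.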


Get this paper in your agent:

hf papers read 2605.02881

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 10

#### allenai/Molmo2-ER
Image-Text-to-Text • 5B • Updated about 7 hours ago • 1.05k • 5

#### allenai/MolmoAct2
Robotics • 5B • Updated 3 days ago • 33 • 4

#### allenai/MolmoAct2-Pretrain
Robotics • 5B • Updated 3 days ago • 20 • 3

#### allenai/MolmoAct2-SO100_101
Robotics • 5B • Updated 3 days ago • 48 • 3

## Datasets citing this paper: 737

#### allenai/24112025-yam-01
Viewer • Updated 3 days ago • 102k • 1.52k

#### allenai/05012026-plate-cleaning-09
Viewer • Updated 3 days ago • 71.7k • 575

#### allenai/28112025-yam-02
Viewer • Updated 3 days ago • 61.8k • 550

#### allenai/25112025-yam-04
Viewer • Updated 3 days ago • 68.3k • 547

### Spaces citing this paper: 2

Collections including this paper: 12
