Geometric Action Model for Robot Policy Learning

Hugging Face Daily Papers

Summary

The Geometric Action Model (GAM) repurposes a pretrained geometric foundation model (GFM) as a unified backbone for language-conditioned robot manipulation, achieving higher accuracy, robustness, and efficiency than existing foundation-model-scale baselines across simulation and real-world benchmarks.

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
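
The data flow described above (shallow blocks as encoder, a causal predictor at the split layer, remaining blocks for decoding) can be illustrated with a small toy sketch. This is not the authors' implementation: the names `make_toy_block`, `predict_future`, and `gam_forward`, the linear-extrapolation predictor, the single conditioning vector, and the mean-pooled action head are all hypothetical stand-ins chosen only to show how a single backbone could be split and reused.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]            # one latent token (toy feature vector)
Frame = List[Token]            # the tokens for one observation
Block = Callable[[Frame], Frame]


def make_toy_block(scale: float) -> Block:
    """Hypothetical stand-in for one pretrained GFM block."""
    def block(tokens: Frame) -> Frame:
        dim = len(tokens[0])
        # Mix every token with the mean token (a crude proxy for attention).
        mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
        return [[scale * x + (1.0 - scale) * m for x, m in zip(t, mean)]
                for t in tokens]
    return block


def run_blocks(blocks: Sequence[Block], tokens: Frame) -> Frame:
    """Apply a stack of blocks in order."""
    for block in blocks:
        tokens = block(tokens)
    return tokens


def predict_future(latents: Sequence[Frame], condition: Token) -> Frame:
    """Toy causal predictor (assumption): extrapolate the next latent frame
    from past frames only, shifted by one conditioning vector that stands in
    for language, proprioception, and action history."""
    last = latents[-1]
    prev = latents[-2] if len(latents) > 1 else last
    return [[2.0 * l - p + c for l, p, c in zip(lt, pt, condition)]
            for lt, pt in zip(last, prev)]


def gam_forward(blocks: Sequence[Block], split: int,
                frames: Sequence[Frame], condition: Token) -> Tuple[Frame, Token]:
    """Split the backbone at `split`, predict future tokens there, then decode."""
    assert 0 < split < len(blocks), "split must fall strictly inside the backbone"
    shallow, deep = blocks[:split], blocks[split:]
    latents = [run_blocks(shallow, f) for f in frames]   # observation encoder
    future = predict_future(latents, condition)          # predictor at the split
    features = run_blocks(deep, future)                  # remaining backbone blocks
    geometry = features                                  # stand-in geometry output
    dim = len(features[0])
    # Stand-in action head: mean-pool the decoded tokens.
    action = [sum(t[d] for t in features) / len(features) for d in range(dim)]
    return geometry, action


# Usage: a 4-block toy backbone split in the middle, two observed frames.
blocks = [make_toy_block(0.5) for _ in range(4)]
frames = [[[0.0, 1.0], [2.0, 3.0]], [[1.0, 1.0], [2.0, 4.0]]]
geometry, action = gam_forward(blocks, split=2, frames=frames, condition=[0.1, -0.1])
print(len(geometry), len(action))
```

In this sketch the same list of blocks serves both the encoder and the decoder, which mirrors the abstract's point that a single backbone produces both future geometry and actions. A real system would replace each toy piece with learned modules.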

Cached at: 06/16/26, 11:32 AM

Source: https://huggingface.co/papers/2606.17046

Abstract

A geometric action model leverages pretrained geometric foundation models to enable language-conditioned manipulation policies with improved accuracy, robustness, and efficiency in 3D physical environments.

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.

Similar Articles

Revisiting Articulated Parts Perception in Robot Manipulation

Hugging Face Daily Papers

This paper introduces Geometric Primary Structure (GPS), a new representation for articulated parts perception in robot manipulation, enabling efficient VR-based annotation and achieving a 73% success rate without fine-tuning.

Learning Agentic Policy from Action Guidance

arXiv cs.CL

The paper proposes ActGuide-RL, a method for training agentic policies in LLMs by using human action data as guidance to overcome exploration barriers in reinforcement learning without extensive supervised fine-tuning.

World Model for Robot Learning: A Comprehensive Survey

Hugging Face Daily Papers

This comprehensive survey reviews the literature on world models for robot learning, covering their roles in policy learning, planning, and simulation. It highlights key paradigms, benchmarks, and future directions for predictive modeling in embodied agents.

τ_0-WM: A Unified Video-Action World Model for Robotic Manipulation

Hugging Face Daily Papers

τ_0-WM is a unified video-action world model for robotic manipulation that integrates policy learning, video prediction, and action evaluation using a shared video diffusion backbone. It shows superior performance on challenging long-horizon and fine-grained tasks.