Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Hugging Face Daily Papers

Summary

Embodied-R1.5 is a unified embodied foundation model that achieves state-of-the-art performance on 16 out of 24 embodied vision-language benchmarks using multi-task balanced reinforcement learning. It introduces a Planner-Grounder-Corrector closed-loop framework for long-horizon tasks and is open-sourced to facilitate future research.

We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous task conflicts. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute and self-correct over long-horizon tasks. With only 8B parameters, Embodied-R1.5 achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing leading models like Gemini-Robotics-ER-1.5 and GPT-5.4. Benefiting from the internalized embodied capabilities, Embodied-R1.5 can be fine-tuned into a VLA with only a small amount of data, outperforming leading VLA models like π_{0.5} across 4 popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments, validating performance in instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, demonstrating strong generalization to the physical world. We open-source model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks, to facilitate future research in EFMs.
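
The Planner-Grounder-Corrector loop described above can be illustrated with a minimal Python sketch. The abstract does not specify interfaces, so everything here is an assumption made for illustration: the names `run_pgc`, `PGCResult`, `plan`, `ground`, `execute`, and `check`, the retry limit, and the use of a 2D point as the grounding output are all hypothetical. In the paper a single model plays all three roles; in this sketch the three callables could simply wrap the same model with different prompts.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # hypothetical grounding output: a 2D image point


@dataclass
class PGCResult:
    completed: List[str] = field(default_factory=list)  # sub-tasks verified as done
    failed_attempts: int = 0                             # attempts the corrector rejected
    success: bool = True


def run_pgc(
    instruction: str,
    plan: Callable[[str], List[str]],       # Planner: instruction -> ordered sub-tasks
    ground: Callable[[str], Point],         # Grounder: sub-task -> target point
    execute: Callable[[str, Point], None],  # low-level controller (not part of the model)
    check: Callable[[str], bool],           # Corrector: did the sub-task succeed?
    max_retries: int = 2,
) -> PGCResult:
    """Closed-loop execution: plan once, then ground, act, and verify each step."""
    result = PGCResult()
    for step in plan(instruction):
        for _ in range(max_retries + 1):
            target = ground(step)   # re-ground on every attempt
            execute(step, target)
            if check(step):
                result.completed.append(step)
                break
            result.failed_attempts += 1
        else:
            # Every attempt at this step was rejected: stop the episode.
            result.success = False
            return result
    return result


def demo() -> PGCResult:
    """Toy run with stub components; the second step fails once, then succeeds."""
    attempts = {}

    def plan(instruction: str) -> List[str]:
        return ["open drawer", "pick cup"]

    def ground(step: str) -> Point:
        return (0.5, 0.5)

    def execute(step: str, target: Point) -> None:
        attempts[step] = attempts.get(step, 0) + 1

    def check(step: str) -> bool:
        return not (step == "pick cup" and attempts[step] == 1)

    return run_pgc("put the cup away", plan, ground, execute, check)
```

Calling `demo()` runs the loop on two stub sub-tasks. Re-grounding inside the retry loop is a design choice of this sketch, made so that a corrected attempt can pick a new target rather than repeat the old one.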

Cached at: 06/11/26, 01:40 PM

Source: https://huggingface.co/papers/2606.11324

Abstract

Embodied-R1.5 is a unified embodied foundation model that integrates embodied reasoning capabilities and achieves state-of-the-art performance on embodied vision-language benchmarks through a multi-task balanced reinforcement learning approach.


Similar Articles

tencent/HY-Embodied-0.5

Hugging Face Models Trending

Tencent releases HY-Embodied-0.5, a suite of foundation models designed for embodied AI agents featuring a Mixture-of-Transformers (MoT) architecture with efficient 2B and powerful 32B variants for real-world robot control and spatial-temporal reasoning.

PhysBrain 1.0 Technical Report

Hugging Face Daily Papers

PhysBrain 1.0 is a technical report presenting a method that uses human egocentric video to generate physical commonsense supervision for vision-language-action models, achieving state-of-the-art results on embodied control benchmarks including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa.

EasyVideoR1: Easier RL for Video Understanding

Hugging Face Daily Papers

EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks, featuring offline preprocessing with tensor caching for 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Hugging Face Daily Papers

ESI-Bench is a comprehensive benchmark for embodied spatial intelligence built on OmniGibson, covering 10 task categories and 29 subcategories. Experiments show that active exploration substantially outperforms passive approaches and that failures stem mainly from action blindness rather than perception, revealing a metacognitive gap between models and humans.