Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
Summary
Embodied-R1.5 is a unified embodied foundation model that achieves state-of-the-art performance on 16 out of 24 embodied vision-language benchmarks using multi-task balanced reinforcement learning. It introduces a Planner-Grounder-Corrector closed-loop framework for long-horizon tasks and is open-sourced to facilitate future research.
Cached at: 06/11/26, 01:40 PM
Source: https://huggingface.co/papers/2606.11324
Abstract
Embodied-R1.5 is a unified embodied foundation model that integrates embodied reasoning capabilities and achieves state-of-the-art performance on embodied vision-language benchmarks through a multi-task balanced reinforcement learning approach.
We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous task conflicts. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute and self-correct over long-horizon tasks. With only 8B parameters, Embodied-R1.5 achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing leading models like Gemini-Robotics-ER-1.5 and GPT-5.4. Benefiting from the internalized embodied capabilities, Embodied-R1.5 can be fine-tuned into a VLA with only a small amount of data, outperforming leading VLA models like π_{0.5} across 4 popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments, validating performance in instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, demonstrating strong generalization to the physical world. We open-source model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks, to facilitate future research in EFMs.
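The abstract names the Planner-Grounder-Corrector (PGC) closed loop but this page does not describe its internals. The following is only a minimal sketch of what such a loop might look like, assuming the three roles can be treated as callables: `plan` splits an instruction into subtasks, `ground` maps a subtask to a 2D target point, and `check` plays the corrector by reporting whether the subtask succeeded. All names here (`run_pgc_loop`, `StepResult`, `max_retries`, and the callable signatures) are hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[int, int]  # hypothetical: a 2D image coordinate produced by grounding


@dataclass
class StepResult:
    subtask: str
    point: Point
    attempts: int
    succeeded: bool


def run_pgc_loop(
    instruction: str,
    plan: Callable[[str], List[str]],       # planner: instruction -> ordered subtasks
    ground: Callable[[str], Point],         # grounder: subtask -> target point
    execute: Callable[[str, Point], None],  # low-level controller (assumed external)
    check: Callable[[str], bool],           # corrector: did the subtask succeed?
    max_retries: int = 2,
) -> List[StepResult]:
    """Run each planned subtask, re-grounding and retrying when the check fails."""
    results: List[StepResult] = []
    for subtask in plan(instruction):
        attempts = 0
        succeeded = False
        point: Point = (0, 0)
        # One initial attempt plus up to `max_retries` corrective attempts.
        while attempts <= max_retries and not succeeded:
            attempts += 1
            point = ground(subtask)
            execute(subtask, point)
            succeeded = check(subtask)
        results.append(StepResult(subtask, point, attempts, succeeded))
        if not succeeded:
            # Stop the long-horizon task once a subtask cannot be recovered.
            break
    return results
```

In this sketch a single model could back all three callables, which is the property the abstract attributes to PGC; how the actual system shares the model, formats prompts, or decides to replan is not specified on this page.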
Get this paper in your agent:
hf papers read 2606.11324
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
tencent/HY-Embodied-0.5
Tencent releases HY-Embodied-0.5, a suite of foundation models designed for embodied AI agents featuring a Mixture-of-Transformers (MoT) architecture with efficient 2B and powerful 32B variants for real-world robot control and spatial-temporal reasoning.
PhysBrain 1.0 Technical Report
PhysBrain 1.0 is a technical report presenting a method that uses human egocentric video to generate physical commonsense supervision for vision-language-action models, achieving state-of-the-art results on embodied control benchmarks including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa.
Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction
This paper proposes Embodied-BenchClaw, an autonomous multi-agent system that automatically constructs embodied spatial intelligence benchmarks from user intent through a five-stage pipeline with process quality control and an extensible Skill Library.
EasyVideoR1: Easier RL for Video Understanding
EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks, featuring offline preprocessing with tensor caching for 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.
ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
Introduces ESI-Bench, a comprehensive benchmark for embodied spatial intelligence built on OmniGibson, covering 10 task categories and 29 subcategories. Experiments show that active exploration substantially outperforms passive approaches and that failures stem mainly from action blindness rather than perception, revealing a metacognitive gap between models and humans.