PhysBrain 1.0 Technical Report

Hugging Face Daily Papers 05/14/26, 12:00 AM Papers

Summary

PhysBrain 1.0 is a technical report presenting a method that uses human egocentric video to generate physical commonsense supervision for vision-language-action models, achieving state-of-the-art results on multimodal QA and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa.

Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.
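To make the data-engine step concrete, here is a minimal sketch of how structured annotations from one egocentric clip might be turned into question-answer supervision. It is an illustration only, not the authors' implementation: the `ClipAnnotation` record, its field names, the question templates, and the `build_qa_pairs` helper are all assumptions, chosen to mirror the four annotation types named in the abstract (scene elements, spatial dynamics, action execution, depth-aware relations).

```python
from dataclasses import dataclass, field


@dataclass
class ClipAnnotation:
    """Hypothetical per-clip record; field names are illustrative, not from the paper."""
    scene_elements: list[str]
    spatial_dynamics: str  # free-text description of how things move
    action: str            # free-text description of the action performed
    # Each pair is (nearer object, farther object) relative to the camera.
    depth_relations: list[tuple[str, str]] = field(default_factory=list)


def build_qa_pairs(ann: ClipAnnotation) -> list[dict[str, str]]:
    """Turn one annotation record into templated QA pairs (assumed templates)."""
    pairs: list[dict[str, str]] = []
    if ann.scene_elements:
        pairs.append({
            "type": "scene",
            "question": "Which objects are visible in the clip?",
            "answer": ", ".join(ann.scene_elements),
        })
    if ann.spatial_dynamics:
        pairs.append({
            "type": "dynamics",
            "question": "How do the objects move during the clip?",
            "answer": ann.spatial_dynamics,
        })
    if ann.action:
        pairs.append({
            "type": "action",
            "question": "What action does the person perform?",
            "answer": ann.action,
        })
    for nearer, farther in ann.depth_relations:
        pairs.append({
            "type": "depth",
            "question": f"Which is closer to the camera, the {nearer} or the {farther}?",
            "answer": f"the {nearer}",
        })
    return pairs


# Made-up example record, for illustration only.
example = ClipAnnotation(
    scene_elements=["cup", "kettle", "table"],
    spatial_dynamics="the kettle tilts toward the cup",
    action="pour water from the kettle into the cup",
    depth_relations=[("cup", "kettle")],
)
qa_pairs = build_qa_pairs(example)
for pair in qa_pairs:
    print(pair["type"], "|", pair["question"], "->", pair["answer"])
```

Each annotation type maps to its own question template, so the resulting QA set covers all four kinds of physical information the abstract lists. A real pipeline would presumably use richer, model-generated questions rather than fixed templates.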

Original Article


Cached at: 05/18/26, 06:24 AM


Source: https://huggingface.co/papers/2605.15298

Abstract

PhysBrain 1.0 leverages human egocentric video to generate physical commonsense supervision for vision-language-action models, achieving state-of-the-art performance in embodied control tasks through capability-preserving adaptation.

Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.




Similar Articles

tencent/Hy-Embodied-RxBrain-1.0 · Hugging Face

EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video

RxBrain: Embodied Cognition Foundation Model with Joint Language-Visual Reasoning and Imagination

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation
