PhysBrain 1.0 Technical Report

Hugging Face Daily Papers

Summary

PhysBrain 1.0 is a technical report presenting a method that uses human egocentric video to generate physical commonsense supervision for vision-language-action models. It reports state-of-the-art results across multimodal QA and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa.

Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.

Cached at: 05/18/26, 06:24 AM

Source: https://huggingface.co/papers/2605.15298

Abstract

PhysBrain 1.0 leverages human egocentric video to generate physical commonsense supervision for vision-language-action models, achieving state-of-the-art performance in embodied control tasks through capability-preserving adaptation.

Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.
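The abstract does not specify the data engine's schema or prompts, so the following is only a minimal sketch of the general idea: structured annotations for a clip (scene elements, spatial dynamics, action execution, depth-aware relations) are converted into question-answer pairs. The `ClipAnnotation` class, its field names, the `to_qa_pairs` function, and the question templates are all hypothetical and are not taken from the report.

```python
from dataclasses import dataclass, field


@dataclass
class ClipAnnotation:
    """Hypothetical structured annotation for one egocentric video clip."""
    scene_elements: list[str] = field(default_factory=list)
    spatial_dynamics: list[str] = field(default_factory=list)
    action: str = ""
    # Assumed encoding: (nearer_object, farther_object) relative to the camera.
    depth_relations: list[tuple[str, str]] = field(default_factory=list)


def to_qa_pairs(ann: ClipAnnotation) -> list[dict[str, str]]:
    """Turn one annotation into templated QA pairs (illustrative templates only)."""
    qa: list[dict[str, str]] = []
    if ann.scene_elements:
        qa.append({
            "question": "Which objects are visible in the scene?",
            "answer": ", ".join(ann.scene_elements),
        })
    for change in ann.spatial_dynamics:
        qa.append({
            "question": "How does the scene change during the clip?",
            "answer": change,
        })
    if ann.action:
        qa.append({
            "question": "What action is the person performing?",
            "answer": ann.action,
        })
    for near, far in ann.depth_relations:
        qa.append({
            "question": f"Which is closer to the camera, the {near} or the {far}?",
            "answer": f"The {near}.",
        })
    return qa


# Made-up example annotation, not data from the report.
example = ClipAnnotation(
    scene_elements=["cup", "kettle"],
    spatial_dynamics=["the cup moves from the counter to the sink"],
    action="picking up the cup with the right hand",
    depth_relations=[("cup", "kettle")],
)
pairs = to_qa_pairs(example)
for pair in pairs:
    print(pair["question"], "->", pair["answer"])
```

In this sketch each annotation type maps to its own question template, so the resulting supervision can be filtered or reweighted by type. Whether PhysBrain 1.0 uses templates, a generative model, or another mechanism to produce its questions is not stated in the abstract.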
