PhysBrain 1.0 Technical Report
Summary
PhysBrain 1.0 is a technical report presenting a method that uses human egocentric video to generate physical commonsense supervision for vision-language-action models, achieving state-of-the-art results on embodied control benchmarks including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa.
Cached at: 05/18/26, 06:24 AM
Source: https://huggingface.co/papers/2605.15298
Abstract
PhysBrain 1.0 leverages human egocentric video to generate physical commonsense supervision for vision-language-action models, achieving state-of-the-art performance in embodied control tasks through capability-preserving adaptation.
Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.
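The abstract does not specify how the data engine formats its output, so the following is only a minimal sketch of the general idea: taking per-clip annotations in the four categories named above and turning them into question-answer pairs. The `ClipAnnotation` record, its field names, the question templates, and the `to_qa_pairs` helper are all hypothetical and are not taken from the PhysBrain 1.0 report.

```python
from dataclasses import dataclass, field


@dataclass
class ClipAnnotation:
    """Hypothetical per-clip record; field names are illustrative only."""
    scene_elements: list[str]
    spatial_dynamics: str
    action_execution: str
    # Each entry is (object_a, relation, object_b); an assumed encoding.
    depth_relations: list[tuple[str, str, str]] = field(default_factory=list)


def to_qa_pairs(ann: ClipAnnotation) -> list[dict[str, str]]:
    """Turn one annotation into templated question-answer pairs (assumed templates)."""
    pairs = [
        {
            "question": "Which objects are visible in the scene?",
            "answer": ", ".join(ann.scene_elements),
        },
        {
            "question": "How do the objects move relative to each other?",
            "answer": ann.spatial_dynamics,
        },
        {
            "question": "What action is the person performing?",
            "answer": ann.action_execution,
        },
    ]
    # One depth question per annotated relation.
    for obj_a, relation, obj_b in ann.depth_relations:
        pairs.append(
            {
                "question": f"How is the {obj_a} positioned in depth relative to the {obj_b}?",
                "answer": f"The {obj_a} is {relation} the {obj_b}.",
            }
        )
    return pairs


# Made-up example annotation, for illustration only.
example = ClipAnnotation(
    scene_elements=["cup", "kettle", "left hand"],
    spatial_dynamics="The left hand moves the cup toward the kettle.",
    action_execution="Picking up the cup with the left hand.",
    depth_relations=[("cup", "closer to the camera than", "kettle")],
)
qa = to_qa_pairs(example)
for pair in qa:
    print(pair["question"], "->", pair["answer"])
```

Fixed templates keep the sketch short; a real pipeline would presumably vary question phrasing and verify answers against the video, which the abstract does not describe.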
Similar Articles
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
The paper introduces SeePhys Pro, a benchmark to diagnose modality transfer issues in multimodal RL for physics reasoning, revealing that models struggle with representation-invariant reasoning and often rely on residual textual cues rather than visual evidence.
RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes
RoboStressBench proposes a benchmark for evaluating vision-language model robustness to physical visual stresses (material, viewpoint, lighting, geometry) in embodied scenes, identifying stress-specific failure modes.
Physically Viable World Models: A Case for Query-Conditioned Embodied AI
This paper argues that world models for embodied AI must be physically viable and query-conditioned, focusing on identifying the simplest physical abstraction for each intervention query rather than merely predicting observations.
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
PhyMotion proposes a physics-grounded reward system that evaluates kinematic plausibility, contact consistency, and dynamic feasibility of human motion in generated videos, achieving stronger correlation with human judgment and improving motion realism in RL-based post-training.
ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
Introduces ESI-Bench, a comprehensive benchmark for embodied spatial intelligence built on OmniGibson, covering 10 task categories and 29 subcategories. Experiments show active exploration substantially outperforms passive approaches, with failures mainly due to action blindness rather than perception, revealing a metacognitive gap in models compared to humans.