HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining
Summary
This paper finds that egocentric human video, when processed with a filtering and labeling pipeline, can outperform teleoperated real-robot data for pretraining embodied foundation models, achieving lower validation loss and higher success rates on real-robot tasks.
Cached at: 06/20/26, 02:28 PM
Source: https://huggingface.co/papers/2606.20521 Published on Jun 18
Submitted by yfdeng (https://huggingface.co/yfdeng10) on Jun 19
Abstract
Egocentric human video can effectively replace teleoperated robot trajectories for embodied model pretraining, achieving better performance with reduced data collection costs.
Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.
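The abstract reports its gains as percentages ("24% lower", "52.5% and 90% higher") but, as quoted here, does not say whether the success-rate figures are relative changes or percentage-point differences. The sketch below shows how the two readings differ. Every number in it is hypothetical and chosen only for illustration, not taken from the paper, and the function names `relative_change` and `absolute_change_points` are assumptions introduced for this example.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Return (candidate - baseline) / baseline as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


def absolute_change_points(baseline_rate: float, candidate_rate: float) -> float:
    """Return the difference between two rates in [0, 1], in percentage points."""
    return (candidate_rate - baseline_rate) * 100.0


if __name__ == "__main__":
    # Hypothetical validation losses (NOT from the paper).
    robot_loss, ego_loss = 0.50, 0.38
    print(f"loss change: {relative_change(robot_loss, ego_loss):.1%}")

    # Hypothetical success rates (NOT from the paper).
    robot_sr, ego_sr = 0.40, 0.61
    print(f"relative gain: {relative_change(robot_sr, ego_sr):.1%}")
    print(f"gain in points: {absolute_change_points(robot_sr, ego_sr):.1f}")
```

With these made-up rates, the same pair of results reads as a large relative gain but a much smaller gain in percentage points, so it is worth checking the full paper for which convention the reported figures use.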
Datasets citing this paper: 1
cy0307/awesome-egocentric-atlas
Similar Articles
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining
ACE-Ego-0 is a unified Vision-Language-Action pretraining framework that leverages egocentric human videos and robot trajectories via a reliability-aware training objective, achieving state-of-the-art results on embodied AI benchmarks.
HumanNet: Scaling Human-centric Video Learning to One Million Hours
HumanNet is a large-scale human-centric video dataset with one million hours of annotated footage, designed to train vision-language-action models. It demonstrates that egocentric human video can effectively replace robot data for embodied intelligence tasks.
ActiveMimic: Egocentric Video Pretraining with Active Perception
ActiveMimic is a pretraining framework that recovers camera and wrist trajectories from egocentric human video to model active perception as a viewpoint action, enabling robot pretraining that matches the performance of models trained directly on robot data.
EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video
EgoPhys introduces a framework to construct deformable physical digital twins from egocentric RGB video using generalizable priors and a compact codebook, enabling zero-shot generalization to unseen objects without per-spring optimization. The system is demonstrated on a real robot, showing that egocentric human play video can serve as an internal world representation for deformable-object planning.
Human Universal Grasping
A flow-matching model generates diverse human grasps from RGB-D images, enabling zero-shot robotic grasping with improved performance over existing methods. The model, trained on a large egocentric dataset, significantly outperforms state-of-the-art baselines on a new benchmark.