HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Hugging Face Daily Papers 06/18/26, 12:00 AM Papers

embodied-ai pretraining egocentric-video robot-learning data-scaling foundation-models

Summary

This paper finds that egocentric human video, when processed with a filtering and labeling pipeline, can outperform teleoperated real-robot data for pretraining embodied foundation models, achieving lower validation loss and higher success rates on real-robot tasks.

Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.

Original Article

Cached at: 06/20/26, 02:28 PM

Source: https://huggingface.co/papers/2606.20521 Published on Jun 18

Submitted by yfdeng (https://huggingface.co/yfdeng10) on Jun 19

Abstract

Egocentric human video can effectively replace teleoperated robot trajectories for embodied model pretraining, achieving better performance with reduced data collection costs.

Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.
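The abstract reports its gains as percentages ("24% lower", "52.5% and 90% higher") without saying whether the success-rate figures are relative changes or percentage-point differences. The sketch below only illustrates how the two readings differ. The function names and all numbers are hypothetical and are not taken from the paper.

```python
def relative_change(baseline: float, new: float) -> float:
    """Return (new - baseline) / baseline, e.g. -0.24 for a 24% reduction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


def point_change(baseline: float, new: float) -> float:
    """Return new - baseline (percentage points when both inputs are in percent)."""
    return new - baseline


# Hypothetical values for illustration only, NOT results from the paper.
robot_loss, ego_loss = 0.50, 0.38          # validation loss after each pretraining source
robot_success, ego_success = 40.0, 61.0    # task success rate, in percent

print(f"loss, relative change:    {relative_change(robot_loss, ego_loss):+.1%}")
print(f"success, relative change: {relative_change(robot_success, ego_success):+.1%}")
print(f"success, point change:    {point_change(robot_success, ego_success):+.1f} points")
```

With these made-up inputs the same success-rate pair reads as a 52.5% relative gain but only a 21-point absolute gain, which is why the definition matters when comparing the headline numbers across papers.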

Datasets citing this paper: 1

#### cy0307/awesome-egocentric-atlas Updated about 2 hours ago • 335 • 1

Similar Articles

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

HumanNet: Scaling Human-centric Video Learning to One Million Hours

ActiveMimic: Egocentric Video Pretraining with Active Perception

EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video

Human Universal Grasping
