HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Hugging Face Daily Papers

Summary

This paper finds that egocentric human video, when processed with a filtering and labeling pipeline, can outperform teleoperated real-robot data for pretraining embodied foundation models, achieving lower validation loss and higher success rates on real-robot tasks.

Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.
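The abstract reports its gains as percentages ("24% lower validation loss", "52.5% and 90% higher success rates") without stating here whether the success-rate figures are relative improvements or absolute percentage-point differences. The following minimal sketch shows how each reading would be computed. The function names and all input numbers are hypothetical placeholders chosen for illustration; they are not values from the paper.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` versus `baseline` (0.24 means 24% lower)."""
    return (baseline - new) / baseline


def relative_gain(baseline: float, new: float) -> float:
    """Fractional increase of `new` versus `baseline` (0.525 means 52.5% higher)."""
    return (new - baseline) / baseline


def absolute_gain_points(baseline: float, new: float) -> float:
    """Difference between two rates in percentage points (rates given in [0, 1])."""
    return (new - baseline) * 100.0


# Hypothetical placeholder values, NOT results from the paper.
robot_loss, ego_loss = 0.50, 0.38          # validation losses
robot_success, ego_success = 0.40, 0.61    # task success rates

print(f"loss reduction:   {relative_reduction(robot_loss, ego_loss):.1%}")
print(f"relative gain:    {relative_gain(robot_success, ego_success):.1%}")
print(f"absolute gain:    {absolute_gain_points(robot_success, ego_success):.1f} points")
```

With these placeholder inputs, the same pair of success rates is a 52.5% relative gain but only a 21-point absolute gain, which is why the reading matters when comparing the reported numbers against other work.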

Source: https://huggingface.co/papers/2606.20521

Published on Jun 18. Submitted by yfdeng (https://huggingface.co/yfdeng10) on Jun 19.


Abstract

Egocentric human video can effectively replace teleoperated robot trajectories for embodied model pretraining, achieving better performance with reduced data collection costs.



Similar Articles

HumanNet: Scaling Human-centric Video Learning to One Million Hours

Hugging Face Daily Papers

HumanNet is a large-scale human-centric video dataset with one million hours of annotated footage, designed to train vision-language-action models. It demonstrates that egocentric human video can effectively replace robot data for embodied intelligence tasks.

ActiveMimic: Egocentric Video Pretraining with Active Perception

Hugging Face Daily Papers

ActiveMimic is a pretraining framework that recovers camera and wrist trajectories from egocentric human video to model active perception as a viewpoint action, enabling robot pretraining that matches the performance of models trained directly on robot data.

EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video

Hugging Face Daily Papers

EgoPhys introduces a framework to construct deformable physical digital twins from egocentric RGB video using generalizable priors and a compact codebook, enabling zero-shot generalization to unseen objects without per-spring optimization. The system is demonstrated on a real robot, showing that egocentric human play video can serve as internal world representation for deformable-object planning.

Human Universal Grasping

Hugging Face Daily Papers

A flow-matching model generates diverse human grasps from RGB-D images, enabling zero-shot robotic grasping with improved performance over existing methods. The model, trained on a large egocentric dataset, significantly outperforms state-of-the-art baselines on a new benchmark.