ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining
Summary
ACE-EGO-0 is a unified Vision-Language-Action pretraining framework that combines egocentric human videos and robot trajectories through a reliability-aware training objective, achieving state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0.
Source: https://huggingface.co/papers/2606.17200
Abstract
A unified Vision-Language-Action pretraining framework leverages heterogeneous data sources including human egocentric videos and robot trajectories through a reliability-aware training approach that improves performance on embodied AI tasks.
Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
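The abstract does not give the form of the reliability-aware objective, so the following is only a minimal sketch of one way such weighting could work, not the paper's method. It assumes each human pseudo-action sample carries a scalar reliability score in [0, 1], drops samples below a threshold, and adds the weighted human loss to the robot loss as an auxiliary term. The function names, the `min_reliability` threshold, and the `aux_weight` coefficient are all hypothetical.

```python
from typing import Sequence


def reliability_weighted_loss(
    per_sample_losses: Sequence[float],
    reliability: Sequence[float],
    min_reliability: float = 0.5,  # hypothetical cutoff, not from the paper
) -> float:
    """Average per-sample losses, weighting each by its reliability score.

    Samples scoring below `min_reliability` receive zero weight, so the
    supervision concentrates on the pseudo-labels judged trustworthy.
    """
    if len(per_sample_losses) != len(reliability):
        raise ValueError("losses and reliability must have the same length")
    weights = [r if r >= min_reliability else 0.0 for r in reliability]
    total_weight = sum(weights)
    if total_weight == 0.0:
        # No reliable human sample in this batch: contribute nothing.
        return 0.0
    weighted_sum = sum(w * l for w, l in zip(weights, per_sample_losses))
    return weighted_sum / total_weight


def joint_loss(
    robot_losses: Sequence[float],
    human_losses: Sequence[float],
    human_reliability: Sequence[float],
    aux_weight: float = 0.5,  # hypothetical auxiliary-loss coefficient
) -> float:
    """Robot loss plus a reliability-weighted human auxiliary loss."""
    if not robot_losses:
        raise ValueError("robot_losses must not be empty")
    robot_term = sum(robot_losses) / len(robot_losses)
    human_term = reliability_weighted_loss(human_losses, human_reliability)
    return robot_term + aux_weight * human_term


if __name__ == "__main__":
    # Toy batch: the second human sample is unreliable and gets ignored.
    loss = joint_loss(
        robot_losses=[0.2, 0.4],
        human_losses=[0.1, 0.9, 0.3],
        human_reliability=[1.0, 0.2, 0.5],
    )
    print(f"joint loss: {loss:.4f}")
```

In this toy batch the high-loss human sample with reliability 0.2 is excluded, so a noisy pseudo-label cannot dominate the auxiliary term. A real implementation would operate on tensors and could use soft rather than thresholded weights.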
Similar Articles
ActiveMimic: Egocentric Video Pretraining with Active Perception
ActiveMimic is a pretraining framework that recovers camera and wrist trajectories from egocentric human video to model active perception as a viewpoint action, enabling robot pretraining that matches the performance of models trained directly on robot data.
Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack
HyVLA-0.5 is an end-to-end robotic learning system that integrates data collection, model design, pre-training, fine-tuning, and reinforcement learning for real-world deployment.
EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video
EgoPhys introduces a framework to construct deformable physical digital twins from egocentric RGB video using generalizable priors and a compact codebook, enabling zero-shot generalization to unseen objects without per-spring optimization. The system is demonstrated on a real robot, showing that egocentric human play video can serve as internal world representation for deformable-object planning.
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
Proposes AR-VLA, an autoregressive action expert that generates continuous action sequences with long-term memory for context-aware robotic policy training, improving trajectory smoothness and task success rates over reactive VLA models.
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
MobileEgo Anywhere is a mobile-based framework for collecting long-duration egocentric robot data using smartphone sensors, enabling large-scale training of vision-language-action models by lowering hardware barriers.