ActiveMimic: Egocentric Video Pretraining with Active Perception

Hugging Face Daily Papers

Summary

ActiveMimic is a pretraining framework that recovers camera and wrist trajectories from egocentric human video to model active perception as a viewpoint action, enabling robot pretraining that matches the performance of models trained directly on robot data.

Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.
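
The abstract describes treating camera motion as a "viewpoint action" learned alongside wrist motion. The sketch below illustrates one plausible way to derive such labels from recovered trajectories; it is not the authors' implementation. It assumes camera poses are given as 4x4 camera-to-world rigid transforms, that the viewpoint action is the relative transform between consecutive frames, and that wrist positions are re-expressed in the camera frame. The abstract does not state the parameterization, and all function names here (`viewpoint_action`, `wrist_in_camera`, `yaw_pose`) are hypothetical.

```python
import math


def mat_mul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def invert_rigid(T):
    """Invert a 4x4 rigid transform [R t; 0 1] as [R^T, -R^T t; 0 1]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def viewpoint_action(T_world_cam_t, T_world_cam_next):
    """Assumed label: camera motion from frame t to t+1, in frame t's coordinates."""
    return mat_mul(invert_rigid(T_world_cam_t), T_world_cam_next)


def wrist_in_camera(T_world_cam, wrist_world):
    """Express a world-frame wrist position in the camera frame."""
    p = [[wrist_world[0]], [wrist_world[1]], [wrist_world[2]], [1.0]]
    q = mat_mul(invert_rigid(T_world_cam), p)
    return [q[0][0], q[1][0], q[2][0]]


def yaw_pose(yaw, tx, ty, tz):
    """Helper for the example: camera-to-world pose with rotation about z only."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


# Two consecutive (made-up) camera poses and one wrist position.
T0 = yaw_pose(0.0, 0.0, 0.0, 0.0)
T1 = yaw_pose(0.1, 0.05, 0.0, 0.0)
action = viewpoint_action(T0, T1)          # 4x4 relative camera motion
wrist = wrist_in_camera(T1, [0.5, 0.0, 0.0])  # wrist seen from the new viewpoint
```

Expressing both signals relative to the current camera frame keeps the labels independent of where the recording started, which is one reason a relative parameterization is a common choice; whether ActiveMimic uses it is not stated in the abstract.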

Source: https://huggingface.co/papers/2606.06194 (published on Jun 4)

Submitted by Leo (https://huggingface.co/leolin9248) on Jun 15


