ActiveMimic: Egocentric Video Pretraining with Active Perception
Summary
ActiveMimic is a pretraining framework that recovers camera and wrist trajectories from egocentric human video to model active perception as a viewpoint action, enabling robot pretraining that matches the performance of models trained directly on robot data.
Source: https://huggingface.co/papers/2606.06194 Published on Jun 4
Submitted by Leo (https://huggingface.co/leolin9248) on Jun 15
Abstract
The ActiveMimic pretraining framework recovers camera and wrist trajectories from egocentric video, enabling active perception learning that matches the performance of models pretrained on robot data.
Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.
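The abstract does not say how the viewpoint action is parameterized, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes camera poses are available as 4x4 world-from-camera homogeneous transforms and wrist positions as 3D points in the world frame. The function names `viewpoint_actions` and `wrist_in_camera` are hypothetical. Under those assumptions, one way to treat camera motion as an action is to take the relative transform between consecutive camera poses, and to express the wrist trajectory in each frame's camera coordinates so both signals share a reference frame.

```python
import numpy as np


def viewpoint_actions(cam_poses):
    """Relative camera motion between consecutive frames (hypothetical helper).

    cam_poses: (T, 4, 4) world-from-camera homogeneous transforms (assumed input).
    Returns (T-1, 4, 4); entry t is inv(pose_t) @ pose_{t+1}, i.e. the pose of
    camera t+1 expressed in the coordinates of camera t.
    """
    cam_poses = np.asarray(cam_poses, dtype=float)
    # np.linalg.inv and @ both operate on stacks of matrices.
    return np.linalg.inv(cam_poses[:-1]) @ cam_poses[1:]


def wrist_in_camera(cam_poses, wrist_world):
    """Express world-frame wrist positions in each frame's camera coordinates.

    cam_poses: (T, 4, 4) world-from-camera transforms (assumed input).
    wrist_world: (T, 3) wrist positions in the world frame (assumed input).
    Returns (T, 3) wrist positions in the camera frame at each time step.
    """
    cam_poses = np.asarray(cam_poses, dtype=float)
    wrist_world = np.asarray(wrist_world, dtype=float)
    ones = np.ones((len(wrist_world), 1))
    wrist_h = np.concatenate([wrist_world, ones], axis=1)  # homogeneous, (T, 4)
    cam_from_world = np.linalg.inv(cam_poses)               # (T, 4, 4)
    return np.einsum("tij,tj->ti", cam_from_world, wrist_h)[:, :3]


if __name__ == "__main__":
    # Toy trajectory: the camera slides 0.1 units along x per frame, no rotation,
    # while the wrist stays fixed in the world.
    poses = np.tile(np.eye(4), (3, 1, 1))
    poses[:, 0, 3] = [0.0, 0.1, 0.2]
    wrist = np.tile(np.array([0.5, 0.0, 1.0]), (3, 1))

    print(viewpoint_actions(poses)[:, :3, 3])  # per-step camera translation
    print(wrist_in_camera(poses, wrist))       # wrist as seen from the camera
```

In this toy case the wrist is stationary in the world but appears to move in the camera frame, which is the kind of viewpoint-induced motion the abstract says standard pipelines treat as noise. A relative, camera-centric parameterization is only one possible design choice; the paper may use a different one.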
Similar Articles
EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video
EgoPhys introduces a framework to construct deformable physical digital twins from egocentric RGB video using generalizable priors and a compact codebook, enabling zero-shot generalization to unseen objects without per-spring optimization. The system is demonstrated on a real robot, showing that egocentric human play video can serve as an internal world representation for deformable-object planning.
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI introduces a framework that turns text-conditioned synthetic videos into physically plausible dexterous robot control via a hybrid 3D-2D tracking reward, enabling zero-shot generalization to unseen objects.
DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation
DynaFLIP is a dynamics-aware multimodal pre-training framework that integrates motion understanding into visual perception for robot manipulation. It uses image-language-3D flow triplets and geometric regularization to improve representation learning, achieving significant gains in out-of-distribution scenarios.
Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming
Co-GLANCE is a real-time onboard perception and decision-making system for heterogeneous robot teams that distills vision-language model capabilities into efficient models and uses conformal prediction with selective abstention to quantify and resolve perceptual uncertainty, outperforming cloud-based VLM baselines by 25-36% while achieving 350x lower latency.
EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera
EgoForce is a monocular 3D hand reconstruction framework that uses a unified network with differentiable forearm representation, arm-hand transformers, and ray space solvers to recover absolute hand pose and position across different camera models, achieving state-of-the-art accuracy on egocentric benchmarks.