EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video
Summary
EgoPhys introduces a framework that constructs deformable physical digital twins from egocentric RGB video using generalizable priors and a compact codebook, enabling zero-shot generalization to unseen objects without per-spring optimization. The system is demonstrated on a real robot, showing that an egocentric human play video can serve as an internal world representation for deformable-object planning.
Cached at: 06/16/26, 03:32 PM
Source: https://huggingface.co/papers/2606.16202
Abstract
EgoPhys enables deformable digital twin generation from egocentric RGB video by using generalizable priors and compact codebooks to predict dense spring stiffness fields without per-spring optimization.
Humans naturally understand object physics through everyday interactions, but faithfully predicting complex deformable dynamics, such as elastic materials and fabrics, remains a major challenge for computer vision and robotics. We present EgoPhys, a framework that constructs deformable physical digital twins from egocentric RGB-only video using generalizable priors. EgoPhys overcomes the limitations of existing methods to enable controllable deformable digital twin generation from egocentric videos by distilling per-object inverse-physics solutions into a compact codebook, enabling prediction of dense spring stiffness fields for unseen objects without per-spring test-time optimization. Trained with generalizable priors from diverse egocentric interactions, EgoPhys outperforms baselines in reconstruction, future prediction, and zero-shot generalization. To support training and evaluation, we curate an egocentric interaction dataset covering diverse deformable objects, scenes, and manipulation styles. We deploy EgoPhys on a real xArm6 robot, demonstrating that a digital twin initialized from a single egocentric human play video can serve as an internal world representation to aid in deformable-object planning, highlighting egocentric RGB observations as a scalable path toward real-to-sim pipelines.
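The abstract does not spell out how the codebook is built or queried, so the following is only a toy sketch of the general idea: each spring in a mass-spring model takes its stiffness from a small codebook rather than from a per-spring optimization, and the resulting stiffness field drives the simulation. The nearest-key lookup, the per-spring feature vectors, and every name here (`stiffness_from_codebook`, `code_keys`, `spring_forces`, `step`) are assumptions made for illustration, not the paper's method or API.

```python
import numpy as np

def stiffness_from_codebook(codebook, code_keys, features):
    """Give each spring the stiffness of its nearest codebook entry.

    codebook:  (K,) stiffness values (hypothetical, e.g. distilled offline)
    code_keys: (K, D) one feature key per codebook entry (hypothetical)
    features:  (S, D) one feature vector per spring (hypothetical)
    """
    # Squared distance from every spring feature to every codebook key: (S, K)
    d2 = ((features[:, None, :] - code_keys[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

def spring_forces(pos, springs, rest_len, k):
    """Hooke's-law forces on particles, with one stiffness value per spring."""
    i, j = springs[:, 0], springs[:, 1]
    delta = pos[j] - pos[i]
    length = np.linalg.norm(delta, axis=1)
    direction = delta / np.maximum(length, 1e-9)[:, None]
    # Force on particle i points toward j when the spring is stretched.
    f = (k * (length - rest_len))[:, None] * direction
    forces = np.zeros_like(pos)
    np.add.at(forces, i, f)    # accumulate; a particle may belong to many springs
    np.add.at(forces, j, -f)   # equal and opposite force on the other endpoint
    return forces

def step(pos, vel, springs, rest_len, k, mass=1.0, dt=1e-3):
    """One semi-implicit Euler step: update velocity first, then position."""
    acc = spring_forces(pos, springs, rest_len, k) / mass
    vel = vel + dt * acc
    pos = pos + dt * vel
    return pos, vel

# Toy usage: three particles on a line, two springs, a two-entry codebook.
codebook = np.array([5.0, 50.0])            # "soft" and "stiff" entries
code_keys = np.array([[0.0], [1.0]])
features = np.array([[0.1], [0.9]])         # spring 0 looks soft, spring 1 stiff
k, idx = stiffness_from_codebook(codebook, code_keys, features)

pos = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
vel = np.zeros_like(pos)
springs = np.array([[0, 1], [1, 2]])
rest_len = np.array([1.0, 1.0])
new_pos, new_vel = step(pos, vel, springs, rest_len, k)
```

In this sketch the lookup is a single nearest-neighbor query per spring, which is the sense in which no per-spring optimization happens at test time; a learned predictor would replace the hand-written `features` and `code_keys`.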
Similar Articles
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI introduces a framework that turns text-conditioned synthetic videos into physically plausible dexterous robot control via a hybrid 3D-2D tracking reward, enabling zero-shot generalization to unseen objects.
EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera
EgoForce is a monocular 3D hand reconstruction framework that uses a unified network with differentiable forearm representation, arm-hand transformers, and ray space solvers to recover absolute hand pose and position across different camera models, achieving state-of-the-art accuracy on egocentric benchmarks.
PhysBrain 1.0 Technical Report
PhysBrain 1.0 is a technical report presenting a method that uses human egocentric video to generate physical commonsense supervision for vision-language-action models, achieving state-of-the-art results on embodied control benchmarks including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa.
ActiveMimic: Egocentric Video Pretraining with Active Perception
ActiveMimic is a pretraining framework that recovers camera and wrist trajectories from egocentric human video to model active perception as a viewpoint action, enabling robot pretraining that matches the performance of models trained directly on robot data.
Human Universal Grasping
A flow-matching model generates diverse human grasps from RGB-D images, enabling zero-shot robotic grasping with improved performance over existing methods. The model, trained on a large egocentric dataset, significantly outperforms state-of-the-art baselines on a new benchmark.