DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation
Summary
DynaFLIP is a dynamics-aware multimodal pre-training framework that integrates motion understanding into visual perception for robot manipulation. It uses image-language-3D flow triplets and geometric regularization to improve representation learning, achieving significant gains in out-of-distribution scenarios.
Cached at: 05/29/26, 07:03 PM
Source: https://huggingface.co/papers/2605.30350
Published on May 28
Submitted by AK (https://huggingface.co/akhaliq) on May 29
Abstract
DynaFLIP is a dynamics-aware multimodal pre-training framework that enhances robot manipulation by integrating motion understanding into visual perception through image-language-3D flow triplets and geometric regularization techniques.
Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space, with a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
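The abstract does not give the exact loss, so the following is only a minimal sketch of why volume minimization alone can be ambiguous. It assumes the volume is measured through the Gram determinant of the three unit-normalized embeddings (the simplex formed with the origin differs from this parallelotope volume only by a constant factor of 1/6), and it assumes the cosine regularizer sums `1 - cos` over the three modality pairs. The function names, the weight `lam`, and the toy vectors are all hypothetical, and the contrastive objective is omitted.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pairwise_cosines(img, txt, flow):
    """Cosine similarities between the three modality embeddings."""
    a, b, c = normalize(img), normalize(txt), normalize(flow)
    return dot(a, b), dot(a, c), dot(b, c)

def parallelotope_volume(img, txt, flow):
    """sqrt(det G), where G is the 3x3 Gram matrix of the unit vectors.

    Assumed volume measure; the paper's exact definition may differ.
    """
    ab, ac, bc = pairwise_cosines(img, txt, flow)
    # Closed-form determinant of [[1, ab, ac], [ab, 1, bc], [ac, bc, 1]].
    det = 1.0 + 2.0 * ab * ac * bc - ab * ab - ac * ac - bc * bc
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(det, 0.0))

def cosine_penalty(img, txt, flow):
    """Assumed regularizer: zero only when all three directions coincide."""
    ab, ac, bc = pairwise_cosines(img, txt, flow)
    return (1.0 - ab) + (1.0 - ac) + (1.0 - bc)

def combined_loss(img, txt, flow, lam=0.5):
    """Volume term plus weighted cosine term (lam is a hypothetical weight)."""
    return parallelotope_volume(img, txt, flow) + lam * cosine_penalty(img, txt, flow)

if __name__ == "__main__":
    cases = {
        "aligned": ([1, 2, 3], [2, 4, 6], [0.5, 1, 1.5]),
        "coplanar, not aligned": ([1, 0, 0], [0, 1, 0], [1, 1, 0]),
        "orthogonal": ([1, 0, 0], [0, 1, 0], [0, 0, 1]),
    }
    for name, (i, t, f) in cases.items():
        print(name, parallelotope_volume(i, t, f), cosine_penalty(i, t, f))
```

In this sketch the coplanar triplet has essentially zero volume even though its embeddings point in different directions, which is one plausible reading of the "geometric ambiguity" the abstract mentions. The cosine term separates that case from the truly aligned one.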
Similar Articles
D4RT: Teaching AI to see the world in four dimensions
DeepMind introduces D4RT, a unified AI model for dynamic 4D scene reconstruction and tracking that is up to 300x more efficient than previous methods. The model uses a query-based Transformer architecture to solve complex spatial and temporal tasks for robotics and AR applications.
Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming
Co-GLANCE is a real-time onboard perception and decision-making system for heterogeneous robot teams that distills vision-language model capabilities into efficient models and uses conformal prediction with selective abstention to quantify and resolve perceptual uncertainty, outperforming cloud-based VLM baselines by 25-36% while achieving 350x lower latency.
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI introduces a framework that turns text-conditioned synthetic videos into physically plausible dexterous robot control via a hybrid 3D-2D tracking reward, enabling zero-shot generalization to unseen objects.
ActiveMimic: Egocentric Video Pretraining with Active Perception
ActiveMimic is a pretraining framework that recovers camera and wrist trajectories from egocentric human video to model active perception as a viewpoint action, enabling robot pretraining that matches the performance of models trained directly on robot data.
@AlexiGlad: Progress in AI is driven by approaches that make weaker assumptions, which allows for better scaling But representation…
Introduces Temporal Difference in Vision (TDV), a new paradigm for representation learning that relies solely on causality, eliminating the need for augmentations, masking, or cropping, and matches state-of-the-art methods like DINO and iBOT on dense spatial tasks.