DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

robotics perception multimodal pre-training representation-learning manipulation dynamics-aware

Summary

DynaFLIP is a dynamics-aware multimodal pre-training framework that integrates motion understanding into visual perception for robot manipulation. It uses image-language-3D flow triplets and geometric regularization to improve representation learning, achieving significant gains in out-of-distribution scenarios.

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space -- a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
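The abstract does not give the loss formulas, so the following is only a minimal sketch of how the three named ingredients could fit together, not the authors' implementation. It assumes that the "simplex volume" is the volume of the simplex with vertices at the origin and the three unit-norm embeddings, computed from the Gram determinant; that the cosine regularizer is one minus the mean pairwise cosine; and that the contrastive objective is InfoNCE anchored on the image embedding. All function names, the weights `w_vol`, `w_cos`, `w_con`, and the temperature are hypothetical.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / max(n, 1e-12) for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def simplex_volume(img, txt, flow):
    """Volume of the simplex with vertices {origin, img, txt, flow}.

    ASSUMPTION: volume = sqrt(det(Gram)) / 3!; the paper may define it differently.
    """
    aa, bb, cc = dot(img, img), dot(txt, txt), dot(flow, flow)
    ab, ac, bc = dot(img, txt), dot(img, flow), dot(txt, flow)
    # Determinant of the symmetric 3x3 Gram matrix, expanded along the first row.
    det = aa * (bb * cc - bc * bc) - ab * (ab * cc - bc * ac) + ac * (ab * bc - bb * ac)
    return math.sqrt(max(det, 0.0)) / 6.0  # clamp tiny negatives from rounding

def cosine_regularizer(img, txt, flow):
    """One minus the mean pairwise cosine (hypothetical form; expects unit-norm inputs)."""
    return 1.0 - (dot(img, txt) + dot(img, flow) + dot(txt, flow)) / 3.0

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE: anchors[i] should match positives[i] against the rest of the batch."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        lse = m + math.log(sum(math.exp(l - m) for l in logits))
        total += lse - logits[i]
    return total / len(anchors)

def combined_loss(imgs, txts, flows, w_vol=1.0, w_cos=1.0, w_con=1.0):
    """Weighted sum of the three terms over a batch of triplets (weights are hypothetical)."""
    imgs = [normalize(v) for v in imgs]
    txts = [normalize(v) for v in txts]
    flows = [normalize(v) for v in flows]
    n = len(imgs)
    vol = sum(simplex_volume(i, t, f) for i, t, f in zip(imgs, txts, flows)) / n
    cos = sum(cosine_regularizer(i, t, f) for i, t, f in zip(imgs, txts, flows)) / n
    con = 0.5 * (info_nce(imgs, txts) + info_nce(imgs, flows))
    return w_vol * vol + w_cos * cos + w_con * con
```

Under this assumed definition, the volume is zero for a perfectly aligned triplet, but it is also zero for any linearly dependent triplet, for example when the text embedding is the exact negative of the image embedding. That is one plausible reading of the "geometric ambiguity" the abstract mentions: the cosine term separates the two cases (0 for the aligned triplet, 4/3 for the antipodal one with an orthogonal third vector), and the contrastive term penalizes embeddings that fail to distinguish one sample's triplet from another's.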


Source: https://huggingface.co/papers/2605.30350 Published on May 28

Submitted by AK (https://huggingface.co/akhaliq) on May 29

Abstract

DynaFLIP is a dynamics-aware multimodal pre-training framework that enhances robot manipulation by integrating motion understanding into visual perception through image-language-3D flow triplets and geometric regularization techniques.



Get this paper in your agent:

hf papers read 2605.30350

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

D4RT: Teaching AI to see the world in four dimensions

Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming

DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

ActiveMimic: Egocentric Video Pretraining with Active Perception

@AlexiGlad: Progress in AI is driven by approaches that make weaker assumptions, which allows for better scaling But representation…
