Tag
A flow-matching model generates diverse human grasps from RGB-D images, enabling zero-shot robotic grasping. Trained on a large egocentric dataset, the model significantly outperforms state-of-the-art baselines on a new benchmark.
This paper introduces Geometric Primary Structure (GPS), a new representation for perceiving articulated parts in robot manipulation. GPS enables efficient VR-based annotation and achieves a 73% success rate without fine-tuning.
AFUN is an affordance foundation model that predicts functional masks and 3D motion curves from RGB-D observations and language descriptions, enabling generalizable robot manipulation across diverse environments. It outperforms baselines on multiple benchmarks and can be deployed on real-world tasks without fine-tuning.
This paper proposes COVER, a training-free method for converting 3D assets into sparse panoramic RGB-D-pose data with complete scene coverage and low redundancy, and introduces the CM-EVS dataset containing 36,373 curated frames from indoor and outdoor scenes.