Revisiting Articulated Parts Perception in Robot Manipulation
Summary
This paper introduces Geometric Primary Structure (GPS), a new representation for articulated parts perception in robot manipulation, enabling efficient VR-based annotation and achieving a 73% success rate without fine-tuning.
View Cached Full Text
Cached at: 06/12/26, 10:52 AM
Paper page - Revisiting Articulated Parts Perception in Robot Manipulation
Source: https://huggingface.co/papers/2606.08103
Abstract
A new geometric representation called Geometric Primary Structure (GPS) is introduced for articulated parts perception, enabling efficient data collection through VR annotation and achieving high manipulation success rates without fine-tuning.
We are surrounded by various objects with movable, articulated parts, e.g., box, handle, door. An accurate and generalizable perception of articulated parts is essential to enhance robotic manipulation capabilities. Building on this need, recent efforts inarticulated parts perceptionhave followed two main directions: One line of work usespose-based representation, which requires high manual cost; in parallel,affordance-based methodsextract future object motion from point tracking without additional manual efforts, but suffer from low-quality data. In this paper, we propose a new representation of articulated parts,Geometric Primary Structure(GPS), an abstraction of the part geometry structure to balance scalability and quality. For efficient and scalable data collection, GPS is integrated with a portableVirtual Reality(VR) device and requires only one minute to annotate one object sequence. This direct human annotation provides higher quality than the estimated affordance. With this efficient VR-GPS system, we collect 41K frames for 234 objects across six part classes, and train a generalizable GPS model with a singleRGB-Dobject image as input. For object manipulation, we deploy aheuristic policybased on GPS prediction. Without any in-domain fine-tuning, our method achieves an 73% success rate, covering 270 initial states for 9 objects. Our code, data and reusable tool are available at https://enlighten0707.github.io/gps.
View arXiv pageView PDFProject pageGitHub2Add to collection
Get this paper in your agent:
hf papers read 2606\.08103
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2606.08103 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2606.08103 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2606.08103 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
This paper introduces PAGER, a topology-aware agent that bridges the semantic-execution gap in point-precise GUI control, achieving 4.1x higher task success than baselines on the new PAGE Bench.
Robots Need More than VLA and World Models
This position paper argues that advancing robot intelligence requires integrating unstructured behavioral data through specialized interfaces for labeling, embodiment mapping, world modeling, and reward inference, rather than relying solely on scaling Vision-Language-Action (VLA) models and world models.
DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation
DynaFLIP is a dynamics-aware multimodal pre-training framework that integrates motion understanding into visual perception for robot manipulation. It uses image-language-3D flow triplets and geometric regularization to improve representation learning, achieving significant gains in out-of-distribution scenarios.
Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming
Co-GLANCE is a real-time onboard perception and decision-making system for heterogeneous robot teams that distills vision-language model capabilities into efficient models and uses conformal prediction with selective abstention to quantify and resolve perceptual uncertainty, outperforming cloud-based VLM baselines by 25-36% while achieving 350x lower latency.
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI introduces a framework that turns text-conditioned synthetic videos into physically plausible dexterous robot control via a hybrid 3D-2D tracking reward, enabling zero-shot generalization to unseen objects.