GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

Hugging Face Daily Papers 06/16/26, 12:00 AM Papers

Summary

GeneralVLA-2 introduces GeoFuse-MV3D for improved 3D reconstruction and a governed KnowledgeBank for better memory management in robotic manipulation tasks, reporting gains on GSO-30, Terminal-Bench 2.0, and SWE-Bench Verified.

Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, while manipulation benefits from stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. Finally, we evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified; GeoFuse-MV3D improves over the MV-SAM3D baseline by reducing CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03%, and KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.
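One way to picture "soft visual-hull support" is as a per-point score: the fraction of calibrated views whose object mask contains the point's projection. The sketch below is only an illustration under that assumption and is not taken from the GeneralVLA-2 code. The names `project` and `soft_hull_support`, the toy cameras, and the 4x4 masks are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # 3x4 camera projection matrix (assumed calibrated)
Mask = Sequence[Sequence[int]]      # mask[v][u]: 1 = object, 0 = background


def project(P: Matrix, point: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Project a 3D point with a 3x4 matrix; return (u, v), or None if w <= 0."""
    x, y, z = point
    h = (x, y, z, 1.0)
    u, v, w = (sum(P[r][c] * h[c] for c in range(4)) for r in range(3))
    if w <= 0:
        return None
    return u / w, v / w


def soft_hull_support(point: Sequence[float],
                      cameras: List[Matrix],
                      masks: List[Mask]) -> float:
    """Fraction of views whose mask contains the projection of `point`.

    A hard visual hull would keep the point only if this value is 1.0;
    a soft variant can use the value as a weight instead.
    """
    hits = 0
    for P, mask in zip(cameras, masks):
        uv = project(P, point)
        if uv is None:
            continue  # behind the camera: counts as no support
        u, v = int(round(uv[0])), int(round(uv[1]))
        if 0 <= v < len(mask) and 0 <= u < len(mask[0]) and mask[v][u]:
            hits += 1
    return hits / len(cameras) if cameras else 0.0


# Hypothetical toy setup: two orthographic-style views of a small scene.
CAM_Z = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]  # (u, v) = (x, y)
CAM_X = [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]  # (u, v) = (z, y)
MASK_Z = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
MASK_X = [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]

full = soft_hull_support((1, 1, 3), [CAM_Z, CAM_X], [MASK_Z, MASK_X])
half = soft_hull_support((1, 1, 1), [CAM_Z, CAM_X], [MASK_Z, MASK_X])
none = soft_hull_support((0, 0, 0), [CAM_Z, CAM_X], [MASK_Z, MASK_X])
```

In this toy setup, the point seen inside both masks gets full support, the point seen inside only one mask gets half, and the point outside both gets none. A reconstruction pipeline could use such a score to down-weight geometry that the input views do not corroborate, rather than carving it away outright.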

Original Article

Cached at: 06/22/26, 09:30 AM

Source: https://huggingface.co/papers/2606.17480

Abstract

GeneralVLA-2 addresses limitations in vision-language-action systems by introducing GeoFuse-MV3D for improved 3D reconstruction and an enhanced KnowledgeBank for better memory management in robotic manipulation tasks.

Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, while manipulation benefits from stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. Finally, we evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified; GeoFuse-MV3D improves over the MV-SAM3D baseline by reducing CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03%, and KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.
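To make the idea of a governed memory with precision-oriented retrieval more concrete, here is a minimal sketch. It assumes each memory item carries quality, confidence, lifecycle, verifier, and conflict fields, and that retrieval applies hard filters before ranking by similarity. The field names, lifecycle states, thresholds, and scoring rule are all hypothetical; the abstract does not specify them, and this is not the KnowledgeBank implementation.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Sequence


@dataclass
class MemoryItem:
    text: str
    embedding: Sequence[float]
    quality: float              # assumed range 0..1
    confidence: float           # assumed range 0..1
    lifecycle: str = "active"   # hypothetical states: "candidate", "active", "retired"
    verified: bool = False      # set by some external verifier (assumption)
    conflicts_with: List[str] = field(default_factory=list)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: Sequence[float], bank: List[MemoryItem],
             k: int = 3, min_sim: float = 0.7, min_conf: float = 0.6) -> List[MemoryItem]:
    """Return at most k items, preferring to return fewer over returning weak ones."""
    scored = []
    for item in bank:
        # Governance filters: only active, verified, conflict-free items qualify.
        if item.lifecycle != "active" or not item.verified or item.conflicts_with:
            continue
        if item.confidence < min_conf:
            continue
        sim = cosine(query, item.embedding)
        if sim >= min_sim:  # similarity floor trades recall for precision
            scored.append((sim * item.quality, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]


# Hypothetical bank: only "a" and "f" pass every filter for the query below.
BANK = [
    MemoryItem("a", [1.0, 0.0], quality=0.8, confidence=0.9, verified=True),
    MemoryItem("b", [1.0, 0.0], quality=0.9, confidence=0.9, lifecycle="retired", verified=True),
    MemoryItem("c", [1.0, 0.0], quality=0.9, confidence=0.3, verified=True),
    MemoryItem("d", [1.0, 0.0], quality=0.9, confidence=0.9, verified=True, conflicts_with=["a"]),
    MemoryItem("e", [0.0, 1.0], quality=0.9, confidence=0.9, verified=True),
    MemoryItem("f", [0.9, 0.1], quality=1.0, confidence=0.8, verified=True),
]
results = [item.text for item in retrieve([1.0, 0.0], BANK)]
```

The contrast with append-only, similarity-only retrieval is that an item can be semantically close to the query and still be excluded because it is retired, unverified, low-confidence, or in conflict with another entry.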

Similar Articles

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

A better method for planning complex visual tasks

Just open-sourced FastVLA
