GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning
Summary
GeneralVLA-2 introduces GeoFuse-MV3D for improved 3D reconstruction and a governed KnowledgeBank for better memory management in robotic manipulation tasks, achieving performance gains on several benchmarks.
Source: https://huggingface.co/papers/2606.17480
Abstract
GeneralVLA-2 addresses limitations in vision-language-action systems by introducing GeoFuse-MV3D for improved 3D reconstruction and an enhanced KnowledgeBank for better memory management in robotic manipulation tasks.
Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, while manipulation benefits from stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. Finally, we evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified. GeoFuse-MV3D improves over the MV-SAM3D baseline by reducing CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03%. KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.
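To make the idea of a governed memory more concrete, here is a minimal Python sketch of what a memory record with quality, confidence, lifecycle, verifier, and conflict metadata might look like, together with a retrieval step that favors precision by filtering before ranking. This is an illustration only, not the authors' implementation: the names `MemoryItem`, `Lifecycle`, and `retrieve`, the thresholds `min_sim` and `min_conf`, and the scoring rule (similarity times quality) are all assumptions that do not come from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class Lifecycle(Enum):
    # Hypothetical lifecycle states; the paper does not list its own.
    CANDIDATE = "candidate"
    ACTIVE = "active"
    DEPRECATED = "deprecated"


@dataclass
class MemoryItem:
    # Hypothetical record layout for one governed memory entry.
    text: str
    embedding: list[float]
    quality: float        # assumed to lie in [0, 1]
    confidence: float     # assumed to lie in [0, 1]
    lifecycle: Lifecycle = Lifecycle.CANDIDATE
    verified: bool = False
    conflicts_with: set[str] = field(default_factory=set)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(items, query, k=3, min_sim=0.5, min_conf=0.6):
    """Return up to k items that pass every governance filter.

    Filtering happens before ranking, so a weak or unverified item is
    dropped rather than returned just to fill the k slots.
    """
    scored = []
    for item in items:
        if item.lifecycle is not Lifecycle.ACTIVE or not item.verified:
            continue
        if item.conflicts_with:  # skip items with unresolved conflicts
            continue
        if item.confidence < min_conf:
            continue
        sim = cosine(item.embedding, query)
        if sim < min_sim:
            continue
        scored.append((sim * item.quality, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]
```

In this sketch a query may return fewer than `k` items, or none at all. That is the intended trade-off of filtering first: the planner receives less context, but nothing that failed a governance check.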
Similar Articles
EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies
EventVLA introduces a sparse visual evidence memory framework for long-horizon robotic manipulation, achieving an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert for improved robotic manipulation. The system combines a VLM planner for task decomposition and visual grounding with a specialized DiT action expert using cascaded cross-attention, outperforming end-to-end baselines particularly in long-horizon tasks and fine-grained manipulation.
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
A better method for planning complex visual tasks
MIT researchers developed VLMFP, a two-stage generative AI approach that combines vision-language models with formal planning software to achieve a 70% success rate on complex visual planning tasks such as robot navigation, nearly 2.3x the rate of existing baselines. The method automatically translates visual scenarios into planning files that classical solvers can process, enabling effective long-horizon planning in novel environments.
Just open-sourced FastVLA
FastVLA, an open-source Vision-Language-Action model for robotics, now runs at 5 Hz on an L4 GPU.