GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
Summary
GeoStack introduces a geometric framework to compose independently trained domain experts in Vision-Language Models without catastrophic forgetting, achieving constant-time inference and a 10x reduction in geometric error.
Cached at: 05/08/26, 06:28 PM
Source: https://huggingface.co/papers/2605.06477
Abstract
GeoStack is a modular framework that composes domain experts in Vision-Language Models while preserving foundational knowledge and enabling constant-time inference through geometric constraints on adapter manifolds.
We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that achieves constant-time inference complexity (O(1)), regardless of the number of integrated experts. Experimental results across multi-domain adaptation and class-incremental learning show that GeoStack provides an efficient mechanism for long-term knowledge composition while significantly mitigating catastrophic forgetting. Code is available at https://github.com/QuantitativeImagingLaboratory/GeoStack.
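The abstract does not spell out how weight folding works, so the following is only a minimal sketch of the general idea, not GeoStack's actual construction. It assumes each expert is a linear update of the form (I + Δ_i) applied to a base weight matrix, and that the Δ_i live in mutually orthogonal subspaces. The names `make_orthogonal_experts` and `fold` are hypothetical. Under those assumptions, any number of experts collapses into a single matrix of the original shape, and the folding order does not matter.

```python
import numpy as np


def make_orthogonal_experts(dim, num_experts, rank, seed=0):
    """Build toy expert updates confined to mutually orthogonal subspaces.

    Assumption for illustration only: each update is Q_i @ M_i @ Q_i.T,
    where the Q_i are disjoint column blocks of one orthonormal basis.
    """
    assert num_experts * rank <= dim, "subspaces must fit inside dim"
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    experts = []
    for i in range(num_experts):
        block = basis[:, i * rank:(i + 1) * rank]        # (dim, rank)
        core = 0.1 * rng.standard_normal((rank, rank))   # small perturbation
        experts.append(block @ core @ block.T)           # (dim, dim)
    return experts


def fold(base, experts):
    """Fold every expert into one weight matrix with the shape of `base`."""
    identity = np.eye(base.shape[0])
    folded = base.copy()
    for delta in experts:
        folded = (identity + delta) @ folded
    return folded


if __name__ == "__main__":
    base = np.random.default_rng(1).standard_normal((16, 8))
    experts = make_orthogonal_experts(dim=16, num_experts=4, rank=2)

    forward = fold(base, experts)
    backward = fold(base, experts[::-1])

    # One matrix of the original shape, whatever the number of experts.
    print(forward.shape == base.shape)
    # Orthogonal subspaces make the composition order-independent.
    print(np.allclose(forward, backward))
```

Because the folded matrix has the same shape as the base weight, a forward pass costs one matrix multiply however many experts were merged, which is the sense in which inference cost stays constant in this toy setting.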
Community
Paper submitter
How many domain experts can you stack before a VLM collapses? 🧱
GeoStack introduces a geometric framework to compose independently trained experts into a single model with zero added inference cost. By using a perturbation prior and orthogonality constraints, it achieves a 10x reduction in geometric error compared to standard adapters.
If you’re looking for a way to build specialized VLMs that don’t forget their foundational knowledge, check this out!
Similar Articles
Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models
GeoVR enhances multimodal large language models with 3D awareness by restructuring their semantic latent space through geometric knowledge distillation from 3D foundation models using multiple geometric targets.
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
This paper proposes GASP, a framework that injects geometric priors into vision-language models via deep supervision with contrastive and depth consistency losses, achieving significant improvements on 3D spatial reasoning benchmarks without using 3D VQA data.
Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors
Stream3D-VLM is an online 3D vision-language model that enables real-time spatial understanding from streaming video by incrementally integrating geometry priors and using geometry-adaptive voxel compression, outperforming existing models on 3D spatial understanding tasks.
HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning
HyperGVL introduces the first benchmark for evaluating Large Vision-Language Models on hypergraph understanding and reasoning, featuring 84,000 QA samples across 12 tasks and real-world applications. The paper also proposes WiseHyGR, a generalizable router that enhances LVLM performance through adaptive hypergraph representations.
Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models
This paper presents a systematic frozen-feature probing study comparing vision-language models (VLMs) and video generation models (VGMs) on spatial intelligence tasks. It finds that VLMs excel at semantic tagging and instance grouping, while VGMs provide better dense geometry and camera motion signals, and a naive fusion of both yields strong performance across all axes.
