GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs

Hugging Face Daily Papers

Summary

GeoStack introduces a geometric framework to compose independently trained domain experts in Vision-Language Models without catastrophic forgetting, achieving constant-time inference and a 10x reduction in geometric error.

We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that achieves constant-time inference complexity (O(1)), regardless of the number of integrated experts. Experimental results across multi-domain adaptation and class-incremental learning show that GeoStack provides an efficient mechanism for long-term knowledge composition while significantly mitigating catastrophic forgetting. Code is available at https://github.com/QuantitativeImagingLaboratory/GeoStack.
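The constant-time claim can be illustrated with a generic additive low-rank adapter. This is an assumption made for illustration only: the abstract does not state GeoStack's actual update rule. If each expert contributes an update B_i A_i to a frozen weight W, the updates can be summed into W once, offline, so inference multiplies by a single matrix no matter how many experts were composed. The helper names (`fold`, `forward_unfolded`) and the toy matrices are hypothetical.

```python
# Illustrative sketch only: generic additive low-rank adapters,
# NOT the update rule from the GeoStack paper (which is not given here).

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fold(base, experts):
    """Fold every expert update B @ A into the base weight once, offline."""
    folded = base
    for b, a in experts:
        folded = add(folded, matmul(b, a))
    return folded

def forward_unfolded(base, experts, x):
    """Reference path: one extra branch per expert, so cost grows with their number."""
    y = matmul(base, x)
    for b, a in experts:
        y = add(y, matmul(b, matmul(a, x)))
    return y

base = [[1, 0], [0, 1]]            # frozen base weight (2x2)
experts = [
    ([[1], [0]], [[0, 2]]),        # expert 1: B is 2x1, A is 1x2 (rank 1)
    ([[0], [3]], [[1, 0]]),        # expert 2: B is 2x1, A is 1x2 (rank 1)
]
x = [[5], [7]]                     # input as a column vector

folded = fold(base, experts)       # done once, before deployment
y_folded = matmul(folded, x)       # a single matmul at inference time
y_reference = forward_unfolded(base, experts, x)
```

Both paths give the same output, but the folded path performs one matrix multiplication per layer regardless of how many experts were merged, which is the sense in which inference cost stays constant.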

Cached at: 05/08/26, 06:28 PM

Source: https://huggingface.co/papers/2605.06477

Abstract

GeoStack is a modular framework that composes domain experts in Vision-Language Models while preserving foundational knowledge and enabling constant-time inference through geometric constraints on adapter manifolds.

We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that achieves constant-time inference complexity (O(1)), regardless of the number of integrated experts. Experimental results across multi-domain adaptation and class-incremental learning show that GeoStack provides an efficient mechanism for long-term knowledge composition while significantly mitigating catastrophic forgetting. Code is available at https://github.com/QuantitativeImagingLaboratory/GeoStack.

Community

Paper submitter

about 3 hours ago

How many domain experts can you stack before a VLM collapses? 🧱

GeoStack introduces a geometric framework to compose independently trained experts into a single model with zero added inference cost. By using a perturbation prior and orthogonality constraints, it achieves a 10x reduction in geometric error compared to standard adapters.

If you’re looking for a way to build specialized VLMs that don’t forget their foundational knowledge, check this out!
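One way to read the orthogonality constraints mentioned in the comment above, offered as an assumption rather than the paper's construction: if each expert acts as a perturbation I + D of the identity, and the perturbations of different experts live in mutually orthogonal subspaces, then applying the experts in either order gives the same result. The commutator norm below measures how far a pair of experts is from commuting. The expert form and the toy matrices are hypothetical.

```python
# Illustrative sketch only: the expert form (I + D) and the example matrices
# are assumptions, not taken from the GeoStack paper.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def sub(a, b):
    """Element-wise difference of two equally shaped matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def identity_plus(d):
    """Return I + D for a square matrix D."""
    n = len(d)
    return [[d[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]

def commutator_norm(d1, d2):
    """Frobenius norm of E1 E2 - E2 E1 with E = I + D; zero means order is irrelevant."""
    e1, e2 = identity_plus(d1), identity_plus(d2)
    diff = sub(matmul(e1, e2), matmul(e2, e1))
    return sum(v * v for row in diff for v in row) ** 0.5

# d_a and d_b have mutually orthogonal row and column spaces.
d_a = [[0, 2, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
d_b = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 3], [0, 0, 0, 0]]
# d_c shares coordinates with d_a, so the two interfere.
d_c = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

orthogonal_gap = commutator_norm(d_a, d_b)    # experts commute exactly
overlapping_gap = commutator_norm(d_a, d_c)   # composition depends on order
```

In this toy setting the orthogonal pair has a commutator norm of zero, while the overlapping pair does not, so its composition depends on the order in which the experts are applied.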

Get this paper in your agent:

hf papers read 2605.06477

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
