PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

Summary

PARCEL is a visual tokenization architecture for vision-language models that combines spatial pool tokens with elastic query tokens conditioned on them. Across 27 benchmarks, it outperforms existing matryoshka baselines at every visual-token budget tested while keeping a single trained model.

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
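The claim that pooling acts as an imperfect low-pass filter can be illustrated with a one-dimensional toy signal. The sketch below is not from the paper: the helper names `avg_pool_1d` and `dominant_cycles` are hypothetical, and a plain cosine stands in for fine-grained visual detail. It shows that average pooling with stride 2 only attenuates a high-frequency pattern, so the remainder reappears at a lower, false frequency.

```python
import numpy as np

def avg_pool_1d(signal, stride):
    """Non-overlapping average pooling; the length must be divisible by stride."""
    return signal.reshape(-1, stride).mean(axis=1)

def dominant_cycles(signal):
    """Cycle count (across the whole signal) with the largest spectral magnitude, DC excluded."""
    spectrum = np.abs(np.fft.rfft(signal))
    spectrum[0] = 0.0
    return int(np.argmax(spectrum))

n = np.arange(16)
fine_detail = np.cos(2 * np.pi * 6 * n / 16)  # 6 cycles across 16 samples
pooled = avg_pool_1d(fine_detail, stride=2)   # 8 samples can represent at most 4 cycles

print(dominant_cycles(fine_detail))  # 6
print(dominant_cycles(pooled))       # 2: the 6-cycle pattern aliases to 2 cycles
```

The box filter scales the pattern by cos(3π/8), roughly 0.38, rather than removing it, which is why the leftover energy folds back into the representable band.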


Cached at: 06/02/26, 03:35 PM


Source: https://huggingface.co/papers/2605.30126

Abstract

PARCEL is a vision-language model architecture that dynamically partitions feature extraction tasks to improve efficiency and performance across different visual-token budgets.

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the “train once, deploy anywhere” paradigm.
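The abstract does not specify how Pool-Conditioned Query Resampling is implemented, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that pool tokens come from average pooling over the patch grid, that a budget is selected by taking a prefix of a learned query bank, and that conditioning is a single residual cross-attention from queries to pool tokens. Every function name and shape here is hypothetical, and learned projections are omitted for brevity.

```python
import numpy as np

def avg_pool_tokens(grid, stride):
    """Average-pool an (H, W, D) feature grid into (H//stride * W//stride, D) pool tokens."""
    H, W, D = grid.shape
    h, w = H // stride, W // stride
    blocks = grid[:h * stride, :w * stride].reshape(h, stride, w, stride, D)
    return blocks.mean(axis=(1, 3)).reshape(h * w, D)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head attention without projections (a simplification)."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ keys_values

def pool_conditioned_tokens(grid, query_bank, stride, num_queries):
    """Hypothetical tokenizer: pool anchors followed by anchor-conditioned query tokens."""
    pool = avg_pool_tokens(grid, stride)        # grid-aligned, low-frequency anchors
    queries = query_bank[:num_queries]          # assumed nested prefix sets the budget
    queries = queries + cross_attend(queries, pool)  # assumed form of conditioning
    patches = grid.reshape(-1, grid.shape[-1])
    resampled = cross_attend(queries, patches)  # queries summarize the full patch grid
    return np.concatenate([pool, resampled], axis=0)

rng = np.random.default_rng(0)
grid = rng.standard_normal((8, 8, 16))          # stand-in for vision-encoder features
query_bank = rng.standard_normal((12, 16))      # stand-in for learned queries

small = pool_conditioned_tokens(grid, query_bank, stride=4, num_queries=4)
large = pool_conditioned_tokens(grid, query_bank, stride=4, num_queries=12)
print(small.shape, large.shape)  # (8, 16) (16, 16)
```

In this toy version each query is processed independently, so the smaller budget's tokens are a prefix of the larger budget's tokens. That mirrors the nested, train-once property the abstract describes, though the paper may realize it differently.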



Similar Articles

Video2LoRA: Parametric Video Internalization for Vision-Language Models

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

Stateful Visual Encoders for Vision-Language Models

Parallax: Parameterized Local Linear Attention for Language Modeling

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
