PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding
Summary
PARCEL introduces a novel vision-language model architecture that uses pool-anchored resampling and conditioned elastic queries to improve efficiency and performance across different visual-token budgets, outperforming existing matryoshka baselines.
Cached at: 06/02/26, 03:35 PM
Source: https://huggingface.co/papers/2605.30126
Abstract
PARCEL is a vision-language model architecture that dynamically partitions feature extraction tasks to improve efficiency and performance across different visual-token budgets.
Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the “train once, deploy anywhere” paradigm.
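The abstract does not spell out how Pool-Conditioned Query Resampling is computed, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes that pool tokens come from average pooling over the feature grid, that each query is conditioned by a residual attention read over the pool tokens, and that smaller budgets reuse a prefix of the learned queries. All function names, shapes, and the random stand-in features are hypothetical.

```python
import math
import random

def avg_pool_grid(feats, grid, out):
    """Average-pool a grid x grid map of feature vectors to out x out pool tokens."""
    step = grid // out
    dim = len(feats[0])
    pooled = []
    for py in range(out):
        for px in range(out):
            cell = [feats[(py * step + dy) * grid + (px * step + dx)]
                    for dy in range(step) for dx in range(step)]
            pooled.append([sum(v[d] for v in cell) / len(cell) for d in range(dim)])
    return pooled

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over plain Python lists."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                    for d in range(len(values[0]))])
    return out

def pool_anchored_tokens(feats, grid, pool_side, queries, num_queries):
    """Return pool tokens followed by num_queries pool-conditioned query tokens."""
    pool = avg_pool_grid(feats, grid, pool_side)
    # Assumed nesting: a smaller budget reuses the first queries of a larger one.
    active = queries[:num_queries]
    if not active:
        return pool
    # Assumed conditioning: each query reads the pool anchors (residual update).
    context = attend(active, pool, pool)
    conditioned = [[q + c for q, c in zip(qv, cv)]
                   for qv, cv in zip(active, context)]
    # The conditioned queries then resample the full-resolution features.
    return pool + attend(conditioned, feats, feats)

random.seed(0)
GRID, DIM = 8, 16  # hypothetical 8 x 8 feature grid with 16-dim features
features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(GRID * GRID)]
learned_queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(32)]

for budget_queries in (0, 8, 32):
    tokens = pool_anchored_tokens(features, GRID, 2, learned_queries, budget_queries)
    print(budget_queries, len(tokens))
```

With a 2 x 2 pool, the three settings yield 4, 12, and 36 visual tokens from the same weights, which is the sense in which a single model can serve several budgets. A real system would use learned projections and multi-head attention; both are omitted here for brevity.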
Similar Articles
Video2LoRA: Parametric Video Internalization for Vision-Language Models
This paper introduces Video2LoRA, a method that predicts Low-Rank Adaptation (LoRA) weights directly from video representations, enabling efficient video processing in frozen vision-language models. It reduces visual token load by up to 1500x and query TTFT by 6-80x while maintaining performance on video summarization and captioning benchmarks.
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
SWIM is a novel training strategy that aligns vision and language representations for fine-grained object understanding using only textual prompts, leveraging mask supervision during training to improve cross-modal attention. It introduces the NL-Refer dataset and achieves superior performance over visual-prompt-based methods.
Stateful Visual Encoders for Vision-Language Models
This paper introduces a stateful visual encoder for vision-language models that conditions visual representations on prior features, enabling better visual comparison in multi-image and agentic settings. The method shows consistent improvements across tasks such as cross-image spatial aggregation and longitudinal radiology.
Parallax: Parameterized Local Linear Attention for Language Modeling
Introduces Parallax, a parameterized local linear attention mechanism with hardware-aware optimization that improves LLM pretraining efficiency and performance, achieving Pareto improvements at 0.6B and 1.7B scales.
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
PaddleOCR-VL is a compact 0.9B vision-language model that achieves state-of-the-art performance in multilingual document parsing and element recognition by integrating NaViT-style dynamic resolution with the ERNIE language model.