LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
Summary
This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by using slice-based encoding and intra-ViT early compression. It reduces computational costs by over 55% while maintaining or improving performance on high-resolution image tasks.
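The slice-based encoding mentioned above can be illustrated with a small token-count calculation. This is a minimal sketch under assumed defaults (336-pixel tiles, 14-pixel patches, typical of CLIP-style ViTs); the function names and exact slicing policy are illustrative, not the paper's released code.

```python
import math

def slice_grid(img_w: int, img_h: int, tile: int = 336):
    """How many tile x tile slices cover the image (cols, rows)."""
    cols = math.ceil(img_w / tile)
    rows = math.ceil(img_h / tile)
    return cols, rows

def visual_tokens(img_w: int, img_h: int, tile: int = 336, patch: int = 14):
    """Token count under slice-based encoding: each slice is encoded
    independently at the ViT's native resolution, preserving local detail
    instead of downsampling the whole image into one global view."""
    cols, rows = slice_grid(img_w, img_h, tile)
    tokens_per_slice = (tile // patch) ** 2  # 24 * 24 = 576 per slice
    return cols * rows * tokens_per_slice

# Example: a 1008x672 document image -> 3x2 grid -> 6 slices of 576 tokens
n = visual_tokens(1008, 672)
```

Because each slice attends only within itself, attention cost grows linearly with the number of slices rather than quadratically with total resolution, which is one intuition for why sliced views can beat global encoding on fine-grained tasks.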
Source: https://huggingface.co/papers/2605.08985
Abstract
Efficient visual encoding for high-resolution inputs in multimodal large language models is achieved through slice-based encoding and intra-ViT early compression, reducing computational costs while maintaining performance.
Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers visual-encoding FLOPs while preserving downstream performance. By integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored for high-resolution inputs. Across a diverse set of benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. These results suggest that visual-encoding efficiency can be substantially improved without sacrificing downstream performance, providing a practical design direction for efficient high-resolution MLLMs. All model weights and code will be publicly released to support further research.
Get this paper in your agent:
hf papers read 2605.08985
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
openbmb/MiniCPM-V-4.6 (Image-Text-to-Text, 1B, 308 likes)
Similar Articles
River-LLM: Large Language Model Seamless Exit Based on KV Share
River-LLM proposes a training-free early-exit framework for decoder-only LLMs that uses KV-sharing to eliminate KV-cache gaps, achieving 1.71–2.16× speedup without quality loss.
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
MiniCPM-V 4.5 is an 8B multimodal large language model that achieves high efficiency and strong performance through a unified 3D-Resampler architecture, a novel data strategy, and a hybrid reinforcement learning approach. The model reportedly surpasses larger proprietary and open-source models on benchmarks while significantly reducing GPU memory usage and inference time.
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
R²-dLLM introduces spatio-temporal redundancy reduction techniques that cut diffusion LLM decoding steps by up to 75% while preserving generation quality, addressing a key deployment bottleneck.
Large Vision-Language Models Get Lost in Attention
This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
LLaDA2.0-Uni unifies multimodal understanding and generation within a single diffusion-based large language model architecture.