LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
Summary
This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by combining slice-based encoding with intra-ViT early compression. It reduces visual-encoding FLOPs by 55.8% while matching or surpassing baseline performance on document understanding, OCR, and general VQA benchmarks.
Cached at: 05/12/26, 07:27 AM
Source: https://huggingface.co/papers/2605.08985
Abstract
Efficient visual encoding for high-resolution inputs in multimodal large language models is achieved through slice-based encoding and intra-ViT early compression, reducing computational costs while maintaining performance.
Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers visual-encoding FLOPs while preserving downstream performance. By integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored for high-resolution inputs. Across a diverse set of benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. These results suggest that visual-encoding efficiency can be substantially improved without sacrificing downstream performance, providing a practical design direction for efficient high-resolution MLLMs. All model weights and code will be publicly released to support further research.
Models citing this paper: 1
#### openbmb/MiniCPM-V-4.6 • Image-Text-to-Text • 1B • Updated about 1 hour ago • 308
Similar Articles
LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
LLaVA-OneVision-2 introduces codec-stream tokenization and windowed attention for efficient video understanding, achieving state-of-the-art performance across multiple multimodal benchmarks including video, spatial, and tracking tasks.
From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
This paper studies how audio and visual information flow inside Audio-Visual Large Language Models (AVLLMs), revealing that AVLLMs follow sequential or parallel routing depending on input configuration, and that some tokens can be discarded after information transfer for efficiency.
Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge
This paper introduces LaViD, a framework that transfers semantic knowledge from a language-only LLM to a vision student model by generating multiple-choice questions as conceptual signatures, achieving superior fine-grained classification performance and robustness.
AdaCodec: A Predictive Visual Code for Video MLLMs
AdaCodec reduces video encoding redundancy in multimodal LLMs by transmitting full visual tokens only when scene prediction fails, otherwise using compact inter-frame change descriptions. It outperforms per-frame RGB baselines at matched token budgets and achieves better or comparable results with significantly fewer tokens, reducing time-to-first-token from 9.26s to 1.62s.
Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation
This paper proposes DPVR-LF, a modality-asymmetric routing framework for MLLMs that routes vision tokens at their saturation point into a lightweight side branch and performs late fusion, reducing visual computation while maintaining competitive performance.