LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Hugging Face Daily Papers 05/09/26, 12:00 AM Papers

multimodal-llm visual-encoding efficiency high-resolution llava research

Summary

This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by combining slice-based encoding with intra-ViT early compression. It reduces visual-encoding FLOPs by 55.8% while matching or surpassing baseline performance on high-resolution image tasks.

Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers visual-encoding FLOPs while preserving downstream performance. By integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored for high-resolution inputs. Across a diverse set of benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. These results suggest that visual-encoding efficiency can be substantially improved without sacrificing downstream performance, providing a practical design direction for efficient high-resolution MLLMs. All model weights and code will be publicly released to support further research.
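To make the cost argument concrete, the following is a minimal back-of-the-envelope sketch, not the paper's implementation. It counts approximate multiply-accumulates (MACs) for a standard ViT block and compares running every layer at full token count (as post-ViT compression must) against reducing the token count at a shallow layer. The function names, the `compress_at` and `keep_ratio` parameters, and the example sizes are all illustrative assumptions; the cost of the compression module itself is ignored, so the numbers will not reproduce the reported 55.8%.

```python
def layer_macs(n_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate multiply-accumulates for one standard ViT block."""
    proj = 4 * n_tokens * dim * dim              # Q, K, V and output projections
    attn = 2 * n_tokens * n_tokens * dim         # QK^T scores and weighted sum of V
    mlp = 2 * mlp_ratio * n_tokens * dim * dim   # two linear layers of the MLP
    return proj + attn + mlp


def vit_macs(n_tokens, dim, depth, compress_at=None, keep_ratio=1.0):
    """Total MACs when tokens are reduced just before layer `compress_at`.

    `compress_at` and `keep_ratio` are hypothetical knobs for this sketch;
    `compress_at=None` models compression applied only after the ViT.
    """
    total = 0
    n = n_tokens
    for layer in range(depth):
        if compress_at is not None and layer == compress_at:
            n = max(1, int(n * keep_ratio))      # fewer tokens from here on
        total += layer_macs(n, dim)
    return total


if __name__ == "__main__":
    # Assumed sizes for illustration only.
    post_vit = vit_macs(4096, 1024, 24)
    early = vit_macs(4096, 1024, 24, compress_at=6, keep_ratio=0.25)
    print(f"relative saving: {1 - early / post_vit:.1%}")
```

Because the attention term grows with the square of the token count, the earlier the reduction happens, the more layers benefit from it; moving `compress_at` toward the last layer shrinks the saving toward zero.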

Cached at: 05/12/26, 07:27 AM

Source: https://huggingface.co/papers/2605.08985

Abstract

Efficient visual encoding for high-resolution inputs in multimodal large language models is achieved through slice-based encoding and intra-ViT early compression, reducing computational costs while maintaining performance.

Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers visual-encoding FLOPs while preserving downstream performance. By integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored for high-resolution inputs. Across a diverse set of benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. These results suggest that visual-encoding efficiency can be substantially improved without sacrificing downstream performance, providing a practical design direction for efficient high-resolution MLLMs. All model weights and code will be publicly released to support further research.
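As a rough illustration of why sliced views are cheaper than global attention, the sketch below counts token-to-token attention pairs for one image encoded as a single sequence versus the same image cut into independent slices. It is an assumption-laden toy, not the LLaVA-UHD v4 slicing scheme: the helper names, the fixed square `slice_size`, and the patch size are hypothetical, and details such as adaptive slice layouts or an extra overview image are left out.

```python
import math


def tokens(height: int, width: int, patch: int) -> int:
    """Number of patch tokens for an image region of the given size."""
    return math.ceil(height / patch) * math.ceil(width / patch)


def attention_pairs_global(height: int, width: int, patch: int) -> int:
    """Every token attends to every other token in the whole image."""
    n = tokens(height, width, patch)
    return n * n


def attention_pairs_sliced(height: int, width: int, patch: int, slice_size: int) -> int:
    """Tokens attend only within their own slice (hypothetical fixed grid)."""
    rows = math.ceil(height / slice_size)
    cols = math.ceil(width / slice_size)
    total = 0
    for r in range(rows):
        for c in range(cols):
            # Edge slices may be smaller than slice_size.
            h = min(slice_size, height - r * slice_size)
            w = min(slice_size, width - c * slice_size)
            n = tokens(h, w, patch)
            total += n * n
    return total


if __name__ == "__main__":
    # Assumed example: a 1344x1344 image, 14-pixel patches, 448-pixel slices.
    g = attention_pairs_global(1344, 1344, 14)
    s = attention_pairs_sliced(1344, 1344, 14, 448)
    print(f"global/sliced attention pairs: {g / s:.1f}x")
```

With S equal-sized slices, the pair count drops from N^2 to roughly N^2 / S, since each slice contributes (N / S)^2 pairs. This only accounts for the attention term; the abstract's accuracy claim for slice-based encoding rests on the paper's controlled experiments, which this sketch does not model.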

Get this paper in your agent:

hf papers read 2605.08985

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### openbmb/MiniCPM-V-4.6 Image-Text-to-Text • 1B • Updated about 1 hour ago • 308

Similar Articles

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge

AdaCodec: A Predictive Visual Code for Video MLLMs

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation
