LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Hugging Face Daily Papers

Summary

This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by combining slice-based encoding with intra-ViT early compression. It reduces visual-encoding FLOPs by 55.8% while matching or surpassing baseline performance on document understanding, OCR, and general VQA benchmarks.

Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers visual-encoding FLOPs while preserving downstream performance. By integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored for high-resolution inputs. Across a diverse set of benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. These results suggest that visual-encoding efficiency can be substantially improved without sacrificing downstream performance, providing a practical design direction for efficient high-resolution MLLMs. All model weights and code will be publicly released to support further research.
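The quadratic-cost argument above can be made concrete with a rough back-of-the-envelope sketch. This is not the paper's FLOPs accounting: the patch size, hidden width, image size, and slice grid below are illustrative assumptions, only the two token-by-token attention matmuls are counted, and any extra low-resolution overview image a slicing scheme might add is ignored.

```python
def num_patches(height: int, width: int, patch: int = 14) -> int:
    """Number of ViT patch tokens (assumes sizes divisible by the patch size)."""
    return (height // patch) * (width // patch)


def attention_matmul_flops(tokens: int, dim: int) -> int:
    """FLOPs of QK^T and attn @ V in one layer.

    Each matmul costs about 2 * tokens^2 * dim FLOPs, so the sum is
    4 * tokens^2 * dim. Projections and MLPs (linear in tokens) are ignored.
    """
    return 4 * tokens * tokens * dim


def global_vs_sliced(height, width, rows, cols, patch=14, dim=1024):
    """Per-layer attention cost of one global pass vs. rows*cols independent slices."""
    global_tokens = num_patches(height, width, patch)
    slice_tokens = num_patches(height // rows, width // cols, patch)
    global_cost = attention_matmul_flops(global_tokens, dim)
    sliced_cost = rows * cols * attention_matmul_flops(slice_tokens, dim)
    return global_cost, sliced_cost


# Hypothetical 1344x1344 input split into a 2x2 grid of 672x672 slices.
global_cost, sliced_cost = global_vs_sliced(1344, 1344, rows=2, cols=2)
print(global_cost / sliced_cost)  # 4.0
```

Under these assumptions, splitting N tokens into S equal slices cuts the quadratic attention term from N² to N²/S, while the total token count handed to later stages stays the same. That is why the abstract treats token compression as a separate, complementary lever.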

Cached at: 05/12/26, 07:27 AM


Source: https://huggingface.co/papers/2605.08985

Abstract

Efficient visual encoding for high-resolution inputs in multimodal large language models is achieved through slice-based encoding and intra-ViT early compression, reducing computational costs while maintaining performance.

Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers visual-encoding FLOPs while preserving downstream performance. By integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored for high-resolution inputs. Across a diverse set of benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. These results suggest that visual-encoding efficiency can be substantially improved without sacrificing downstream performance, providing a practical design direction for efficient high-resolution MLLMs. All model weights and code will be publicly released to support further research.
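A minimal sketch of why compressing tokens in shallow ViT layers saves more compute than compressing after the last layer. It uses the standard per-layer transformer FLOPs estimate. The depth, width, token count, compression point, and compression ratio are hypothetical, the cost of the compression module itself is ignored, and the numbers are not meant to reproduce the paper's 55.8% figure.

```python
def layer_flops(tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate FLOPs of one transformer encoder layer."""
    proj = 8 * tokens * dim * dim             # Q, K, V and output projections
    mlp = 4 * mlp_ratio * tokens * dim * dim  # two linear layers with hidden size mlp_ratio * dim
    attn = 4 * tokens * tokens * dim          # QK^T and attn @ V
    return proj + mlp + attn


def vit_flops(tokens, dim, depth, compress_after=None, ratio=1):
    """Total encoder FLOPs when tokens shrink by `ratio` after `compress_after` layers.

    compress_after=None (or equal to depth) models post-ViT compression:
    every layer runs on the full token sequence.
    """
    total = 0
    n = tokens
    for layer in range(depth):
        if compress_after is not None and layer == compress_after:
            n = n // ratio  # token reduction happens inside the ViT
        total += layer_flops(n, dim)
    return total


# Hypothetical setting: 24 layers, width 1024, 2304 tokens per slice,
# 4x token reduction after the first 4 layers.
full = vit_flops(2304, 1024, depth=24)
early = vit_flops(2304, 1024, depth=24, compress_after=4, ratio=4)
print(f"relative saving: {1 - early / full:.1%}")
```

Moving `compress_after` toward 0 increases the saving, and moving it toward `depth` recovers the post-ViT cost, which is one way to read the abstract's description of the scheme as compute-controllable.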


Get this paper in your agent:

hf papers read 2605.08985

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### openbmb/MiniCPM-V-4.6 Image-Text-to-Text • 1B • Updated about 1 hour ago • 308


Similar Articles

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Hugging Face Daily Papers

LLaVA-OneVision-2 introduces codec-stream tokenization and windowed attention for efficient video understanding, achieving state-of-the-art performance across multiple multimodal benchmarks including video, spatial, and tracking tasks.

AdaCodec: A Predictive Visual Code for Video MLLMs

Hugging Face Daily Papers

AdaCodec reduces video encoding redundancy in multimodal LLMs by transmitting full visual tokens only when scene prediction fails, otherwise using compact inter-frame change descriptions. It outperforms per-frame RGB baselines at matched token budgets and achieves better or comparable results with significantly fewer tokens, reducing time-to-first-token from 9.26s to 1.62s.