LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Hugging Face Daily Papers

Summary

This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by combining slice-based encoding with intra-ViT early compression. It reduces visual-encoding FLOPs by 55.8% while matching or surpassing baseline performance on high-resolution image tasks such as document understanding, OCR, and general VQA.

Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers visual-encoding FLOPs while preserving downstream performance. By integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored for high-resolution inputs. Across a diverse set of benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. These results suggest that visual-encoding efficiency can be substantially improved without sacrificing downstream performance, providing a practical design direction for efficient high-resolution MLLMs. All model weights and code will be publicly released to support further research.
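The abstract gives no implementation details, but the two ideas it names are straightforward to illustrate. Below is a minimal, self-contained PyTorch sketch, not the authors' implementation: the slicing policy, the toy ViT dimensions, the 2x2 average-pool token merge, and the layer at which compression happens are all illustrative assumptions.

```python
# Minimal sketch (NOT the paper's implementation) of the two ideas in the abstract:
# slice-based encoding of a high-resolution image, and intra-ViT early compression,
# i.e. merging visual tokens after a few shallow transformer layers instead of only
# compressing after the full ViT. Sizes, the 2x2 average-pool merge, and the
# compression depth are illustrative assumptions.
import torch
import torch.nn as nn


def slice_image(image: torch.Tensor, slice_size: int = 336) -> torch.Tensor:
    """Split a (C, H, W) image into non-overlapping square slices.

    Assumes H and W are multiples of slice_size; the paper's actual slicing
    policy (aspect-ratio aware, padded, etc.) is more involved.
    """
    c, h, w = image.shape
    s = image.unfold(1, slice_size, slice_size).unfold(2, slice_size, slice_size)
    return s.permute(1, 2, 0, 3, 4).reshape(-1, c, slice_size, slice_size)


class TinyViTWithEarlyCompression(nn.Module):
    """Toy ViT that merges tokens with 2x2 average pooling after `compress_at`
    shallow layers, so the remaining (deeper) layers attend over 4x fewer tokens."""

    def __init__(self, image_size=336, patch_size=14, dim=256, depth=6,
                 heads=8, compress_at=2):
        super().__init__()
        self.grid = image_size // patch_size  # tokens per side
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        make_layer = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.shallow = nn.ModuleList([make_layer() for _ in range(compress_at)])
        self.deep = nn.ModuleList([make_layer() for _ in range(depth - compress_at)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_slices, 3, H, W) -> patch tokens (num_slices, N, dim)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos_embed
        for blk in self.shallow:              # shallow layers see full-resolution tokens
            tokens = blk(tokens)
        # Early compression: 2x2 average pooling over the token grid (assumed merge
        # operator; the actual compression module in the paper may differ).
        b, n, d = tokens.shape
        tokens = tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        tokens = nn.functional.avg_pool2d(tokens, kernel_size=2)
        tokens = tokens.flatten(2).transpose(1, 2)   # (b, N/4, d)
        for blk in self.deep:                 # deeper layers run on 4x fewer tokens
            tokens = blk(tokens)
        return tokens                         # visual tokens handed to the LLM


if __name__ == "__main__":
    image = torch.randn(3, 672, 672)          # a "high-resolution" input
    slices = slice_image(image)               # (4, 3, 336, 336)
    encoder = TinyViTWithEarlyCompression()
    print(slices.shape, encoder(slices).shape)  # per-slice tokens reduced 576 -> 144
```

The intended intuition: once merging happens in a shallow layer, the deeper layers, which make up most of the ViT, attend over 4x fewer tokens, and that is where the bulk of the claimed visual-encoding FLOPs savings would come from.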

Paper page - LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Source: https://huggingface.co/papers/2605.08985

Abstract

Efficient visual encoding for high-resolution inputs in multimodal large language models is achieved through slice-based encoding and intra-ViT early compression, reducing computational costs while maintaining performance.
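As a rough illustration of where the savings come from, the following back-of-the-envelope estimate uses assumed ViT-L-like settings (24 layers, 1024-dim, 576 tokens per slice, 4x token merging after 4 shallow layers), not the paper's measured configuration; it only shows how early token reduction shrinks both the quadratic attention term and the per-token projection/MLP terms.

```python
# Back-of-the-envelope FLOPs comparison with ASSUMED settings (not the paper's):
# a ViT layer costs roughly O(N^2 * d) for attention plus O(N * d^2) for the
# QKV/output projections and the MLP, so merging tokens 4x early shrinks both.
def vit_layer_flops(num_tokens: int, dim: int) -> float:
    attn = 2 * num_tokens ** 2 * dim      # QK^T and attention-weighted V
    proj = 4 * num_tokens * dim ** 2      # Q, K, V and output projections
    mlp = 8 * num_tokens * dim ** 2       # 4x-expansion MLP (two matmuls)
    return attn + proj + mlp

depth, compress_at, n, d = 24, 4, 576, 1024   # assumed ViT-L-like values
baseline = depth * vit_layer_flops(n, d)
early = (compress_at * vit_layer_flops(n, d)
         + (depth - compress_at) * vit_layer_flops(n // 4, d))
print(f"FLOPs kept with early compression: {early / baseline:.0%}")  # ~36%
```

With these assumed numbers the encoder keeps roughly a third of its baseline FLOPs, the same kind of saving, though not the same measurement, as the 55.8% visual-encoding FLOPs reduction reported in the abstract.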



Get this paper in your agent:

hf papers read 2605.08985

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper (1)

openbmb/MiniCPM-V-4.6 (Image-Text-to-Text) • 1B • Updated about 1 hour ago • 308


Similar Articles

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Papers with Code Trending

MiniCPM-V 4.5 is an 8B multimodal large language model that achieves high efficiency and strong performance through a unified 3D-Resampler architecture, a novel data strategy, and a hybrid reinforcement learning approach. The model reportedly surpasses larger proprietary and open-source models on benchmarks while significantly reducing GPU memory usage and inference time.

Large Vision-Language Models Get Lost in Attention

arXiv cs.AI

This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.