Tag
This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by using slice-based encoding and intra-ViT early compression. It reduces computational costs by over 55% while maintaining or improving performance on high-resolution image tasks.