Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Hugging Face Daily Papers Papers

Summary

This paper proposes DPVR-LF, a modality-asymmetric routing framework for MLLMs that routes vision tokens at their saturation point into a lightweight side branch and performs late fusion, reducing visual computation while maintaining competitive performance.

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.
Original Article
View Cached Full Text

Cached at: 06/10/26, 09:44 AM

Paper page - Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Source: https://huggingface.co/papers/2606.09131

Abstract

Research reveals that vision and text tokens in multimodal models evolve asynchronously, leading to inefficient computation; a new asymmetric routing framework reduces visual processing overhead while maintaining performance.

Multimodal large language models(MLLMs) commonly inherit the deep, symmetricTransformer backbonedesigned for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a keymodality asymmetry: image andtext tokensdiffer substantially in information density, redundancy, and required reasoning depth. Through alayer-wise analysisof LLaVA-1.5, we observe thatvision tokenstend to saturate in the middle layers. Specifically,text-to-image attentiondecreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereastext tokenscontinue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift inperceptual representationsduring deep task-specific adaptation. Motivated by this, we proposeDual-Path Vision Token Routing(DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation,DPVR-LF(Late-Layer Fusion), routesvision tokensat the saturation point into a one-layertrainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters,DPVR-LFpreserves competitive multimodal performance on standard benchmarks while reducing visual computation in thedeep Transformer stack. The results challenge the conventional assumption thatvision tokensmust traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.

View arXiv pageView PDFAdd to collection

Get this paper in your agent:

hf papers read 2606\.09131

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.09131 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2606.09131 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.09131 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Hugging Face Daily Papers

LoMo proposes a data curation method that reformulates single-modality prompts into interleaved multimodal sequences to improve cross-modal representation alignment in vision-language models, achieving consistent gains on multiple benchmarks.

Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

Hugging Face Daily Papers

This paper introduces DRoRAE, a method that improves visual tokenization by fusing multi-layer features from pretrained vision encoders rather than relying solely on the last layer. It demonstrates significant improvements in reconstruction and generation quality on ImageNet and establishes a scaling law between fusion capacity and performance.

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Hugging Face Daily Papers

This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by using slice-based encoding and intra-ViT early compression. It reduces computational costs by over 55% while maintaining or improving performance on high-resolution image tasks.

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

arXiv cs.CL

This paper proposes a query-based cross-modal projector that compresses visual tokens via cross-attention to improve Mamba-based multimodal LLMs, boosting both performance and throughput on vision-language benchmarks while eliminating the need for manual 2D scan order design.