Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Hugging Face Daily Papers 06/08/26, 12:00 AM Papers

multimodal large-language-models vision-tokens token-routing late-fusion efficiency asymmetric-routing

Summary

This paper proposes DPVR-LF, a modality-asymmetric routing framework for MLLMs that routes vision tokens at their saturation point into a lightweight side branch and performs late fusion, reducing visual computation while maintaining competitive performance.

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.
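The abstract does not spell out how text-to-image attention is computed, so the following is only a minimal sketch under one plausible reading: for each text query position, sum the attention weight placed on image key positions, then average over text queries. The function names, the head-averaged input matrix, the tolerance-based saturation rule, and all toy numbers are illustrative assumptions, not the authors' code or data.

```python
def text_to_image_attention(attn, image_positions, text_positions):
    """Mean, over text query positions, of the attention mass on image keys.

    attn[q][k] is the weight query q places on key k. Rows are assumed to be
    row-stochastic and already averaged over heads (an assumption).
    """
    image_positions = set(image_positions)
    per_query = [
        sum(w for k, w in enumerate(attn[q]) if k in image_positions)
        for q in text_positions
    ]
    return sum(per_query) / len(per_query)


def saturation_layer(profile, tol=0.01):
    """First layer from which every later value stays within tol of the last.

    This stopping rule is a hypothetical stand-in for a saturation criterion.
    """
    final = profile[-1]
    layer = len(profile) - 1
    while layer > 0 and abs(profile[layer - 1] - final) <= tol:
        layer -= 1
    return layer


# Toy causal attention over 4 tokens: positions 0-1 are image, 2-3 are text.
toy_attn = [
    [1.00, 0.00, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00],
    [0.10, 0.20, 0.70, 0.00],
    [0.05, 0.05, 0.40, 0.50],
]
mass = text_to_image_attention(toy_attn, [0, 1], [2, 3])

# Made-up per-layer profile, used only to exercise the function.
toy_profile = [0.5, 0.2, 0.1, 0.1, 0.1]
sat = saturation_layer(toy_profile)
```

In a real model, one matrix like `toy_attn` would be collected per layer, and the resulting per-layer values would form the profile passed to `saturation_layer`.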

Original Article

Cached at: 06/10/26, 09:44 AM

Source: https://huggingface.co/papers/2606.09131

Abstract

Research reveals that vision and text tokens in multimodal models evolve asynchronously, leading to inefficient computation; a new asymmetric routing framework reduces visual processing overhead while maintaining performance.

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.
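The routing described above can be sketched as a per-layer schedule of which token positions the main stack processes. The abstract does not give the backbone depth, the exact routing layer, or the token counts, so the sketch assumes a 32-layer decoder, routing at layer 18, 576 image tokens, and a hypothetical 64-token prompt. Under those assumptions the layer count works out as 18 shared layers + 13 text-only layers + 1 fusion layer = 32. The function name `dpvr_lf_schedule` is illustrative, and the one-layer side branch is left out of the count.

```python
def dpvr_lf_schedule(num_layers, route_layer, vision_idx, text_idx):
    """Return, for each main-stack layer, the token positions it processes.

    Layers [0, route_layer) see all tokens, layers [route_layer, num_layers - 1)
    see text tokens only, and the final layer sees all tokens again (late fusion).
    """
    all_idx = sorted(vision_idx + text_idx)
    schedule = []
    for layer in range(num_layers):
        if layer < route_layer or layer == num_layers - 1:
            schedule.append(all_idx)
        else:
            schedule.append(list(text_idx))
    return schedule


# Assumed sizes (not stated in the abstract).
NUM_LAYERS, ROUTE_LAYER = 32, 18
vision_idx = list(range(576))            # image tokens come first
text_idx = list(range(576, 576 + 64))    # hypothetical prompt length

schedule = dpvr_lf_schedule(NUM_LAYERS, ROUTE_LAYER, vision_idx, text_idx)

# Count layers that skip image positions, and the share of vision
# token-layer passes that remain in the main stack.
text_only_layers = sum(1 for s in schedule if len(s) == len(text_idx))
vision_passes_full = NUM_LAYERS * len(vision_idx)
vision_passes_routed = sum(len(s) - len(text_idx) for s in schedule)
remaining_fraction = vision_passes_routed / vision_passes_full

print(text_only_layers)     # 13
print(remaining_fraction)   # 0.59375, i.e. 19 of 32 layers
```

This counts token-layer passes only. It is a rough proxy for compute, since attention cost also depends on sequence length, and it says nothing about the accuracy results reported in the paper.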

Similar Articles

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM
