Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation
Summary
This paper proposes DPVR-LF, a modality-asymmetric routing framework for MLLMs that routes vision tokens at their saturation point into a lightweight side branch and performs late fusion, reducing visual computation while maintaining competitive performance.
Cached at: 06/10/26, 09:44 AM
Source: https://huggingface.co/papers/2606.09131
Abstract
Research reveals that vision and text tokens in multimodal models evolve asynchronously, leading to inefficient computation; a new asymmetric routing framework reduces visual processing overhead while maintaining performance.
Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.
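The routing described in the abstract can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: the function names (`mean_mix_layer`, `dpvr_lf_forward`, `vision_layer_visits`) are hypothetical, the toy layer merely blends each token with the sequence mean in place of a real Transformer block, and the split of 18 shared layers assumes a 32-layer backbone (18 shared + 13 text-only + 1 fusion), which the abstract does not state. The sketch also assumes that the deep text-only layers see text tokens alone.

```python
from typing import Callable, List, Sequence

Token = List[float]
Layer = Callable[[List[Token]], List[Token]]


def mean_mix_layer(tokens: List[Token]) -> List[Token]:
    """Toy stand-in for a Transformer block: blend each token with the sequence mean."""
    if not tokens:
        return []
    dim = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return [[0.5 * t[i] + 0.5 * mean[i] for i in range(dim)] for t in tokens]


def dpvr_lf_forward(
    tokens: List[Token],
    is_image: Sequence[bool],
    shared_layers: Sequence[Layer],
    text_layers: Sequence[Layer],
    side_layer: Layer,
    fusion_layer: Layer,
) -> List[Token]:
    """Hypothetical late-fusion routing: shared stack, split, then one fused layer."""
    # 1) Shared early layers process image and text tokens together.
    hidden = list(tokens)
    for layer in shared_layers:
        hidden = layer(hidden)

    # 2) At the (assumed) saturation point, split the sequence by modality.
    img_idx = [i for i, flag in enumerate(is_image) if flag]
    txt_idx = [i for i, flag in enumerate(is_image) if not flag]

    # Vision tokens take a single side-branch layer.
    vision = side_layer([hidden[i] for i in img_idx])

    # Text tokens run the deep stack with image positions skipped (assumption:
    # these layers see text tokens only).
    text = [hidden[i] for i in txt_idx]
    for layer in text_layers:
        text = layer(text)

    # 3) Scatter both streams back to their original positions and fuse once.
    fused: List[Token] = [[] for _ in hidden]
    for i, tok in zip(img_idx, vision):
        fused[i] = tok
    for i, tok in zip(txt_idx, text):
        fused[i] = tok
    return fusion_layer(fused)


def vision_layer_visits(n_shared: int) -> int:
    """Layers a vision token passes through: shared stack + side branch + fusion."""
    return n_shared + 2


# Toy usage: two image tokens followed by two text tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
is_image = [True, True, False, False]
out = dpvr_lf_forward(
    tokens,
    is_image,
    shared_layers=[mean_mix_layer] * 18,  # assumed split, see lead-in
    text_layers=[mean_mix_layer] * 13,    # "thirteen-layer text-only forward"
    side_layer=mean_mix_layer,
    fusion_layer=mean_mix_layer,
)
```

Under the assumed 18/13/1 split, a vision token in this sketch passes through `vision_layer_visits(18)` layers rather than all 32. This is a crude layer count, not a FLOP measurement, since the side branch is described only as lightweight.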
Similar Articles
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
LoMo proposes a data curation method that reformulates single-modality prompts into interleaved multimodal sequences to improve cross-modal representation alignment in vision-language models, achieving consistent gains on multiple benchmarks.
From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
This paper studies how audio and visual information flow inside Audio-Visual Large Language Models (AVLLMs), revealing that AVLLMs follow sequential or parallel routing depending on input configuration, and that some tokens can be discarded after information transfer for efficiency.
Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
This paper introduces DRoRAE, a method that improves visual tokenization by fusing multi-layer features from pretrained vision encoders rather than relying solely on the last layer. It demonstrates significant improvements in reconstruction and generation quality on ImageNet and establishes a scaling law between fusion capacity and performance.
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by using slice-based encoding and intra-ViT early compression. It reduces computational costs by over 55% while maintaining or improving performance on high-resolution image tasks.
Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM
This paper proposes a query-based cross-modal projector that compresses visual tokens via cross-attention to improve Mamba-based multimodal LLMs, boosting both performance and throughput on vision-language benchmarks while eliminating the need for manual 2D scan order design.