Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Hugging Face Daily Papers 06/10/26, 12:00 AM Papers

vision-language-models token-reduction attention-mechanism recoverable-routing grounding kv-cache efficiency

Summary

Proposes Reroute, a training-free plug-in for vision-language models that replaces irreversible visual-token pruning with recoverable routing, allowing tokens to re-enter the pipeline later to improve grounding under aggressive token reduction while maintaining VQA performance.

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected vision tokens pass through decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should not be viewed only as irreversible pruning, but also as recoverable routing. The code is available at https://github.com/elmma/mllm-reroute/
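To give a feel for the KV-cache cost mentioned above, here is a rough back-of-the-envelope sketch. It is not taken from the paper: the function name `kv_cache_bytes`, the 576-token image, the 32-layer decoder with hidden size 4096, the fp16 storage, and the two-stage schedule are all illustrative assumptions, and the formula assumes plain multi-head attention where every layer caches one key and one value vector per token.

```python
from typing import List, Tuple


def kv_cache_bytes(
    schedule: List[Tuple[int, int]],
    hidden_size: int,
    bytes_per_value: int = 2,
) -> int:
    """Estimate visual-token KV-cache size for a stage-wise schedule.

    schedule is a list of (num_layers, num_visual_tokens) pairs, one per stage.
    Each layer stores a key and a value vector of length hidden_size per token.
    """
    total = 0
    for num_layers, num_tokens in schedule:
        total += 2 * num_layers * num_tokens * hidden_size * bytes_per_value
    return total


# Hypothetical numbers, chosen only for illustration.
HIDDEN = 4096
no_reduction = kv_cache_bytes([(32, 576)], HIDDEN)          # all tokens in every layer
reduced = kv_cache_bytes([(2, 576), (30, 64)], HIDDEN)      # keep 64 tokens after layer 2

print(no_reduction // 2**20, "MiB without reduction")
print(reduced // 2**20, "MiB with the two-stage schedule")
```

Under these assumed values the visual tokens take 288 MiB of cache without reduction and 48 MiB with the two-stage schedule. Because the estimate depends only on how many tokens each stage processes, not on which ones, a routing method that keeps the same per-stage counts stays within the same budget.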

Cached at: 06/11/26, 05:35 PM

Source: https://huggingface.co/papers/2606.12412

Abstract

Vision-language models can improve grounding performance under aggressive token reduction by replacing irreversible visual-token pruning with recoverable routing that allows tokens to re-enter the processing pipeline at later stages.

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected vision tokens pass through decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should not be viewed only as irreversible pruning, but also as recoverable routing. The code is available at https://github.com/elmma/mllm-reroute/
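The difference between rank-and-remove and recoverable routing can be illustrated with a small toy sketch. This is not the authors' implementation: the function names, the six-token example, and the per-stage score table are hypothetical, and the sketch models only which tokens are selected at each stage. It does not model how deferred tokens' hidden states or KV-cache entries are handled.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[int, int], float]  # (stage index, token id) -> importance score


def top_k(candidates: Sequence[int], stage: int, score: Scorer, k: int) -> List[int]:
    """Return the k highest-scoring token ids, listed in ascending id order."""
    ranked = sorted(candidates, key=lambda t: score(stage, t), reverse=True)
    return sorted(ranked[:k])


def rank_and_remove(tokens: Sequence[int], keep: Sequence[int], score: Scorer) -> List[List[int]]:
    """Irreversible pruning: each stage chooses only among earlier survivors."""
    active = list(tokens)
    history = []
    for stage, k in enumerate(keep):
        active = top_k(active, stage, score, k)  # discarded tokens never return
        history.append(active)
    return history


def reroute(tokens: Sequence[int], keep: Sequence[int], score: Scorer) -> List[List[int]]:
    """Recoverable routing: deferred tokens rejoin the pool at the next stage."""
    pool = list(tokens)
    history = []
    for stage, k in enumerate(keep):
        selected = top_k(pool, stage, score, k)            # pass through this stage
        deferred = [t for t in pool if t not in selected]  # bypass this stage
        history.append(selected)
        pool = sorted(selected + deferred)                 # everyone is a candidate again
    return history


# Hypothetical scores in which importance shifts between stage 0 and stage 1.
SCORES = [
    [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],  # stage 0 favors tokens 0, 1, 2
    [0.1, 0.6, 0.2, 0.9, 0.3, 0.8],  # stage 1 favors tokens 3 and 5
]


def table_score(stage: int, token: int) -> float:
    return SCORES[stage][token]


pruned = rank_and_remove(range(6), keep=[3, 2], score=table_score)
routed = reroute(range(6), keep=[3, 2], score=table_score)
print("rank-and-remove:", pruned)
print("reroute:        ", routed)
```

In this toy case both strategies process three tokens in the first stage and two in the second, so the per-stage token counts match. Rank-and-remove can only pick its second-stage tokens from the first-stage survivors, while the routing variant can recover tokens 3 and 5 once their scores rise.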

Similar Articles

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference

Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
