Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
Summary
This paper introduces DRoRAE, a method that improves visual tokenization by fusing multi-layer features from pretrained vision encoders rather than relying solely on the last layer. On ImageNet-256 it reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65, and it reports a log-linear scaling law between fusion capacity and reconstruction quality.
Cached at: 05/13/26, 04:11 AM
Source: https://huggingface.co/papers/2605.10780
Abstract
DRoRAE enhances visual representation by fusing multi-layer features from pretrained vision encoders through adaptive routing and incremental correction, improving reconstruction and generation quality.
Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law (R^2 = 0.86) between fusion capacity and reconstruction quality, identifying representation richness as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in NLP.
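The abstract does not spell out how the fusion module is built, so the following is only a rough sketch of the general idea of multi-layer fusion, not the authors' implementation. It assumes a per-token softmax over intermediate layers as a stand-in for "routing", and a norm cap on the added residual as a stand-in for the "energy constraint"; the names `fuse_token`, `fuse_layers`, `router_logits`, and `max_ratio` are hypothetical, and the toy features are made-up numbers. Plain Python lists keep the example dependency-free.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def norm(vec):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vec))

def fuse_token(layer_vecs, router_logits, max_ratio=0.1):
    """Fuse one token's features across encoder layers.

    layer_vecs: list of L vectors, one per layer (last entry = last layer).
    router_logits: L-1 scores, one per intermediate layer (hypothetical
        router output; a real module would learn these).
    max_ratio: assumed cap on the correction norm, relative to the
        last-layer norm (our stand-in for an energy constraint).
    """
    last = layer_vecs[-1]
    weights = softmax(router_logits)
    # Routed mixture of the intermediate-layer features.
    correction = [
        sum(w * vec[d] for w, vec in zip(weights, layer_vecs[:-1]))
        for d in range(len(last))
    ]
    # Energy cap: shrink the correction if its norm exceeds the budget.
    budget = max_ratio * norm(last)
    corr_norm = norm(correction)
    scale = min(1.0, budget / corr_norm) if corr_norm > 0 else 0.0
    # Incremental correction: last-layer latent plus a bounded residual.
    return [x + scale * c for x, c in zip(last, correction)]

def fuse_layers(layer_feats, router_logits, max_ratio=0.1):
    """layer_feats[l][n] is the feature vector of token n at layer l."""
    num_tokens = len(layer_feats[0])
    return [
        fuse_token([layer[n] for layer in layer_feats],
                   router_logits[n], max_ratio)
        for n in range(num_tokens)
    ]

# Toy input: 3 layers, 2 tokens, 4-dim features (illustrative values only).
feats = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],  # early layer
    [[0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]],  # middle layer
    [[0.0, 0.0, 3.0, 4.0], [3.0, 0.0, 0.0, 4.0]],  # last layer
]
logits = [[0.0, 0.0], [1.0, -1.0]]
fused = fuse_layers(feats, logits, max_ratio=0.1)
```

Because the correction is capped relative to the last-layer norm, the fused latent stays close to what a decoder trained on last-layer features would expect, which is one plausible way to read the abstract's requirement that the enriched latent remain compatible with a frozen pretrained decoder. Setting `max_ratio=0.0` recovers the last-layer features unchanged.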
Similar Articles
Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation
This paper proposes DPVR-LF, a modality-asymmetric routing framework for MLLMs that routes vision tokens at their saturation point into a lightweight side branch and performs late fusion, reducing visual computation while maintaining competitive performance.
From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion
This paper introduces a multimodal image fusion method that uses a 1D token interface from a pretrained image tokenizer to enhance global appearance coherence while preserving local details through selective token editing (STE). Experiments on four benchmarks show state-of-the-art performance in both global coherence and local fidelity.
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
HYDRA-X presents a unified multimodal model that integrates image and video tokenization within a single Vision Transformer, achieving strong performance across understanding and generation tasks.
RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space
RepFusion proposes using multimodal large language models as noisy representation encoders for diffusion transformers in text-to-image generation, outperforming traditional denoising approaches.
Video2LoRA: Parametric Video Internalization for Vision-Language Models
This paper introduces Video2LoRA, a method that predicts Low-Rank Adaptation (LoRA) weights directly from video representations, enabling efficient video processing in frozen vision-language models. It reduces visual token load by up to 1500x and query TTFT by 6-80x while maintaining performance on video summarization and captioning benchmarks.