Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
Summary
This paper introduces DRoRAE, a method that improves visual tokenization by fusing multi-layer features from pretrained vision encoders rather than relying solely on the last layer. It demonstrates significant improvements in reconstruction and generation quality on ImageNet and establishes a scaling law between fusion capacity and performance.
Source: https://huggingface.co/papers/2605.10780
Abstract
DRoRAE enhances visual representation by fusing multi-layer features from pretrained vision encoders through adaptive routing and incremental correction, improving reconstruction and generation quality.
Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law (R² = 0.86) between fusion capacity and reconstruction quality, identifying representation richness as a new, predictably scalable dimension for visual tokenizers, analogous to vocabulary size in NLP.
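The abstract describes the fusion mechanism only at a high level. Below is a minimal PyTorch sketch of what depth-routed multi-layer fusion could look like; it is not the authors' implementation. The module and parameter names (DepthRoutedFusion, router, corrector) are hypothetical, and a plain softmax over learned per-layer logits stands in for the paper's energy-constrained routing.

```python
# Minimal sketch of depth-routed multi-layer fusion (not the authors' code).
# Assumes a frozen ViT-style encoder that exposes all hidden states.
import torch
import torch.nn as nn

class DepthRoutedFusion(nn.Module):
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        # Learned per-layer routing logits; a softmax keeps the mixture
        # weights on the simplex (a stand-in for the paper's
        # energy-constrained routing).
        self.router = nn.Parameter(torch.zeros(num_layers))
        # Lightweight correction head, so the enriched latent stays close
        # to the distribution the frozen pretrained decoder expects.
        self.corrector = nn.Linear(dim, dim)

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one feature map per encoder layer, each (B, N, D)
        stack = torch.stack(hidden_states, dim=0)           # (L, B, N, D)
        weights = torch.softmax(self.router, dim=0)         # (L,)
        fused = (weights.view(-1, 1, 1, 1) * stack).sum(0)  # (B, N, D)
        # Incremental correction: the last-layer feature plus a learned
        # delta computed from the fused multi-layer summary.
        return hidden_states[-1] + self.corrector(fused)
```

Under the decoupled training strategy described in the abstract, a module like this would be optimized first while both encoder and decoder stay frozen, with the decoder fine-tuned afterward to fully exploit the enriched latent.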
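The reported log-linear scaling law relates fusion capacity C to reconstruction quality, roughly rFID ≈ a + b·log C. The snippet below only illustrates how such a fit and its R² would be computed; the capacity and rFID values are made-up placeholders, not data from the paper.

```python
# Illustration of fitting a log-linear scaling law rFID ≈ a + b * log(C).
# NOTE: the arrays below are hypothetical placeholders, not paper data.
import numpy as np

capacity = np.array([1.0, 2.0, 4.0, 8.0, 16.0])      # hypothetical fusion capacities
rfid = np.array([0.60, 0.52, 0.45, 0.37, 0.30])      # hypothetical rFID values

b, a = np.polyfit(np.log(capacity), rfid, deg=1)     # slope, intercept
pred = a + b * np.log(capacity)
ss_res = np.sum((rfid - pred) ** 2)
ss_tot = np.sum((rfid - rfid.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"rFID ≈ {a:.3f} + {b:.3f}·log(C), R² = {r2:.2f}")
```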
Get this paper in your agent:
hf papers read 2605.10780
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Reinforcing Multimodal Reasoning Against Visual Degradation
This paper introduces ROMA, an RL fine-tuning framework that enhances the robustness of multimodal large language models against visual degradations like blur and compression artifacts. It achieves this through a dual-forward-pass strategy and specialized regularization techniques, improving performance on reasoning benchmarks without sacrificing accuracy on clean inputs.
Echo-LoRA: Parameter-Efficient Fine-Tuning via Cross-Layer Representation Injection
The article introduces Echo-LoRA, a new parameter-efficient fine-tuning method that injects cross-layer representations from deeper source layers into shallow LoRA modules to improve performance without adding inference-time overhead.
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
MMCORE introduces a unified multimodal image generation and editing framework that aligns VLM semantic embeddings with diffusion conditioning, achieving state-of-the-art fidelity without costly fusion or from-scratch training.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
This paper introduces RIS, a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models to overcome information bottlenecks. It proposes anchoring latent tokens to spatial and semantic evidence, showing improvements on benchmarks like V* and HRBench.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.