Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
Summary
This paper introduces DRoRAE, a method that improves visual tokenization by fusing multi-layer features from pretrained vision encoders rather than relying solely on the last layer. It demonstrates significant improvements in reconstruction and generation quality on ImageNet and establishes a scaling law between fusion capacity and performance.
Source: https://huggingface.co/papers/2605.10780
Abstract
DRoRAE enhances visual representation by fusing multi-layer features from pretrained vision encoders through adaptive routing and incremental correction, improving reconstruction and generation quality.
Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law (R² = 0.86) between fusion capacity and reconstruction quality, identifying representation richness as a new, predictably scalable dimension for visual tokenizers, analogous to vocabulary size in NLP.
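The abstract describes the fusion mechanism only at a high level. Below is a minimal PyTorch sketch of what depth-routed multi-layer fusion could look like; it is not the authors' implementation. The module and parameter names (DepthRoutedFusion, router, corrector) are hypothetical, and a plain softmax over learned per-layer logits stands in for the paper's energy-constrained routing.

```python
# Minimal sketch of depth-routed multi-layer fusion (not the authors' code).
# Assumes a frozen ViT-style encoder that exposes all hidden states.
import torch
import torch.nn as nn

class DepthRoutedFusion(nn.Module):
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        # Learned per-layer routing logits; a softmax keeps the mixture
        # weights on the simplex (a stand-in for the paper's
        # energy-constrained routing).
        self.router = nn.Parameter(torch.zeros(num_layers))
        # Lightweight correction head, so the enriched latent stays close
        # to the distribution the frozen pretrained decoder expects.
        self.corrector = nn.Linear(dim, dim)

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one feature map per encoder layer, each (B, N, D)
        stack = torch.stack(hidden_states, dim=0)           # (L, B, N, D)
        weights = torch.softmax(self.router, dim=0)         # (L,)
        fused = (weights.view(-1, 1, 1, 1) * stack).sum(0)  # (B, N, D)
        # Incremental correction: the last-layer feature plus a learned
        # delta computed from the fused multi-layer summary.
        return hidden_states[-1] + self.corrector(fused)
```

Under the decoupled training strategy described in the abstract, a module like this would be optimized first while both encoder and decoder stay frozen, with the decoder fine-tuned afterward to fully exploit the enriched latent.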
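The reported log-linear scaling law relates fusion capacity C to reconstruction quality, roughly rFID ≈ a + b·log C. The snippet below only illustrates how such a fit and its R² would be computed; the capacity and rFID values are made-up placeholders, not data from the paper.

```python
# Illustration of fitting a log-linear scaling law rFID ≈ a + b * log(C).
# NOTE: the arrays below are hypothetical placeholders, not paper data.
import numpy as np

capacity = np.array([1.0, 2.0, 4.0, 8.0, 16.0])      # hypothetical fusion capacities
rfid = np.array([0.60, 0.52, 0.45, 0.37, 0.30])      # hypothetical rFID values

b, a = np.polyfit(np.log(capacity), rfid, deg=1)     # slope, intercept
pred = a + b * np.log(capacity)
ss_res = np.sum((rfid - pred) ** 2)
ss_tot = np.sum((rfid - rfid.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"rFID ≈ {a:.3f} + {b:.3f}·log(C), R² = {r2:.2f}")
```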
Get this paper in your agent:
hf papers read 2605.10780
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Reinforcing Multimodal Reasoning Against Visual Degradation
This paper introduces ROMA, an RL fine-tuning framework that enhances the robustness of multimodal large language models against visual degradations like blur and compression artifacts. It achieves this through a dual-forward-pass strategy and specialized regularization techniques, improving performance on reasoning benchmarks without sacrificing accuracy on clean inputs.
Echo-LoRA: Parameter-Efficient Fine-Tuning via Cross-Layer Representation Injection
The article introduces Echo-LoRA, a new parameter-efficient fine-tuning method that injects cross-layer representations from deeper source layers into shallow LoRA modules to improve performance without adding inference-time overhead.
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
MMCORE introduces a unified multimodal image generation and editing framework that aligns VLM semantic embeddings with diffusion conditioning, achieving state-of-the-art fidelity without costly fusion or from-scratch training.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
This paper introduces RIS, a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models to overcome information bottlenecks. It proposes anchoring latent tokens to spatial and semantic evidence, showing improvements on benchmarks like V* and HRBench.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.