DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

Hugging Face Daily Papers

Summary

DecQ introduces lightweight detail-condensing queries to improve reconstruction and generation in representation autoencoders without disrupting pretrained semantic spaces.

Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3× faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.
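
To put the reported PSNR gain in perspective, the short calculation below converts the two figures from the abstract (19.13 dB and 22.76 dB) into an implied reduction in mean squared error. It is only a rough sketch: it assumes the standard definition PSNR = 10 log10(MAX^2 / MSE) and treats each reported number as if it came from a single MSE value, whereas the abstract does not say how PSNR was averaged across images. The helper names are illustrative and not taken from the paper.

```python
import math

def psnr(mse, max_val=1.0):
    """Standard PSNR in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_from_psnr(psnr_db, max_val=1.0):
    """Invert the PSNR definition to recover the implied MSE."""
    return max_val ** 2 / (10.0 ** (psnr_db / 10.0))

# Figures quoted in the abstract.
baseline_db, decq_db = 19.13, 22.76

gain_db = decq_db - baseline_db
# Assumption: each figure behaves like the PSNR of one aggregate MSE.
mse_ratio = mse_from_psnr(baseline_db) / mse_from_psnr(decq_db)

print(f"gain: {gain_db:.2f} dB, implied MSE reduction: {mse_ratio:.2f}x")
```

Under these assumptions, a gain of about 3.6 dB corresponds to roughly halving the mean squared error, since every 3 dB of PSNR is close to a factor of two in MSE.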

Cached at: 05/22/26, 10:19 AM

Source: https://huggingface.co/papers/2605.22777

Abstract

DecQ enhances representation autoencoders by introducing lightweight queries that improve reconstruction quality and generative performance without disrupting pretrained semantic spaces.

Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3× faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.
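
The abstract does not describe the internals of the condenser modules, so the following is only a speculative sketch of one way a small set of queries could gather information from intermediate features: single-head cross-attention with a residual update, applied once per selected layer. Everything here is an assumption made for illustration, including the use of cross-attention, the random stand-in features, the weight shapes, and the names `condense`, `layer_feats`, and `decoder_input`. Only the count of 8 queries comes from the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def condense(queries, features, w_q, w_k, w_v):
    """Hypothetical condenser: queries (Q, D) cross-attend to features (N, D)."""
    q = queries @ w_q
    k = features @ w_k
    v = features @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (Q, N)
    return queries + attn @ v  # residual update of the queries

rng = np.random.default_rng(0)
num_queries, num_patches, dim = 8, 256, 64  # 8 is from the abstract; the rest is made up

# Stand-in for learnable queries (randomly initialised here, never trained).
queries = rng.normal(size=(num_queries, dim)) * 0.02

# Random stand-ins for frozen VFM features from one shallow and one deep layer.
layer_feats = [rng.normal(size=(num_patches, dim)) for _ in range(2)]

for feats in layer_feats:
    # A separate set of projection weights per layer (an assumption).
    w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
    queries = condense(queries, feats, w_q, w_k, w_v)

# The decoder would then see the patch tokens plus the condensed queries.
patch_tokens = layer_feats[-1]
decoder_input = np.concatenate([patch_tokens, queries], axis=0)
print(decoder_input.shape)  # (264, 64)
```

In this toy setup the frozen features are never modified; only the queries change, which mirrors the stated goal of adding detail without disturbing the pretrained semantic space. The actual DecQ architecture may differ in every respect shown here.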

Get this paper in your agent:

hf papers read 2605.22777

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

arXiv cs.LG

This paper addresses the issue of dimensional collapse in VQ-VAEs, showing that representations often occupy a low-dimensional subspace. It proposes an 'AE Warm-Up' strategy that trains the model as an unquantized autoencoder first, which improves reconstruction quality and increases effective latent dimensionality.

Qwen-Image-VAE-2.0 Technical Report

Hugging Face Daily Papers

Qwen-Image-VAE-2.0 is a high-compression Variational Autoencoder suite that improves reconstruction fidelity and diffusability through enhanced architecture, large-scale training, and semantic alignment strategies.

DeSQ: Decomposition-based SPARQL Query Generation

arXiv cs.CL

DeSQ is a decomposition-based framework for generating SPARQL queries from natural language questions. It breaks complex questions into atomic constraints, maps them to SPARQL fragments, and assembles them into complete queries, outperforming state-of-the-art on four out of five benchmarks.

Understanding VQ-VAE (DALL-E Explained Pt. 1)

ML at Berkeley

An educational blog post explaining the Vector Quantized Variational Autoencoder (VQ-VAE) architecture, a key component of OpenAI's DALL-E image generation model.