DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders
Summary
DecQ introduces lightweight detail-condensing queries to improve reconstruction and generation in representation autoencoders without disrupting pretrained semantic spaces.
Source: https://huggingface.co/papers/2605.22777
Abstract
DecQ enhances representation autoencoders by introducing lightweight queries that improve reconstruction quality and generative performance without disrupting pretrained semantic spaces.
Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3× faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.
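To make the query idea concrete, the following is a minimal sketch of how a small set of queries could condense information from shallow and deep feature maps and then travel alongside the patch tokens. It is not the authors' implementation. The abstract does not specify the condenser design, so single-head cross-attention without learned projections, the residual update, and concatenation with the patch tokens are all assumptions made for illustration. The names `cross_attend`, `condense`, and `random_tokens` are hypothetical, and random vectors stand in for real VFM features. Only the count of 8 queries comes from the abstract.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head cross-attention with no learned projections (assumed simplification)."""
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        # Scaled dot-product score of this query against every feature token.
        scores = [scale * sum(a * b for a, b in zip(q, f)) for f in features]
        weights = softmax(scores)
        # Weighted average of the feature tokens.
        attended.append(
            [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
        )
    return attended


def condense(queries, layer_features):
    """Hypothetical condenser: refine the queries against each tapped layer in turn."""
    for features in layer_features:
        update = cross_attend(queries, features)
        # Residual add, so information gathered from earlier layers is kept.
        queries = [[qk + uk for qk, uk in zip(q, u)] for q, u in zip(queries, update)]
    return queries


def random_tokens(rng, count, dim):
    """Random stand-in for a set of tokens; real inputs would be VFM features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
# NUM_QUERIES = 8 follows the abstract; DIM and NUM_PATCHES are illustrative only.
DIM, NUM_PATCHES, NUM_QUERIES = 16, 64, 8

shallow = random_tokens(rng, NUM_PATCHES, DIM)       # stand-in for an intermediate VFM layer
deep = random_tokens(rng, NUM_PATCHES, DIM)          # stand-in for the final-layer patch tokens
init_queries = random_tokens(rng, NUM_QUERIES, DIM)  # would be learned parameters in practice

detail_queries = condense(init_queries, [shallow, deep])
decoder_input = deep + detail_queries  # assumed: queries are appended to the patch tokens

print(len(decoder_input), len(decoder_input[0]))  # prints: 72 16
```

In this sketch the patch tokens are passed through untouched, which mirrors the stated goal of leaving the frozen VFM's semantic space undisturbed while the extra queries carry the fine-grained detail.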
Similar Articles
Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse
This paper addresses the issue of dimensional collapse in VQ-VAEs, showing that representations often occupy a low-dimensional subspace. It proposes an 'AE Warm-Up' strategy that trains the model as an unquantized autoencoder first, which improves reconstruction quality and increases effective latent dimensionality.
Qwen-Image-VAE-2.0 Technical Report
Qwen-Image-VAE-2.0 is a high-compression Variational Autoencoder suite that improves reconstruction fidelity and diffusability through enhanced architecture, large-scale training, and semantic alignment strategies.
DeSQ: Decomposition-based SPARQL Query Generation
DeSQ is a decomposition-based framework for generating SPARQL queries from natural language questions. It breaks complex questions into atomic constraints, maps them to SPARQL fragments, and assembles them into complete queries, outperforming state-of-the-art on four out of five benchmarks.
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution
RankE introduces an end-to-end post-training framework for discrete text-to-image generation that jointly optimizes both the generator and decoder to address the latent covariate shift problem, improving alignment and fidelity simultaneously.
Understanding VQ-VAE (DALL-E Explained Pt. 1)
An educational blog post explaining the Vector Quantized Variational Autoencoder (VQ-VAE) architecture, a key component of OpenAI's DALL-E image generation model.