Qwen-Image-VAE-2.0 Technical Report

Hugging Face Daily Papers

Summary

Qwen-Image-VAE-2.0 is a high-compression Variational Autoencoder suite that improves reconstruction fidelity and diffusability through enhanced architecture, large-scale training, and semantic alignment strategies.

We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of a high-dimensional latent space, we implement an enhanced semantic alignment strategy that makes the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction benchmarks. To evaluate performance in text-rich scenarios, we propose OmniDoc-TokenBench, a new benchmark comprising a diverse collection of real-world documents coupled with specialized OCR-based evaluation metrics. Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance, demonstrating exceptional capabilities in both general domains and text-rich scenarios at high compression ratios. Furthermore, downstream DiT experiments reveal that our models possess superior diffusability, significantly accelerating convergence compared to existing high-compression baselines. These results establish Qwen-Image-VAE-2.0 as a leading model with high compression, superior reconstruction, and exceptional diffusability.
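The abstract does not spell out the OCR-based metrics used in OmniDoc-TokenBench. As a rough illustration only, one common way to score text fidelity is to run OCR on both the original and the reconstructed image and compare the recognized strings by character error rate. The sketch below assumes that setup; the function names `edit_distance` and `character_error_rate` are hypothetical and not taken from the report, and the OCR step itself is left out.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed metric, not the paper's)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical OCR outputs for an original image and its VAE reconstruction.
ocr_original = "Invoice 2024"
ocr_reconstructed = "Invoice 2O24"  # one digit misread as a letter
cer = character_error_rate(ocr_original, ocr_reconstructed)
```

Under this assumed metric a lower value means the reconstruction preserved the text better; a score of 0 means the two OCR outputs are identical.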

Cached at: 05/14/26, 04:16 AM

Source: https://huggingface.co/papers/2605.13565


Datasets citing this paper: 1

alibabagroup/OmniDoc-TokenBench

Similar Articles

Qwen-Image-2.0 Technical Report

Hugging Face Daily Papers

Qwen-Image-2.0 is a new image generation foundation model that unifies high-fidelity synthesis and precise editing using Qwen3-VL and a Multimodal Diffusion Transformer. It excels in text-rich content, multilingual typography, and photorealistic generation.

Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

arXiv cs.LG

This paper addresses the issue of dimensional collapse in VQ-VAEs, showing that representations often occupy a low-dimensional subspace. It proposes an 'AE Warm-Up' strategy that trains the model as an unquantized autoencoder first, which improves reconstruction quality and increases effective latent dimensionality.

ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

Hugging Face Daily Papers

ViQ presents a visual quantization framework that balances semantic richness and detail preservation in discrete representations, enabling efficient multimodal training with native-resolution inputs by using text-aligned pre-training and proximal representation learning.

Understanding VQ-VAE (DALL-E Explained Pt. 1)

ML at Berkeley

An educational blog post explaining the Vector Quantized Variational Autoencoder (VQ-VAE) architecture, a key component of OpenAI's DALL-E image generation model.