Qwen-Image-VAE-2.0 Technical Report

Hugging Face Daily Papers 05/13/26, 12:00 AM Papers

Summary

Qwen-Image-VAE-2.0 is a high-compression Variational Autoencoder suite that improves reconstruction fidelity and diffusability through enhanced architecture, large-scale training, and semantic alignment strategies.

We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of a high-dimensional latent space, we implement an enhanced semantic alignment strategy to make the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction benchmarks. To evaluate performance in text-rich scenarios, we propose OmniDoc-TokenBench, a new benchmark comprising a diverse collection of real-world documents coupled with specialized OCR-based evaluation metrics. Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance, demonstrating exceptional capabilities in both general domains and text-rich scenarios at high compression ratios. Furthermore, downstream DiT experiments reveal that our models possess superior diffusability, significantly accelerating convergence compared to existing high-compression baselines. These results establish Qwen-Image-VAE-2.0 as a leading model with high compression, superior reconstruction, and exceptional diffusability.
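To make the relationship between spatial compression and latent channels concrete, here is a minimal arithmetic sketch. The abstract does not state the actual downsampling factors or channel counts, so `VAEConfig`, the helper names, and the two example settings (factor 8 with 16 channels, factor 16 with 64 channels) are illustrative assumptions, not the report's configurations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VAEConfig:
    """Hypothetical VAE shape parameters (assumed, not taken from the report)."""
    spatial_factor: int   # downsampling factor per spatial side
    latent_channels: int  # number of channels in the latent tensor


def latent_shape(height: int, width: int, cfg: VAEConfig) -> tuple[int, int, int]:
    """Return (channels, h, w) of the latent for an image of the given size."""
    if height % cfg.spatial_factor or width % cfg.spatial_factor:
        raise ValueError("image sides must be divisible by the spatial factor")
    return (
        cfg.latent_channels,
        height // cfg.spatial_factor,
        width // cfg.spatial_factor,
    )


def compression_ratio(height: int, width: int, cfg: VAEConfig) -> float:
    """Ratio of RGB input values (3 * H * W) to latent values (C * h * w)."""
    c, h, w = latent_shape(height, width, cfg)
    return (3 * height * width) / (c * h * w)


def spatial_positions(height: int, width: int, cfg: VAEConfig) -> int:
    """Number of latent positions a downstream model sees (before patchifying)."""
    _, h, w = latent_shape(height, width, cfg)
    return h * w


# Two assumed settings, compared on a 1024x1024 RGB image.
low = VAEConfig(spatial_factor=8, latent_channels=16)
high = VAEConfig(spatial_factor=16, latent_channels=64)

for name, cfg in (("f8/c16", low), ("f16/c64", high)):
    print(
        name,
        latent_shape(1024, 1024, cfg),
        compression_ratio(1024, 1024, cfg),
        spatial_positions(1024, 1024, cfg),
    )
```

Under these assumed settings, both configurations store one latent value per 12 input values, but the factor-16 one has 4,096 latent positions instead of 16,384. This may be one reason expanded latent channels are paired with higher spatial compression: the per-position capacity grows while the sequence a diffusion model must process shrinks.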

Cached at: 05/14/26, 04:16 AM

Paper page - Qwen-Image-VAE-2.0 Technical Report

Source: https://huggingface.co/papers/2605.13565 Authors:

Datasets citing this paper: 1

#### alibabagroup/OmniDoc-TokenBench
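The abstract describes OmniDoc-TokenBench as pairing real-world documents with OCR-based metrics but does not define them here. As a hedged illustration of what such a metric could look like, the sketch below scores text preservation with normalized edit distance between OCR output on the original image and OCR output on its reconstruction. The choice of metric, the function names, and the assumption that OCR strings are already available are all hypothetical and not taken from the benchmark.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical text."""
    longest = max(len(reference), len(hypothesis))
    if longest == 0:
        return 0.0
    return edit_distance(reference, hypothesis) / longest


def text_fidelity(pairs: list[tuple[str, str]]) -> float:
    """Mean of (1 - NED) over (ocr_of_original, ocr_of_reconstruction) pairs."""
    if not pairs:
        raise ValueError("at least one pair is required")
    scores = [1.0 - normalized_edit_distance(ref, hyp) for ref, hyp in pairs]
    return sum(scores) / len(scores)
```

A score of 1.0 would mean the OCR engine reads the same text from every reconstruction as from its source image; lower values indicate characters lost or altered by the VAE round trip.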

Similar Articles

Qwen-Image-2.0 Technical Report

Qwen-Image-2.0 Technical Report (57 minute read)

Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

Understanding VQ-VAE (DALL-E Explained Pt. 1)
