Qwen-Image-VAE-2.0 Technical Report
Summary
Qwen-Image-VAE-2.0 is a high-compression Variational Autoencoder suite that improves reconstruction fidelity and diffusability through enhanced architecture, large-scale training, and semantic alignment strategies.
Cached at: 05/14/26, 04:16 AM
Source: https://huggingface.co/papers/2605.13565
Abstract
We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of high-dimensional latent space, we implement an enhanced semantic alignment strategy to make the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction benchmarks. To evaluate performance in text-rich scenarios, we propose OmniDoc-TokenBench, a new benchmark comprising a diverse collection of real-world documents coupled with specialized OCR-based evaluation metrics. Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance, demonstrating exceptional capabilities in both general domains and text-rich scenarios at high compression ratio. Furthermore, downstream DiT experiments reveal our models possess superior diffusability, significantly accelerating convergence compared to existing high-compression baselines. These establish Qwen-Image-VAE-2.0 as a leading model with high compression, superior reconstruction, and exceptional diffusability.
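The abstract pairs high spatial compression with expanded latent channels but, as quoted here, gives no concrete numbers. The sketch below only illustrates the general arithmetic behind that trade-off: for a VAE that downsamples each spatial side by a factor f and emits c latent channels, an RGB image is compressed by 3·f²/c in element count. The function names and the (f, c) pairs are hypothetical, chosen for illustration, and are not the configurations of Qwen-Image-VAE-2.0.

```python
def latent_shape(height, width, downsample, latent_channels):
    """Return (channels, h, w) of the latent for an image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, downsample, latent_channels, image_channels=3):
    """Ratio of input values to latent values (element count, not bits)."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)


# Hypothetical configurations; NOT taken from the report.
for f, c in [(8, 16), (16, 64), (32, 128)]:
    shape = latent_shape(1024, 1024, f, c)
    positions = shape[1] * shape[2]  # spatial positions a diffusion model must process
    ratio = compression_ratio(1024, 1024, f, c)
    print(f"f={f:>2} c={c:>3} latent={shape} positions={positions} ratio={ratio:.1f}")
```

Under these assumed settings, raising f from 8 to 16 while growing c from 16 to 64 keeps the element-count ratio unchanged but cuts the number of spatial positions by four, which is one way to see why wider latents accompany higher spatial compression and why the resulting high-dimensional latent space can be harder for a diffusion model to fit.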
Datasets citing this paper
alibabagroup/OmniDoc-TokenBench
Similar Articles
Qwen-Image-2.0 Technical Report
Qwen-Image-2.0 is a new image generation foundation model that unifies high-fidelity synthesis and precise editing using Qwen3-VL and a Multimodal Diffusion Transformer. It excels in text-rich content, multilingual typography, and photorealistic generation.
Qwen-Image-2.0 Technical Report (57 minute read)
This technical report presents Qwen-Image-2.0, a new image generation model from Alibaba's Qwen team, detailing its architecture and capabilities.
Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse
This paper addresses the issue of dimensional collapse in VQ-VAEs, showing that representations often occupy a low-dimensional subspace. It proposes an 'AE Warm-Up' strategy that trains the model as an unquantized autoencoder first, which improves reconstruction quality and increases effective latent dimensionality.
ViQ: Text-Aligned Visual Quantized Representations at Any Resolution
ViQ presents a visual quantization framework that balances semantic richness and detail preservation in discrete representations, enabling efficient multimodal training with native-resolution inputs by using text-aligned pre-training and proximal representation learning.
Understanding VQ-VAE (DALL-E Explained Pt. 1)
An educational blog post explaining the Vector Quantized Variational Autoencoder (VQ-VAE) architecture, a key component of OpenAI's DALL-E image generation model.