Qwen-Image-2.0 Technical Report

Hugging Face Daily Papers

Summary

Qwen-Image-2.0 is a new image generation foundation model that unifies high-fidelity synthesis and precise editing using Qwen3-VL and a Multimodal Diffusion Transformer. It excels in text-rich content, multilingual typography, and photorealistic generation.

We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment, especially in text-rich and compositionally complex scenarios. Qwen-Image-2.0 addresses these challenges by coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer for joint condition-target modeling, supported by large-scale data curation and a customized multi-stage training pipeline. This enables strong multimodal understanding while preserving flexible generation and editing capabilities. The model supports instructions of up to 1K tokens for generating text-rich content such as slides, posters, infographics, and comics, while significantly improving multilingual text fidelity and typography. It also enhances photorealistic generation with richer details, more realistic textures, and coherent lighting, and follows complex prompts more reliably across diverse styles. Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms previous Qwen-Image models in both generation and editing, marking a step toward more general, reliable, and practical image generation foundation models.

Cached at: 05/12/26, 07:30 AM

Source: https://huggingface.co/papers/2605.10730 Published on May 11

#2 Paper of the day

Abstract

Qwen-Image-2.0 is an advanced image generation model that combines high-fidelity synthesis with precise editing in a unified framework, using Qwen3-VL as the condition encoder and a Multimodal Diffusion Transformer for joint modeling.

We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment, especially in text-rich and compositionally complex scenarios. Qwen-Image-2.0 addresses these challenges by coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer for joint condition-target modeling, supported by large-scale data curation and a customized multi-stage training pipeline. This enables strong multimodal understanding while preserving flexible generation and editing capabilities. The model supports instructions of up to 1K tokens for generating text-rich content such as slides, posters, infographics, and comics, while significantly improving multilingual text fidelity and typography. It also enhances photorealistic generation with richer details, more realistic textures, and coherent lighting, and follows complex prompts more reliably across diverse styles. Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms previous Qwen-Image models in both generation and editing, marking a step toward more general, reliable, and practical image generation foundation models.
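
The abstract names the components (Qwen3-VL as the condition encoder, a Multimodal Diffusion Transformer for joint condition-target modeling) but gives no architectural detail. As a rough illustration of what joint modeling over two token streams can look like, the toy sketch below runs single-head attention over a concatenated condition and target sequence, with separate projection weights per stream. This is an assumption-laden sketch of the general MMDiT-style idea, not the Qwen-Image-2.0 implementation: the function names, dimensions, token counts, and random weights are all hypothetical.

```python
import math
import random


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    """Hypothetical stand-in for learned projection weights."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def joint_attention(cond_tokens, target_tokens, cond_proj, target_proj):
    """Single-head attention over the concatenated condition + target sequence.

    Each stream has its own (Wq, Wk, Wv) projections, but every token
    attends to all tokens of both streams.
    """
    queries, keys, values = [], [], []
    streams = ((cond_tokens, cond_proj), (target_tokens, target_proj))
    for tokens, (w_q, w_k, w_v) in streams:
        for token in tokens:
            queries.append(matvec(w_q, token))
            keys.append(matvec(w_k, token))
            values.append(matvec(w_v, token))

    dim = len(queries[0])
    outputs = []
    for query in queries:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, values)) for d in range(dim)]
        )

    # Split the joint sequence back into the two streams.
    n_cond = len(cond_tokens)
    return outputs[:n_cond], outputs[n_cond:]


rng = random.Random(0)
DIM = 4  # toy embedding width (assumption)
cond = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]    # stand-in for encoder output
target = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]  # stand-in for noisy image latents
cond_proj = tuple(random_matrix(rng, DIM, DIM) for _ in range(3))
target_proj = tuple(random_matrix(rng, DIM, DIM) for _ in range(3))

cond_out, target_out = joint_attention(cond, target, cond_proj, target_proj)
print(len(cond_out), len(target_out))  # 3 5
```

In this sketch the target tokens see the condition tokens in every attention call rather than through a separate cross-attention layer, which is one common reading of "joint" modeling; the report itself should be consulted for the actual design.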

Similar Articles

Qwen-Image-VAE-2.0 Technical Report

Hugging Face Daily Papers

Qwen-Image-VAE-2.0 is a high-compression Variational Autoencoder suite that improves reconstruction fidelity and diffusability through enhanced architecture, large-scale training, and semantic alignment strategies.

Qwen-Image-Flash: Beyond Objective Design

Hugging Face Daily Papers

This paper investigates training recipes for few-step distillation of visual generative models, using Qwen-Image-2.0 as a case study. It reveals non-obvious behaviors and proposes Qwen-Image-Flash.

Qwen-Image-Agent: Bridging the Context Gap in Real-World Image Generation

Hugging Face Daily Papers

Qwen-Image-Agent proposes a unified agentic framework that addresses the context gap in text-to-image generation by integrating planning, reasoning, searching, and memory mechanisms. It introduces IA-Bench for evaluation and achieves state-of-the-art performance.

Qwen-Image-Flash (26 minute read)

TLDR AI

This paper from Alibaba revisits few-step distillation for visual generative models, focusing on training recipe factors such as data composition, teacher guidance, and task mixture, using Qwen-Image-2.0 as a case study to develop Qwen-Image-Flash.