High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation
Summary
This paper introduces Z-Image Turbo++, a two-step image generation model distilled from an eight-step teacher using distribution-aligned adversarial learning, step-decoupled parameterization, and end-to-end training with iterative regularization to narrow the quality gap with multi-step generation.
Cached at: 06/12/26, 06:50 AM
Source: https://huggingface.co/papers/2606.12575
Abstract
A 2-step image generation model is developed through distillation from an 8-step teacher using distribution-aligned adversarial learning, step-decoupled parameterization, and end-to-end training with iterative regularization.
Few-step diffusion distillation has become increasingly mature for 4-8-step generation, yet pushing further to 2 steps remains challenging. In this work, we introduce Z-Image Turbo++, a high-quality 2-step image generation model distilled from the 8-step Z-Image Turbo teacher. Our method addresses the central bottlenecks of increased task difficulty and limited model capacity in 2-step generation through three simple but effective design choices tailored to this regime. First, we propose Distribution-Aligned Adversarial Learning, which uses teacher-generated images rather than external real images as real samples for GAN training, providing a more attainable and informative adversarial target. Second, we adopt Step-Decoupled Parameterization, assigning independent model parameters to the two denoising steps to better match their distinct capacity demands. Third, we perform End-to-End Training with Iterative Regularization, allowing the first step to receive gradients from final image quality while preserving a meaningful intermediate generation through an explicit step-1 loss. Together, these designs substantially narrow the quality gap between 2-step and 8-step generation in both qualitative and quantitative evaluations, highlighting the potential of carefully tailored distillation strategies for improving the quality-efficiency trade-off in few-step generation.
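To make the three design choices concrete, the toy sketch below shows how they could fit together in one training objective. It is not the paper's implementation: the scalar "images", the linear `teacher`, `discriminator`, and `TwoStepStudent`, the logistic GAN loss, the squared-error form of the step-1 loss, and the `step1_weight` value are all assumptions chosen only so the example runs with the Python standard library.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mean(values):
    return sum(values) / len(values)


def teacher(z):
    # Hypothetical stand-in for the 8-step teacher: noise -> scalar "image".
    return 2.0 * z + 1.0


def discriminator(x, d_w=1.0, d_b=-1.0):
    # Hypothetical scalar critic returning a logit (higher = "more real").
    return d_w * x + d_b


class TwoStepStudent:
    """Toy student with independent parameters for each denoising step."""

    def __init__(self, step1_params, step2_params):
        self.step1_params = dict(step1_params)  # used only at step 1
        self.step2_params = dict(step2_params)  # used only at step 2

    def step1(self, z):
        p = self.step1_params
        return p["w"] * z + p["b"]

    def step2(self, x1):
        p = self.step2_params
        return p["w"] * x1 + p["b"]

    def sample(self, z):
        x1 = self.step1(z)   # intermediate generation
        x2 = self.step2(x1)  # final image; also a function of step-1 parameters
        return x1, x2


def discriminator_loss(real, fake):
    # "real" holds teacher generations, not external photographs.
    real_term = mean([softplus(-discriminator(r)) for r in real])
    fake_term = mean([softplus(discriminator(f)) for f in fake])
    return real_term + fake_term


def student_loss(student, noise, step1_weight=0.5):
    teacher_images = [teacher(z) for z in noise]
    samples = [student.sample(z) for z in noise]
    x1s = [s[0] for s in samples]
    x2s = [s[1] for s in samples]
    # Adversarial term on the final image: in an autograd framework its
    # gradient would reach both step-2 and step-1 parameters (end-to-end).
    adv = mean([softplus(-discriminator(x2)) for x2 in x2s])
    # Assumed step-1 regularizer: keep the intermediate close to the teacher.
    step1 = mean([(x1 - t) ** 2 for x1, t in zip(x1s, teacher_images)])
    return adv + step1_weight * step1, adv, step1


noise = [-1.0, 0.0, 0.5, 1.0]
student = TwoStepStudent({"w": 1.5, "b": 0.5}, {"w": 1.0, "b": 0.0})
total, adv, step1 = student_loss(student, noise)
d_loss = discriminator_loss(
    real=[teacher(z) for z in noise],
    fake=[student.sample(z)[1] for z in noise],
)
print(f"student total={total:.4f} adv={adv:.4f} step1={step1:.4f} D={d_loss:.4f}")
```

In this sketch, changing only `step1_params` changes the adversarial term computed on the final output, which is the dependency that end-to-end training exploits, while the step-1 term keeps the intermediate from drifting arbitrarily.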
Similar Articles
Qwen-Image-Flash (26 minute read)
This paper from Alibaba revisits few-step distillation for visual generative models, focusing on training recipe factors such as data composition, teacher guidance, and task mixture, using Qwen-Image-2.0 as a case study to develop Qwen-Image-Flash.
@HuggingPapers: Alibaba released Qwen-Image-Flash Few-step distillation goes beyond objectives. Data composition, teacher guidance, and…
Alibaba released Qwen-Image-Flash, a few-step distilled model for fast, high-quality text-to-image generation and instruction-guided editing, leveraging data composition, teacher guidance, and task mixture.
Reinforcing Few-step Generators via Reward-Tilted Distribution Matching
RTDMD is a two-stage framework combining distribution matching distillation with reward-guided reinforcement learning to improve few-step image generation alignment with human preferences. It achieves state-of-the-art results on multiple models with only 4 inference steps.
Qwen-Image-Flash: Beyond Objective Design
This paper investigates training recipes for few-step distillation of visual generative models, using Qwen-Image-2.0 as a case study. It reveals non-obvious behaviors and proposes Qwen-Image-Flash.
@jiqizhixin: What if you could generate high-quality images in one step instead of hundreds? Stanford and ByteDance introduce W-Flow…
Stanford and ByteDance introduce W-Flow, a single-step generative model that uses Wasserstein gradient flows to achieve state-of-the-art one-step ImageNet 256x256 generation (1.29 FID) with 100x faster sampling than multi-step diffusion models.