Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis
Summary
This paper proposes Nemotron-Labs-Diffusion-Image, a masked discrete diffusion model for high-resolution text-to-image synthesis, introducing a token-editing mechanism and grouped cross-entropy objective to improve token refinement and training efficiency.
Cached at: 06/30/26, 03:33 AM
Source: https://huggingface.co/papers/2606.29814
Abstract
A masked discrete diffusion model for text-to-image synthesis that addresses limitations in token refinement and training efficiency through novel mechanisms and optimizations.
We propose Nemotron-Labs-Diffusion-Image, a state-of-the-art masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. Compared with prior work on masked image generation, Nemotron-Labs-Diffusion-Image addresses two key challenges. First, unlike continuous diffusion models, which progressively refine latent representations across the entire image, standard MDMs lack self-correcting capability because discrete tokens cannot be modified once they are unmasked. Second, although increasing the vocabulary size of discrete image tokenizers improves reconstruction fidelity, it introduces optimization difficulties for generative modeling as the per-token training signal becomes increasingly sparse. To address the first challenge, Nemotron-Labs-Diffusion-Image incorporates a token-editing mechanism that enables the model to dynamically revise already-unmasked tokens during inference, similar to how a sculptor iteratively refines their work. To tackle the second challenge, we propose a Grouped Cross-Entropy (GCE) objective that assigns positive learning signals to tokens neighboring the ground truth in embedding space, thereby alleviating signal sparsity. To further improve training efficiency, we implement a custom fused operator for GCE that significantly reduces VRAM usage in large-vocabulary settings. Experimental results demonstrate that these innovations substantially improve both training efficiency and image fidelity of masked discrete image generators, achieving a score of 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3.
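The abstract does not give the exact form of the GCE objective, so the following is only a minimal sketch of one plausible reading: the "group" is taken to be the k nearest codebook entries to the ground-truth token (by squared Euclidean distance), and the loss is the negative log of the total probability the model places on that group. The function names, the choice of k-nearest-neighbor grouping, the distance metric, and the toy codebook are all assumptions made for illustration, not the paper's definitions, and the fused operator is not modeled here.

```python
import math

def nearest_neighbors(codebook, target, k):
    """Indices of the k codebook entries closest to codebook[target].

    Assumption: neighbors are chosen by squared Euclidean distance in
    embedding space; the target itself is always included (distance 0).
    """
    anchor = codebook[target]

    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(codebook[i], anchor))

    return sorted(range(len(codebook)), key=sq_dist)[:k]

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cross_entropy(logits, target):
    """Standard cross-entropy: only the ground-truth token is rewarded."""
    return log_sum_exp(logits) - logits[target]

def grouped_cross_entropy(logits, codebook, target, k):
    """Hypothetical grouped variant: -log of the probability mass on the
    target's k-nearest-neighbor group, so close tokens also count as correct."""
    group = nearest_neighbors(codebook, target, k)
    return log_sum_exp(logits) - log_sum_exp([logits[i] for i in group])

# Toy vocabulary of 4 tokens: 0 and 1 are near each other, as are 2 and 3.
codebook = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
# The model prefers token 1, a close neighbor of the ground truth (token 0).
logits = [1.0, 2.0, 0.5, 0.5]

ce = cross_entropy(logits, target=0)
gce = grouped_cross_entropy(logits, codebook, target=0, k=2)
print("cross-entropy:", ce)
print("grouped cross-entropy (k=2):", gce)
```

Under this reading, k=1 reduces to ordinary cross-entropy, and larger groups lower the penalty when the model's probability mass lands on a near neighbor of the ground truth, which is one way the per-token signal could become less sparse as the vocabulary grows.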
Similar Articles
Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models
NVIDIA introduces Nemotron-Labs Diffusion, a family of diffusion language models that generate text in parallel and iteratively refine it, offering faster generation and the ability to revise previous tokens.
Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding
Set Diffusion introduces a new class of language models that interpolates between autoregressive and diffusion models by factorizing token generation over flexible-position, flexible-length token sets. This enables faster decoding and flexible token ordering, achieving better speed-quality tradeoffs on reasoning, summarization, and unconditional generation tasks.
MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training
MaskAlign proposes a token-subset representation alignment method that improves diffusion transformer training by reducing reliance on complete token sets and maintaining stable alignment under perturbations.
Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
This paper proposes Decoupled Residual Denoising Diffusion Models (DRDD) for unified and data-efficient image-to-image translation, decoupling noise diffusion for domain harmonization from residual diffusion for semantic mapping.
Discrete Stochastic Localization for Non-autoregressive Generation
Introduces Discrete Stochastic Localization (DSL), a continuous-state diffusion framework for non-autoregressive text generation that uses unit-sphere token embeddings and a timestep-invariant denoiser, achieving better distributional faithfulness than masked discrete diffusion models on OpenWebText.