Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

Hugging Face Daily Papers 06/29/26, 12:00 AM Papers

Summary

This paper proposes Nemotron-Labs-Diffusion-Image, a masked discrete diffusion model for high-resolution text-to-image synthesis, introducing a token-editing mechanism and grouped cross-entropy objective to improve token refinement and training efficiency.

We propose Nemotron-Labs-Diffusion-Image, a state-of-the-art masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. Compared with prior work on masked image generation, Nemotron-Labs-Diffusion-Image addresses two key challenges. First, unlike continuous diffusion models, which progressively refine latent representations across the entire image, standard MDMs lack self-correcting capability because discrete tokens cannot be modified once they are unmasked. Second, although increasing the vocabulary size of discrete image tokenizers improves reconstruction fidelity, it introduces optimization difficulties for generative modeling as the per-token training signal becomes increasingly sparse. To address the first challenge, Nemotron-Labs-Diffusion-Image incorporates a token-editing mechanism that enables the model to dynamically revise already-unmasked tokens during inference, similar to how a sculptor iteratively refines their work. To tackle the second challenge, we propose a Grouped Cross-Entropy (GCE) objective that assigns positive learning signals to tokens neighboring the ground truth in embedding space, thereby alleviating signal sparsity. To further improve training efficiency, we implement a custom fused operator for GCE that significantly reduces VRAM usage in large-vocabulary settings. Experimental results demonstrate that these innovations substantially improve both training efficiency and image fidelity of masked discrete image generators, achieving a score of 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3.
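The token-editing idea can be illustrated with a toy decoding loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not give the sampling schedule or the editing rule, so the names `decode_with_editing`, `toy_predict`, and `edit_threshold`, the confidence-threshold rule for edits, and the even unmasking schedule are all hypothetical. The stand-in predictor makes a confident early mistake at position 0 and corrects itself once more tokens are visible, which is the situation editing is meant to recover from.

```python
MASK = -1  # sentinel for a still-masked position (assumption for this sketch)

def decode_with_editing(predict, length, steps, edit_threshold=0.9):
    """Iteratively unmask tokens; optionally revise already-unmasked ones.

    predict(tokens) must return one (token, confidence) pair per position.
    """
    tokens = [MASK] * length
    for step in range(steps):
        preds = predict(tokens)
        # Edit pass (hypothetical rule): overwrite a committed token when the
        # model now confidently predicts a different token at that position.
        for i, (tok, conf) in enumerate(preds):
            if tokens[i] != MASK and tok != tokens[i] and conf >= edit_threshold:
                tokens[i] = tok
        # Unmask pass: commit the most confident still-masked positions,
        # spreading the remaining masked positions evenly over remaining steps.
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if masked:
            n_commit = -(-len(masked) // (steps - step))  # ceiling division
            masked.sort(key=lambda i: preds[i][1], reverse=True)
            for i in masked[:n_commit]:
                tokens[i] = preds[i][0]
    return tokens

TARGET = [3, 1, 4, 1]  # toy "ground truth" token sequence

def toy_predict(tokens):
    """Stand-in for the model: wrong at position 0 until 2 tokens are known."""
    n_known = sum(t != MASK for t in tokens)
    preds = []
    for i, true_tok in enumerate(TARGET):
        if i == 0:
            preds.append((9, 0.95) if n_known < 2 else (true_tok, 0.95))
        else:
            preds.append((true_tok, 0.6 + 0.1 * i))
    return preds

with_editing = decode_with_editing(toy_predict, length=4, steps=4)
# A threshold above 1.0 can never be met, which disables the edit pass.
without_editing = decode_with_editing(toy_predict, length=4, steps=4,
                                      edit_threshold=2.0)
print(with_editing, without_editing)
```

In this toy run the early wrong token at position 0 survives when editing is disabled and is replaced when editing is enabled; a real model would produce confidences from its output distribution rather than from a hand-written rule.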

Original Article


Cached at: 06/30/26, 03:33 AM


Source: https://huggingface.co/papers/2606.29814

Abstract

A masked discrete diffusion model for text-to-image synthesis that addresses limitations in token refinement and training efficiency through novel mechanisms and optimizations.

We propose Nemotron-Labs-Diffusion-Image, a state-of-the-art masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. Compared with prior work on masked image generation, Nemotron-Labs-Diffusion-Image addresses two key challenges. First, unlike continuous diffusion models, which progressively refine latent representations across the entire image, standard MDMs lack self-correcting capability because discrete tokens cannot be modified once they are unmasked. Second, although increasing the vocabulary size of discrete image tokenizers improves reconstruction fidelity, it introduces optimization difficulties for generative modeling as the per-token training signal becomes increasingly sparse. To address the first challenge, Nemotron-Labs-Diffusion-Image incorporates a token-editing mechanism that enables the model to dynamically revise already-unmasked tokens during inference, similar to how a sculptor iteratively refines their work. To tackle the second challenge, we propose a Grouped Cross-Entropy (GCE) objective that assigns positive learning signals to tokens neighboring the ground truth in embedding space, thereby alleviating signal sparsity. To further improve training efficiency, we implement a custom fused operator for GCE that significantly reduces VRAM usage in large-vocabulary settings. Experimental results demonstrate that these innovations substantially improve both training efficiency and image fidelity of masked discrete image generators, achieving a score of 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3.
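One way to read the Grouped Cross-Entropy idea is sketched below. The abstract only says that tokens neighboring the ground truth in embedding space receive positive signal; the exact loss is not given here. This sketch therefore assumes one plausible form: the group is the k codebook entries nearest to the ground-truth token (itself included), and the loss is the negative log of the total probability the model assigns to that group. The helper names `nearest_group` and `grouped_cross_entropy`, the Euclidean distance, and the value of `k` are all hypothetical, and the paper's fused operator is not reproduced.

```python
import math

def nearest_group(codebook, target, k):
    """Indices of the k codebook entries closest to `target` (itself included)."""
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(codebook[i], codebook[target]))
    return sorted(range(len(codebook)), key=sq_dist)[:k]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def grouped_cross_entropy(logits, group):
    """Assumed GCE form: -log of the probability mass placed on the group."""
    logp = log_softmax(logits)
    m = max(logp[i] for i in group)
    return -(m + math.log(sum(math.exp(logp[i] - m) for i in group)))

# Toy codebook: tokens 0 and 1 are near each other, as are tokens 2 and 3.
codebook = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
logits = [1.0, 2.0, 0.5, 0.0]  # the model prefers token 1, a near neighbor of 0

group = nearest_group(codebook, target=0, k=2)
plain_ce = grouped_cross_entropy(logits, [0])  # a group of one is standard CE
gce = grouped_cross_entropy(logits, group)
print(group, plain_ce, gce)
```

Under this assumed form, a prediction that lands on a close neighbor of the ground-truth token is penalized less than under standard cross-entropy, which is one way the per-token signal could become less sparse as the vocabulary grows.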


Get this paper in your agent:

hf papers read 2606.29814

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation

Discrete Stochastic Localization for Non-autoregressive Generation
