@NVIDIAAI: Most language models only generate one token at a time. We just released Nemotron-Labs-Diffusion, a family of diffusion…

X AI KOLs Following 05/19/26, 11:58 PM Models

Summary

NVIDIA released Nemotron-Labs-Diffusion, a family of diffusion language models that generate multiple tokens in parallel, enabling faster inference and better GPU utilization, with sizes from 3B to 14B including vision-language variants.

Most language models only generate one token at a time. We just released Nemotron-Labs-Diffusion, a family of diffusion language models that take a different approach, generating multiple tokens in parallel within a single model. Rather than committing to each token permanently, these models can revise as they go, resulting in faster inference that better utilizes modern GPUs. The full model family ranges from 3B to 14B, including vision-language variants. Available now: https://nvda.ws/4tEnTxP

Original Article

Cached at: 05/20/26, 10:36 PM

nvidia/Nemotron-Labs-Diffusion-14B · Hugging Face

Source: https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-14B?linkId=100000422877931

Model Overview

Nemotron-Labs-Diffusion is a tri-mode language model that supports both AR decoding and diffusion-based parallel decoding by switching the attention pattern of the same model during inference. The synergy between these two modes enables a third mode, called self-speculation: the same model performs diffusion-based parallel drafting and AR verification with a shared KV cache, achieving high acceptance lengths and decoding efficiency. Because switching modes only requires changing the attention pattern, a single model can run efficiently at different concurrency levels across varying deployment scenarios.

An illustration of Tri-Mode LMs
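To make the draft-then-verify idea concrete, here is a minimal sketch of greedy speculative verification in general, not NVIDIA's implementation. The names `verify_draft` and `toy_next_token` are hypothetical, and the toy verifier stands in for the AR pass. A real system would score all draft positions in one forward pass; this sketch calls the verifier once per position for clarity.

```python
from typing import Callable, List


def verify_draft(
    prefix: List[int],
    draft: List[int],
    next_token: Callable[[List[int]], int],
) -> List[int]:
    """Accept draft tokens while they match the verifier's greedy choice.

    On the first mismatch the verifier's token replaces the draft token and
    verification stops. If every draft token matches, one extra verifier
    token is appended, so each round yields at least one new token.
    """
    accepted: List[int] = []
    context = list(prefix)
    for token in draft:
        expected = next_token(context)
        if token != expected:
            accepted.append(expected)  # correction from the verifier
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(next_token(context))  # bonus token after full acceptance
    return accepted


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for an AR model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


# The drafter proposes four tokens; the third one disagrees with the verifier.
print(verify_draft([1, 2, 3], [4, 5, 9, 7], toy_next_token))  # [4, 5, 6]
```

The number of tokens returned per round is the acceptance length. The longer the drafter agrees with the verifier, the fewer verification rounds a response needs.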

Highlights

SOTA 3B, 8B, 14B dense LM family (base, instruct, and vision-language variants) supporting AR, diffusion, and self-speculation with a focus on decode efficiency.
Generation moves from a memory-bound regime toward a compute-bound regime: model weights are loaded once and reused to compute multiple tokens during generation.
Self-speculation uses diffusion for drafting and AR for verification, providing a stronger alternative to MTP approaches:
- 3x higher acceptance length and 2.2x speed-up vs. Qwen3-8B-Eagle3 in SGLang.
- 5.9x tokens per forward over Qwen3-8B (no MTP) with the same accuracy.
Real-device speed-up across platforms:
- DGX Spark (8B, concurrency 1): 2.7x faster with 112 tok/sec vs. 41.8 tok/sec AR using w4a16.
- GB200 (8B, concurrency 1): 3.3x faster with 850 tok/sec vs. 253 tok/sec AR and 360 tok/sec Eagle3. Custom CUDA kernels boost to 1015 tok/sec (4x).
Diffusion speedup-of-light analysis shows that single-user throughput can be further doubled (vs. the current best) with better sampling, which is left to future research.
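The throughput figures quoted above can be cross-checked with simple arithmetic, and the memory-bound argument can be illustrated with a back-of-the-envelope estimate. The sketch below is an illustration only. The function names are hypothetical, and the weight-traffic estimate ignores KV-cache and activation traffic. The value of 4 tokens per forward is a made-up example, not a reported number.

```python
def speedup(new_tok_per_sec: float, baseline_tok_per_sec: float) -> float:
    """Ratio of two throughput figures."""
    return new_tok_per_sec / baseline_tok_per_sec


def weight_bytes_per_token(
    num_params: float, bytes_per_param: float, tokens_per_forward: float
) -> float:
    """Weight bytes read per generated token, assuming the weights are read
    once per forward pass and ignoring KV-cache and activation traffic."""
    return num_params * bytes_per_param / tokens_per_forward


# Throughput pairs (tok/sec) copied from the highlights above.
reported = {
    "DGX Spark 8B vs. AR": (112.0, 41.8),
    "GB200 8B vs. AR": (850.0, 253.0),
    "GB200 8B vs. Eagle3": (850.0, 360.0),
    "GB200 8B custom kernels vs. AR": (1015.0, 253.0),
}
for label, (new, base) in reported.items():
    print(f"{label}: {speedup(new, base):.2f}x")

# Hypothetical: an 8B model in bf16 (2 bytes per parameter).
one_token = weight_bytes_per_token(8e9, 2, 1)
four_tokens = weight_bytes_per_token(8e9, 2, 4)  # 4 is an assumed value
print(f"{one_token / 1e9:.0f} GB vs. {four_tokens / 1e9:.0f} GB of weights read per token")
```

The computed ratios land close to the quoted multipliers. For example, 850 / 253 is roughly 3.36, which the highlights state as 3.3x. The second estimate shows why producing several tokens per forward pass eases the memory bottleneck: the same weight read is amortized over more output tokens.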

Efficiency Results

Accuracy Results

License/Terms of Use

Use of this model is governed by the NVIDIA Nemotron Open Model License.

Environment

transformers>=5.0.0
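A quick way to confirm that the installed package satisfies this requirement is sketched below using only the standard library. The helper names are hypothetical. The comparison looks only at the leading numeric release part, so a pre-release such as `5.0.0rc1` would be treated as `5.0.0`.

```python
import re
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Return the leading numeric release, e.g. '5.1.0.dev0' -> (5, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = "5.0.0") -> bool:
    """Compare release tuples, padding the shorter one with zeros."""
    have, need = parse_release(installed), parse_release(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def transformers_is_recent_enough(minimum: str = "5.0.0") -> bool:
    """Check the installed transformers version; False if it is not installed."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed, minimum)


print(transformers_is_recent_enough())
```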

Chat with Our Model

from transformers import AutoModel, AutoTokenizer
import torch

repo_name = "nvidia/Nemotron-Labs-Diffusion-14B"

tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

history = []

user_input = input("User: ").strip()
history.append({"role": "user", "content": user_input})

prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device='cuda')

## The three calls below are alternatives; running all of them keeps only the last result.
## Chat in AR Mode
out_ids, nfe = model.ar_generate(prompt_ids, max_new_tokens=512)

## Chat in dLM Mode
out_ids, nfe = model.generate(prompt_ids, max_new_tokens=512, block_length=32, threshold=0.9, eos_token_id=tokenizer.eos_token_id)

## Chat in Linear Self-Speculation Mode
out_ids, nfe = model.linear_spec_generate(prompt_ids, max_new_tokens=512, block_length=32, eos_token_id=tokenizer.eos_token_id)

## Decode only the newly generated tokens (everything after the prompt).
decoded_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0]
print(f"Model: {decoded_out}")
print(f"[Num Function Eval (NFE)={nfe}]")
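The NFE value printed above can be turned into an average tokens-per-forward figure, the metric quoted in the highlights. This is a small sketch that assumes `nfe` counts model forward passes and that every position after the prompt in `out_ids` is a generated token. The helper name and the sample numbers are hypothetical.

```python
def tokens_per_forward(num_new_tokens: int, nfe: int) -> float:
    """Average number of generated tokens per function evaluation."""
    if nfe <= 0:
        raise ValueError("nfe must be positive")
    return num_new_tokens / nfe


# With the variables from the chat snippet this would be:
#   tokens_per_forward(out_ids.shape[1] - prompt_ids.shape[1], nfe)
# Hypothetical numbers for illustration: 120 new tokens in 40 evaluations.
print(tokens_per_forward(120, 40))  # 3.0
```

Under this assumption, plain AR decoding gives a value of about 1, so values above 1 indicate how much parallel decoding or self-speculation reduces the number of forward passes.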

Inference with Linear Self-Speculation + LoRA-enhanced Drafter

An optional LoRA adapter can be applied to the diffusion drafter in the linear self-speculation mode to further increase the acceptance length:

import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

repo = "nvidia/Nemotron-Labs-Diffusion-14B"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

# Attach the linear_spec LoRA adapter.
model = PeftModel.from_pretrained(model, repo, subfolder="linear_spec_lora").eval()
# Unwrap so we can call linear_spec_generate directly (it toggles LoRA internally).
base = model.model

history = [{"role": "user", "content": "Solve: What is 15% of 240?"}]
prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

out_ids, nfe = base.linear_spec_generate(
    prompt_ids, max_new_tokens=512, block_length=32,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out_ids[0, prompt_ids.shape[1]:], skip_special_tokens=True))
print(f"[NFE={nfe}]")

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the bias, explainability, safety & security, and privacy subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Citations

@techreport{fu2026nemotronlabsdiffusion,
  title       = {Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding},
  author      = {Yonggan Fu and Lexington Whalen and Abhinav Garg and Chengyue Wu and Maksim Khadkevich and Nicolai Oswald and Enze Xie and Daniel Egert and Sharath Turuvekere Sreenivas and Shizhe Diao and Chenhan Yu and Ye Yu and Weijia Chen and Sajad Norouzi and Shiyi Lan and Ligeng Zhu and Jin Wang and Jindong Jiang and Morteza Mardani and Mehran Maghoumi and Song Han and Ante Jukic and Nima Tajbakhsh and Jan Kautz and Pavlo Molchanov},
  institution = {NVIDIA},
  year        = {2026},
  note        = {Technical report}
}


