@NVIDIAAI: We took a 30B model and split it in two to write tokens in parallel instead of one at a time. Introducing Nemotron-Labs…

X AI KOLs Timeline | 07/01/26, 07:00 PM | Models

Summary

NVIDIA Research introduces Nemotron-Labs-TwoTower, a diffusion language model that splits a 30B model into two halves for parallel token generation, achieving 2.42× faster generation while retaining 98.7% of original quality.

We took a 30B model and split it in two to write tokens in parallel instead of one at a time. Introducing Nemotron-Labs-TwoTower: a diffusion language model from NVIDIA Research adapted from Nemotron-3-Nano-30B-A3B. Here’s how it works: one half holds the context, the other writes the tokens, with both reusing the pretrained model instead of training a new one from scratch. We found it kept 98.7% of the original model’s quality at 2.42× faster generation.
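The sketch below is a toy model of the split, built on stated assumptions. The names `context_tower`, `writer_tower`, and `MASK` are hypothetical, and so is the integer "vocabulary". The writer's "keep counting" rule stands in for a learned model, and the diffusion refinement steps are collapsed into a single pass per block. This is not NVIDIA's implementation. It only shows the control flow: one half reads the committed tokens, and the other fills a whole block of slots per call instead of one token per call.

```python
from typing import Dict, List, Tuple

MASK = -1  # hypothetical id for a slot the writer has not filled yet


def context_tower(prefix: List[int]) -> Dict[str, int]:
    """Hypothetical frozen 'context' half.

    A real tower would return per-layer key/value states for the prefix;
    this toy keeps only what the toy writer needs.
    """
    return {"last": prefix[-1]}


def writer_tower(ctx: Dict[str, int], block: List[int]) -> List[int]:
    """Hypothetical 'writer' half: fills every masked slot in one call.

    The rule "keep counting from the last context token" replaces a
    learned distribution over the vocabulary.
    """
    return [ctx["last"] + i + 1 if tok == MASK else tok for i, tok in enumerate(block)]


def generate(prefix: List[int], n_tokens: int, block_size: int) -> Tuple[List[int], int]:
    out = list(prefix)
    writer_calls = 0
    while len(out) - len(prefix) < n_tokens:
        remaining = n_tokens - (len(out) - len(prefix))
        size = min(block_size, remaining)
        ctx = context_tower(out)                  # context half reads committed tokens
        block = writer_tower(ctx, [MASK] * size)  # writer proposes the whole block at once
        writer_calls += 1
        out.extend(block)                         # committed block becomes context
    return out, writer_calls


out, calls = generate([1, 2, 3], n_tokens=8, block_size=4)
print(out, calls)
```

For example, `generate([1, 2, 3], 8, 4)` returns tokens 1 through 11 after two writer calls. With `block_size=1`, the same output takes eight calls. That reduction in sequential calls is where the parallelism comes from.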

Original Article


Cached at: 07/02/26, 02:16 AM


Similar Articles

NVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16, an unusual diffusion-based language model built from the Nemotron 3 Nano 30B-A3B backbone.

Reddit r/LocalLLaMA

NVIDIA released Nemotron-TwoTower-30B-A3B-Base-BF16, a diffusion-based language model that uses block-wise autoregressive diffusion to generate text by iterative denoising of token blocks, achieving 2.42× the generation throughput of the autoregressive baseline while retaining 98.7% of benchmark quality.
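Block-wise autoregressive diffusion, as summarized above, can be sketched in five steps:

1. Start each block fully masked.
2. Predict all masked positions at once.
3. Commit only the most confident guesses.
4. Repeat until the block is full.
5. Freeze the block as context for the next one.

This sketch assumes a confidence-based unmasking schedule, which is a common choice for masked-diffusion language models. The source does not say which schedule Nemotron-TwoTower uses. `toy_denoiser` is a hypothetical stand-in for the model.

```python
from typing import Callable, List, Tuple

MASK = -1  # hypothetical id for a still-masked position

# A denoiser maps (committed prefix, partially masked block) to a
# (token, confidence) guess for every position in the block.
Denoiser = Callable[[List[int], List[int]], List[Tuple[int, float]]]


def decode_block(prefix: List[int], block_size: int, steps: int,
                 denoise: Denoiser) -> Tuple[List[int], int]:
    block = [MASK] * block_size
    per_step = -(-block_size // steps)  # ceil(block_size / steps) slots committed per pass
    passes = 0
    while MASK in block:
        guesses = denoise(prefix, block)  # one forward pass predicts every position
        passes += 1
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        # Commit only the most confident masked positions; the rest stay masked.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = guesses[i][0]
    return block, passes


def generate(prefix: List[int], n_blocks: int, block_size: int, steps: int,
             denoise: Denoiser) -> Tuple[List[int], int]:
    out = list(prefix)
    total_passes = 0
    for _ in range(n_blocks):  # blocks are produced left to right (autoregressive)
        block, passes = decode_block(out, block_size, steps, denoise)
        out.extend(block)      # a finished block is frozen as context for the next
        total_passes += passes
    return out, total_passes


def toy_denoiser(prefix: List[int], block: List[int]) -> List[Tuple[int, float]]:
    """Hypothetical model: 'keep counting', more confident near the prefix."""
    return [(prefix[-1] + i + 1, 1.0 / (i + 1)) for i in range(len(block))]


print(generate([0], n_blocks=2, block_size=4, steps=2, denoise=toy_denoiser))
```

When `steps` equals `block_size`, each pass commits one token and the parallel advantage disappears. Fewer steps commit more tokens per pass, which typically trades some quality for speed.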

@NVIDIAAI: Most language models only generate one token at a time. We just released Nemotron-Labs-Diffusion, a family of diffusion…

X AI KOLs Following

NVIDIA released Nemotron-Labs-Diffusion, a family of diffusion language models that generate multiple tokens in parallel, enabling faster inference and better GPU utilization, with sizes from 3B to 14B parameters, including vision-language variants.
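One way to see why writing several tokens per step can be faster is to count sequential forward passes. The formulas below make these assumptions:

- B is the block size.
- S is the number of denoising steps per block.
- N is the number of generated tokens.
- Every forward pass costs the same.

Under those assumptions, an idealized comparison is:

```latex
\text{passes}_{\mathrm{AR}} = N, \qquad
\text{passes}_{\mathrm{block}} = \left\lceil \frac{N}{B} \right\rceil S, \qquad
\text{ideal speedup} = \frac{N}{\lceil N/B \rceil \, S} \approx \frac{B}{S}
```

For example, N = 256, B = 16, S = 4 gives 64 passes instead of 256, an ideal 4× reduction. The approximation is exact when B divides N. Real per-pass costs differ, so measured speedups such as the 2.42× reported for TwoTower are end-to-end figures, not this bound.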

@LiorOnAI: You now convert any LLM into a faster one without retraining from scratch. NVIDIA just did this to their 30B model. Her…

X AI KOLs Timeline

NVIDIA proposes a method to convert any LLM into a faster one by splitting it into two copies: one frozen to hold the context, the other trained to generate multiple tokens in parallel. The method achieves a 2.4× speedup with ~99% quality retention while using only 8% of the training data.
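The recipe in this summary uses two copies of one pretrained model: one frozen for context and one trained to write. It can be sketched without any ML framework. `Tower`, `split_pretrained`, and `sgd_step` are hypothetical names, and the parameters are plain floats. The sketch shows only two things: both towers start from the same pretrained weights, and updates touch the writer alone.

```python
import copy
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Tower:
    params: Dict[str, float]  # toy stand-in for a network's weights
    trainable: bool


def split_pretrained(pretrained: Dict[str, float]) -> Tuple[Tower, Tower]:
    """Hypothetical helper: two independent copies of the same pretrained weights."""
    context = Tower(copy.deepcopy(pretrained), trainable=False)  # frozen, holds context
    writer = Tower(copy.deepcopy(pretrained), trainable=True)    # adapted to write tokens
    return context, writer


def sgd_step(towers: List[Tower], grads: Dict[str, float], lr: float) -> None:
    """Plain gradient-descent update (toy: one shared gradient) that skips frozen towers."""
    for tower in towers:
        if not tower.trainable:
            continue
        for name, g in grads.items():
            tower.params[name] -= lr * g


pretrained = {"w": 1.0}
context, writer = split_pretrained(pretrained)
sgd_step([context, writer], {"w": 0.5}, lr=1.0)
print(context.params, writer.params, pretrained)
```

In PyTorch, the same effect is commonly obtained with `copy.deepcopy(model)` followed by `requires_grad_(False)` on the frozen copy, so the optimizer never updates it.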

Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

Hugging Face Blog

NVIDIA introduces Nemotron-Labs Diffusion, a family of diffusion language models that generate text in parallel and iteratively refine it, offering faster generation and the ability to revise previous tokens.

@ctnzr: We've gone even farther: Nemotron 3 Super is 120B and pretrained on 25T tokens in NVFP4. Nemotron 3 Ultra is ~500B and …