@NVIDIAAI: We took a 30B model and split it in two to write tokens in parallel instead of one at a time. Introducing Nemotron-Labs…
Summary
NVIDIA Research introduces Nemotron-Labs-TwoTower, a diffusion language model that splits a 30B model into two halves so it can generate tokens in parallel; it generates 2.42× faster while retaining 98.7% of the original model's quality.
Cached at: 07/02/26, 02:16 AM
We took a 30B model and split it in two to write tokens in parallel instead of one at a time.
Introducing Nemotron-Labs-TwoTower: a diffusion language model from NVIDIA Research adapted from Nemotron-3-Nano-30B-A3B. Here’s how it works: one half holds the context, the other writes the tokens, with both reusing the pretrained model instead of training a new one from scratch.
We found it kept 98.7% of the original model’s quality at 2.42× faster generation.
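As a rough illustration of the "one half holds the context, the other writes the tokens" idea, here is a toy Python sketch of block-wise parallel generation. It is not NVIDIA's implementation: `context_tower`, `writer_tower`, `MASK`, and `commits_per_step` are hypothetical names, and the "model" is a hand-written counting rule rather than a neural network. The only point is the control flow: each block starts fully masked and is filled in over a few refinement steps, several positions per step, before the next block begins.

```python
from typing import List, Tuple

MASK = -1  # hypothetical id marking a position that has not been written yet


def context_tower(tokens: List[int]) -> List[int]:
    """Stand-in for the half that holds the context (here it only copies the tokens)."""
    return list(tokens)


def writer_tower(context: List[int], block: List[int]) -> List[Tuple[int, float]]:
    """Stand-in for the half that writes tokens.

    Returns one (token, confidence) guess per block position. Toy rule:
    keep counting up from the last context token, with lower confidence
    for positions further to the right.
    """
    last = context[-1]
    return [(last + 1 + i, 1.0 / (1 + i)) for i in range(len(block))]


def generate(prompt: List[int], num_blocks: int, block_size: int,
             commits_per_step: int) -> Tuple[List[int], int]:
    """Generate num_blocks blocks; return the full sequence and the number of writer steps."""
    if commits_per_step < 1:
        raise ValueError("commits_per_step must be at least 1")
    tokens = list(prompt)
    steps = 0
    for _ in range(num_blocks):
        context = context_tower(tokens)   # context is computed once per block
        block = [MASK] * block_size       # every position starts out masked
        while MASK in block:
            guesses = writer_tower(context, block)
            masked = [i for i, t in enumerate(block) if t == MASK]
            # Commit the most confident masked positions in this step.
            masked.sort(key=lambda i: guesses[i][1], reverse=True)
            for i in masked[:commits_per_step]:
                block[i] = guesses[i][0]
            steps += 1
        tokens.extend(block)              # a finished block becomes context for the next
    return tokens, steps


if __name__ == "__main__":
    sequence, steps = generate([1, 2, 3], num_blocks=2, block_size=4, commits_per_step=2)
    print(sequence, steps)
```

In this toy run, 8 new tokens are written in 4 writer steps instead of the 8 steps a one-token-at-a-time loop would need. That ratio is a property of the toy settings only and says nothing about the 2.42× figure reported above.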
Similar Articles
NVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16, an unusual diffusion-based language model built from the Nemotron 3 Nano 30B-A3B backbone.
NVIDIA released Nemotron-TwoTower-30B-A3B-Base-BF16, a diffusion-based language model that uses block-wise autoregressive diffusion to generate text by iterative denoising of token blocks, achieving 2.42× the generation throughput of the autoregressive baseline while retaining 98.7% of benchmark quality.
@NVIDIAAI: Most language models only generate one token at a time. We just released Nemotron-Labs-Diffusion, a family of diffusion…
NVIDIA released Nemotron-Labs-Diffusion, a family of diffusion language models that generate multiple tokens in parallel, enabling faster inference and better GPU utilization, with sizes from 3B to 14B including vision-language variants.
@LiorOnAI: You now convert any LLM into a faster one without retraining from scratch. NVIDIA just did this to their 30B model. Her…
NVIDIA proposes a method to convert any LLM into a faster one by splitting it into two copies: one frozen for context, the other trained to generate multiple tokens in parallel, achieving a 2.4× speedup with ~99% quality retention using only 8% of training data.
Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models
NVIDIA introduces Nemotron-Labs Diffusion, a family of diffusion language models that generate text in parallel and iteratively refine it, offering faster generation and the ability to revise previous tokens.
@ctnzr: We've gone even farther: Nemotron 3 Super is 120B and pretrained on 25T tokens in NVFP4. Nemotron 3 Ultra is ~500B and …
NVIDIA announces Nemotron 3 Super (120B) and Nemotron 3 Ultra (~500B) models, pretrained on 25T tokens using NVFP4 precision, emphasizing accelerated computing and efficiency improvements.