@NVIDIAAI: We took a 30B model and split it in two to write tokens in parallel instead of one at a time. Introducing Nemotron-Labs…


Summary

NVIDIA Research introduces Nemotron-Labs-TwoTower, a diffusion language model adapted from Nemotron-3-Nano-30B-A3B. It splits the 30B model into two halves so tokens can be generated in parallel, and it runs 2.42× faster while retaining 98.7% of the original model's quality.


Cached at: 07/02/26, 02:16 AM

We took a 30B model and split it in two to write tokens in parallel instead of one at a time.

Introducing Nemotron-Labs-TwoTower: a diffusion language model from NVIDIA Research adapted from Nemotron-3-Nano-30B-A3B. Here’s how it works: one half holds the context, the other writes the tokens, with both reusing the pretrained model instead of training a new one from scratch.
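To make "in parallel instead of one at a time" concrete, here is a minimal sketch that only counts model passes. It is not the TwoTower implementation: the function names, the block size, and the number of refinement passes per block are all hypothetical values chosen for illustration, and none of them come from the post.

```python
import math


def sequential_passes(num_tokens: int) -> int:
    """One-at-a-time decoding: one model pass per generated token."""
    return num_tokens


def parallel_passes(num_tokens: int, block_size: int, refine_passes: int) -> int:
    """Block-parallel decoding (hypothetical scheme): every token in a block
    is written together, and each block gets a fixed number of refinement
    passes before moving on to the next block."""
    return math.ceil(num_tokens / block_size) * refine_passes


tokens = 256  # hypothetical response length
print(sequential_passes(tokens))                                 # 256
print(parallel_passes(tokens, block_size=16, refine_passes=4))   # 64
```

Pass counts are not wall-clock time, so the ratio printed here should not be read as the reported speedup. The sketch only shows why writing several tokens per pass can reduce the number of sequential steps.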

We found it kept 98.7% of the original model’s quality at 2.42× faster generation.
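As a quick worked example of what those two figures imply, the arithmetic below uses only the reported 2.42× and 98.7%. The 10-second baseline is a made-up number for illustration, and the post does not say how quality or speed were measured.

```python
SPEEDUP = 2.42            # reported generation speedup
QUALITY_RETAINED = 0.987  # reported fraction of original quality kept


def time_after_speedup(baseline_seconds: float, speedup: float = SPEEDUP) -> float:
    """Generation time once the speedup is applied to a baseline time."""
    return baseline_seconds / speedup


baseline = 10.0  # hypothetical baseline generation time in seconds
print(round(time_after_speedup(baseline), 2))  # 4.13 seconds
print(round(1 - 1 / SPEEDUP, 3))               # 0.587, i.e. about 59% less time
print(round(1 - QUALITY_RETAINED, 3))          # 0.013, i.e. a 1.3% quality gap
```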
