@LiorOnAI: You can now convert any LLM into a faster one without retraining from scratch. NVIDIA just did this to their 30B model. Her…


Summary

NVIDIA proposes a method to convert an existing LLM into a faster one by duplicating it into two copies: one is frozen and holds the context, the other is trained to generate multiple tokens in parallel. Applied to a 30B model, the reported result is a 2.4x generation speedup with ~99% of the original quality, using only ~8% of the original training data.

Cached at: 07/01/26, 10:13 PM

You can now convert any LLM into a faster one without retraining from scratch.

NVIDIA just did this to their 30B model. Here’s the trick:

  1. Duplicate the model into two copies

  2. Freeze one copy; it only reads the prompt and holds the context

  3. Train the other copy to write several tokens at once instead of one token at a time

  4. Run them together
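
The four steps above can be sketched in a few lines of plain Python. This is a toy illustration only, not NVIDIA's implementation: `ToyModel`, `make_two_towers`, and `sequential_passes` are hypothetical names, the "model" is just a list of numbers, and the block size and passes-per-block values are made-up assumptions chosen to show why writing tokens in blocks reduces the number of sequential generation steps.

```python
import copy
import math
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical stand-in for an LLM: just a list of weights."""
    weights: list
    trainable: bool = True

    def apply_gradient(self, grads, lr=0.5):
        # A frozen model ignores every update.
        if not self.trainable:
            return
        self.weights = [w - lr * g for w, g in zip(self.weights, grads)]


def make_two_towers(pretrained):
    # Step 1: duplicate the pretrained model into two independent copies.
    context_tower = copy.deepcopy(pretrained)
    generator_tower = copy.deepcopy(pretrained)
    # Step 2: freeze the copy that only reads the prompt.
    context_tower.trainable = False
    return context_tower, generator_tower


def sequential_passes(num_tokens, block_size, passes_per_block=1):
    # Generator passes that must run one after another to emit num_tokens
    # when each block of block_size tokens is produced together.
    return math.ceil(num_tokens / block_size) * passes_per_block


pretrained = ToyModel(weights=[1.0, 2.0, 3.0])
context_tower, generator_tower = make_two_towers(pretrained)

# Step 3: a training update changes only the generator copy.
grads = [1.0, 1.0, 1.0]
context_tower.apply_gradient(grads)
generator_tower.apply_gradient(grads)

# Step 4: compare sequential work for token-by-token vs block generation.
# block_size=8 and passes_per_block=2 are illustrative assumptions.
one_at_a_time = sequential_passes(num_tokens=64, block_size=1)
in_blocks = sequential_passes(num_tokens=64, block_size=8, passes_per_block=2)

print("frozen tower weights:   ", context_tower.weights)
print("generator tower weights:", generator_tower.weights)
print("sequential passes:", one_at_a_time, "vs", in_blocks)
```

The pass counts here come from the made-up block settings and are not meant to reproduce the reported 2.4x figure; real speedups depend on the actual block size, the number of refinement passes, and hardware.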

The frozen copy adds almost no training cost, since it’s already trained. The new copy needed only ~8% of the original training data to learn the new trick.

Result: 2.4x faster generation, keeping ~99% of the original quality.

NVIDIA AI (@NVIDIAAI): We took a 30B model and split it in two to write tokens in parallel instead of one at a time.

Introducing Nemotron-Labs-TwoTower: a diffusion language model from NVIDIA Research adapted from Nemotron-3-Nano-30B-A3B. Here’s how it works: one half holds the context, the other
