@LiorOnAI: You can now convert any LLM into a faster one without retraining from scratch. NVIDIA just did this to their 30B model. Her…
Summary
NVIDIA proposes a method to convert an LLM into a faster one by splitting it into two copies: one is frozen and holds the context, the other is trained to generate multiple tokens in parallel. The reported result is a 2.4x generation speedup with ~99% of the original quality retained, using only ~8% of the original training data.
Cached at: 07/01/26, 10:13 PM
You can now convert any LLM into a faster one without retraining from scratch.
NVIDIA just did this to their 30B model. Here’s the trick:
- Duplicate the model into two copies
- Freeze one copy, it just reads the prompt and remembers context
- Train the other copy to write chunks of text at once instead of one word at a time
- Run them together
The frozen copy adds almost no training cost (it’s already trained). The new copy needed only ~8% of the original training data to learn the new trick.
Result: 2.4x faster generation, keeping ~99% of the original quality.
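To make the control flow concrete, here is a toy sketch of the steps above. It is not NVIDIA's implementation: `ContextTower`, `BlockGenerator`, `generate`, and the `block_size` parameter are hypothetical names, and the "model" is a stand-in that just emits position-numbered tokens. It only shows why writing a block per step needs fewer generator calls than writing one token per step. A real diffusion language model also spends several denoising steps per block, so the call count here is not the measured speedup.

```python
class ContextTower:
    """Stand-in for the frozen copy: reads the tokens so far and returns context."""

    def encode(self, tokens):
        # Toy "context": an immutable snapshot of everything seen so far.
        return tuple(tokens)


class BlockGenerator:
    """Stand-in for the trained copy: emits `block_size` tokens per call."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.calls = 0  # number of generator passes, used to compare costs

    def generate_block(self, context):
        self.calls += 1
        start = len(context)
        # Toy rule (assumption): the next token is named after its position.
        return [f"tok{start + i}" for i in range(self.block_size)]


def generate(prompt, n_new, block_size):
    """Run both towers together until `n_new` tokens have been appended."""
    context_tower = ContextTower()
    generator = BlockGenerator(block_size)
    tokens = list(prompt)
    target = len(prompt) + n_new
    while len(tokens) < target:
        context = context_tower.encode(tokens)
        block = generator.generate_block(context)
        # Trim the last block so we never overshoot the requested length.
        tokens.extend(block[: target - len(tokens)])
    return tokens, generator.calls


if __name__ == "__main__":
    one_at_a_time, calls_1 = generate(["a", "b"], n_new=10, block_size=1)
    in_blocks, calls_4 = generate(["a", "b"], n_new=10, block_size=4)
    print(calls_1, calls_4)
    print(one_at_a_time == in_blocks)
```

With `block_size=1` the loop behaves like ordinary one-token-at-a-time decoding. Raising `block_size` cuts the number of generator passes while, in this toy, producing the same tokens. In a real model the block outputs are only approximately as good, which is where the ~99% quality figure comes in.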
NVIDIA AI (@NVIDIAAI): We took a 30B model and split it in two to write tokens in parallel instead of one at a time.
Introducing Nemotron-Labs-TwoTower: a diffusion language model from NVIDIA Research adapted from Nemotron-3-Nano-30B-A3B. Here’s how it works: one half holds the context, the other