NVIDIA proposes a method for converting any LLM into a faster one by splitting it into two copies: one kept frozen to handle context, the other trained to generate multiple tokens in parallel. The approach achieves a 2.4x speedup with ~99% quality retention while using only 8% of the training data.
NVIDIA Research introduces Nemotron-Labs-TwoTower, a diffusion language model that splits a 30B model into two halves for parallel token generation, achieving 2.42× faster generation while retaining 98.7% of original quality.
This paper presents a low-latency real-time audio game commentary system that uses LLM-based parallel text generation to reduce inter-utterance silence from 9.6 to 0.3 seconds, significantly improving perceived speaking rhythm compared to sequential baselines.
NVIDIA optimizes Google DeepMind's DiffusionGemma, an open model that generates text in parallel 256-token blocks, achieving up to 4x faster performance on local RTX GPUs, DGX Spark, and DGX Station systems.
DiffusionGemma is a new experimental model from Google DeepMind that uses parallel generation on a 256-token canvas, achieving up to 4x faster token generation on GPUs. This developer guide explains its architecture, bidirectional context, and includes a fine-tuning recipe for solving Sudoku.
Orthrus is a dual-architecture framework that combines autoregressive LLM fidelity with diffusion model speed, delivering up to a 7.8x speedup on Qwen3 models while guaranteeing an identical output distribution.
This paper introduces FeF-DLLM, a discrete diffusion language model that eliminates factorization errors by using exact prefix-conditioned factorization and accelerates inference via speculative decoding, achieving significant improvements in accuracy and speed on benchmarks such as GSM8K and MATH.
This paper introduces DiffRetriever, a method that uses diffusion language models to generate multiple representative tokens in parallel for efficient information retrieval, outperforming autoregressive baselines in speed and accuracy.
DFlash is a new speculative decoding framework that uses a lightweight block diffusion model for parallel token drafting, achieving over 6x acceleration compared to autoregressive methods. It significantly outperforms existing state-of-the-art methods like EAGLE-3 while maintaining high output quality.