parallel-generation

#parallel-generation

@LiorOnAI: You now convert any LLM into a faster one without retraining from scratch. NVIDIA just did this to their 30B model. Her…

X AI KOLs Timeline · yesterday

NVIDIA proposes a method to convert any LLM into a faster one by splitting it into two copies: one frozen for context, the other trained to generate multiple tokens in parallel, achieving 2.4x speedup with ~99% quality retention using only 8% of training data.

@NVIDIAAI: We took a 30B model and split it in two to write tokens in parallel instead of one at a time. Introducing Nemotron-Labs…

X AI KOLs Timeline · yesterday

NVIDIA Research introduces Nemotron-Labs-TwoTower, a diffusion language model that splits a 30B model into two halves for parallel token generation, achieving 2.42× faster generation while retaining 98.7% of original quality.

Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation

arXiv cs.CL · 2026-06-12

This paper presents a low-latency real-time audio game commentary system that uses LLM-based parallel text generation to reduce inter-utterance silence from 9.6 to 0.3 seconds, significantly improving perceived speaking rhythm compared to sequential baselines.

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

NVIDIA Blog · 2026-06-10

NVIDIA optimizes Google DeepMind's DiffusionGemma, an open model that generates text in parallel 256-token blocks, achieving up to 4x faster performance on local RTX GPUs, DGX Spark, and DGX Station systems.

DiffusionGemma: The Developer Guide - Google Developers Blog

Reddit r/LocalLLaMA · 2026-06-10

DiffusionGemma is a new experimental model from Google DeepMind that uses parallel generation on a 256-token canvas, achieving up to 4x faster token generation on GPUs. This developer guide explains its architecture, bidirectional context, and includes a fine-tuning recipe for solving Sudoku.

Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution

Hacker News Top · 2026-05-15

Orthrus is a dual-architecture framework that combines autoregressive LLM fidelity with diffusion-model speed, producing up to 7.8× tokens per forward pass on Qwen3 models while guaranteeing an identical output distribution.

Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

arXiv cs.CL · 2026-05-15

This paper introduces FeF-DLLM, a discrete diffusion language model that eliminates factorization errors by using exact prefix-conditioned factorization and accelerates inference via speculative decoding, achieving significant improvements in accuracy and speed on benchmarks such as GSM8K and MATH.

DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

Hugging Face Daily Papers · 2026-05-08

This paper introduces DiffRetriever, a method that uses diffusion language models to generate multiple representative tokens in parallel for efficient information retrieval, outperforming autoregressive baselines in speed and accuracy.

DFlash: Block Diffusion for Flash Speculative Decoding

Papers with Code Trending · 2026-02-05

DFlash is a new speculative decoding framework that uses a lightweight block diffusion model for parallel token drafting, achieving over 6x acceleration compared to autoregressive methods. It significantly outperforms existing state-of-the-art methods like EAGLE-3 while maintaining high output quality.
