Intermittent random token injection during decoding increases LLM diversity without fine-tuning
Summary
A Harvard research paper introduces Recoding-Decoding (RD), a novel decoding scheme that injects random priming phrases and diverting tokens to tap into an LLM's long-tail knowledge, significantly boosting output diversity without fine-tuning. The method maintains high relevance while mitigating response homogenization, with stronger models showing greater diversity gains.
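The summary does not spell out RD's exact injection schedule, so the following is only a minimal sketch of the general idea: during ordinary sampling, occasionally substitute a random diverting token, and optionally prepend a random priming phrase. The phrase pool, toy vocabulary, toy next-token model, and 10% injection rate are all illustrative assumptions, not the paper's settings.

    import random

    # Illustrative priming phrases and vocabulary; the paper's actual phrase pool
    # and tokenization are not described in this summary.
    PRIMING_PHRASES = ["In an unusual turn,", "Rarely discussed,", "From another angle,"]
    VOCAB = ["the", "a", "cat", "dog", "runs", "sleeps", "quietly", "quickly", "."]

    def toy_next_token_distribution(prefix):
        """Stand-in for an LLM's next-token distribution (uniform here)."""
        return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

    def sample_from(dist):
        tokens, probs = zip(*dist.items())
        return random.choices(tokens, weights=probs, k=1)[0]

    def decode_with_random_injection(prompt_tokens, max_new_tokens=20, inject_prob=0.1):
        """Intermittently replace the model's sampled token with a random
        'diverting' token; the 10% injection rate is an assumed value."""
        # Prepend a random priming phrase to nudge generation toward long-tail content.
        out = [random.choice(PRIMING_PHRASES)] + list(prompt_tokens)
        for _ in range(max_new_tokens):
            if random.random() < inject_prob:
                token = random.choice(VOCAB)  # inject a random diverting token
            else:
                token = sample_from(toy_next_token_distribution(out))  # ordinary sampling
            out.append(token)
        return " ".join(out)

    print(decode_with_random_injection(["the", "cat"]))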
Similar Articles
R²-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
R²-dLLM introduces spatio-temporal redundancy reduction techniques that cut diffusion LLM decoding steps by up to 75% while preserving generation quality, addressing a key deployment bottleneck.
Faster LLM Inference via Sequential Monte Carlo
This paper proposes Sequential Monte Carlo Speculative Decoding (SMC-SD), a method that accelerates LLM inference by replacing token-level rejection in speculative decoding with importance-weighted resampling over draft particles, achieving a 2.36× speedup over standard speculative decoding and 5.2× over autoregressive decoding while keeping accuracy loss within 3%.
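For readers unfamiliar with the resampling idea, here is a minimal sketch of a single decoding step that replaces speculative decoding's accept/reject test with importance-weighted resampling over draft particles. The draft and target distributions and the particle count are toy stand-ins, not SMC-SD's actual models.

    import random

    VOCAB = list("abcde")

    def draft_prob(token, prefix):
        """Toy stand-in for the cheap draft model's next-token probability."""
        return 1.0 / len(VOCAB)

    def target_prob(token, prefix):
        """Toy stand-in for the expensive target model's probability,
        biased toward 'a' so the resampling effect is visible."""
        return {"a": 0.5, "b": 0.2, "c": 0.1, "d": 0.1, "e": 0.1}[token]

    def smc_step(prefix, num_particles=8):
        """One step: propose particles from the draft model, weight each by
        target/draft probability, then resample in proportion to the weights
        instead of applying a token-level accept/reject test."""
        particles = [random.choices(VOCAB,
                                    weights=[draft_prob(t, prefix) for t in VOCAB])[0]
                     for _ in range(num_particles)]
        weights = [target_prob(tok, prefix) / draft_prob(tok, prefix) for tok in particles]
        return random.choices(particles, weights=weights, k=num_particles)

    print(smc_step(prefix="", num_particles=8))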
Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs
This paper introduces a validity-diversity framework attributing diversity collapse in LLMs to order and shape miscalibration during decoding, validated across 14 language models.
@_avichawla: Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effe…
Researchers introduced DFlash, a technique using block diffusion models for speculative decoding that accelerates LLM inference by up to 8.5x without accuracy loss. It is already integrated with major frameworks like vLLM and SGLang.
LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection
This paper introduces LEAP, a training-free method to accelerate inference in Diffusion Language Models (dLLMs) by detecting early-converging tokens, reducing denoising steps by 30% without losing accuracy.
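The blurb does not state LEAP's convergence criterion; one plausible reading, sketched below, is to freeze a token position once its prediction has stayed unchanged for a few consecutive denoising passes, skipping further updates for that position. The toy denoiser, stability window, and sequence length are assumptions made purely for illustration.

    import random

    VOCAB = list("abcd")
    SEQ_LEN = 6
    STABLE_STEPS = 3  # freeze a position after this many unchanged predictions (assumed)

    def toy_denoise(current):
        """Stand-in for one dLLM denoising pass that re-predicts every position."""
        return [random.choice(VOCAB) for _ in current]

    def decode_with_early_freeze(num_steps=20):
        seq = [random.choice(VOCAB) for _ in range(SEQ_LEN)]
        stable_count = [0] * SEQ_LEN
        frozen = [False] * SEQ_LEN
        for _ in range(num_steps):
            proposal = toy_denoise(seq)
            for i in range(SEQ_LEN):
                if frozen[i]:
                    continue  # position already converged; skip further updates
                if proposal[i] == seq[i]:
                    stable_count[i] += 1
                    if stable_count[i] >= STABLE_STEPS:
                        frozen[i] = True  # detected as early-converged
                else:
                    stable_count[i] = 0
                    seq[i] = proposal[i]
            if all(frozen):
                break  # every position converged early; stop denoising
        return "".join(seq)

    print(decode_with_early_freeze())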