Intermittent random token injection during decoding stage increases LLM diversity without fine-tuning

Reddit r/ArtificialInteligence Papers

Summary

A Harvard research paper introduces Recoding-Decoding (RD), a novel decoding scheme that injects random priming phrases and diverting tokens to tap into an LLM's long-tail knowledge, significantly boosting output diversity without fine-tuning. The method maintains high relevance while mitigating response homogenization, with stronger models showing greater diversity gains.

"A new paper out of Harvard (Luo, King, Puett, Smith) introduces Recoding-Decoding (RD), a decoding scheme that pulls the long tail of an LLM's knowledge into actual outputs by injecting priming phrases and diverting tokens during decoding stage. How RD works: The authors argue that modern LLMs encode an enormous slice of human knowledge, but standard decoding (top-k, nucleus, etc.) only ever pulls from the peak of the conditional distribution. The long tail — heterodox, contrarian, non-Western, weird-but-relevant — sits unused. RD diverts the model off its modal path by: 1) Prepending a random ""priming phrase"" (e.g., **Related to FOOD:**, **Related to SKY:**) 2) Injecting a random 3-letter ""diverting stem"" (Pas, Tib, Mon, …) at the start of each new sentence For example, ""Brainstorm a world history topic"" can now resolve to ""[Pas]ta and the silk road"" or ""[Tib]etan sky burials"" by absorbing the injected tokens of [Pas] and [Tib], instead of generating the dominant answer of ""Age of Enlightenment."" What they found across 50 brainstorm topics + 500 prompts from 5 public datasets that relevance stays around 0.99 but diversity grows almost linearly out to 1,000 runs. They also found that the stronger the LLM (Gemini-3 > GPT-5.1 > GPT-3.5 > DeepSeek-3), the larger RD's lead — because more capable models have more peaked distributions and thus more hidden tail knowledge. Why it matters: The authors frame this as the ""search quest"" problem — picking a wedding dress, a research topic, a startup name, a school for a kid. The goal isn't the correct answer; it's learning the space. Current LLMs are anti-optimized for that, which the paper argues is quietly driving collective homogenization (they cite a striking incident where students using ChatGPT to outline essays turned in nearly identical arguments without ever talking to each other). 📄 Paper: [https://arxiv.org/abs/2603.19519](https://arxiv.org/abs/2603.19519)

Similar Articles

PARTREP: Learning What to Repeat for Decoder-only LLMs

arXiv cs.CL

PartRep proposes a selective prompt repetition method for decoder-only LLMs that appends only the most informative tokens (selected via NLL) instead of the full prompt, reducing KV cache and prefill FLOPs while retaining most of the accuracy gains across multiple benchmarks.

Speculative Refinement: A Hybrid Autoregressive Diffusion Decoding Strategy and Its Behavior Across Benchmarks

arXiv cs.AI

Introduces Speculative Refinement (SpecRef), a training-free hybrid decoding strategy that warm-starts a masked diffusion language model from an autoregressive draft using entropy-guided selective masking. Evaluated across six benchmarks, it reveals that code benchmarks conflate structural discovery with logical correctness, identifies a refinement tension phenomenon, and shows that evaluation protocols can produce different model rankings.

Faster LLM Inference via Sequential Monte Carlo

arXiv cs.CL

This paper proposes Sequential Monte Carlo Speculative Decoding (SMC-SD), a method that accelerates LLM inference by replacing token-level rejection in speculative decoding with importance-weighted resampling over draft particles, achieving a 2.36× speedup over standard speculative decoding and a 5.2× speedup over autoregressive decoding with a reported 3% accuracy loss.