Intermittent random token injection during the decoding stage increases LLM diversity without fine-tuning

Reddit r/ArtificialInteligence Papers

Summary

A Harvard research paper introduces Recoding-Decoding (RD), a novel decoding scheme that injects random priming phrases and diverting tokens to tap into an LLM's long-tail knowledge, significantly boosting output diversity without fine-tuning. The method maintains high relevance while mitigating response homogenization, with stronger models showing greater diversity gains.

"A new paper out of Harvard (Luo, King, Puett, Smith) introduces Recoding-Decoding (RD), a decoding scheme that pulls the long tail of an LLM's knowledge into actual outputs by injecting priming phrases and diverting tokens during decoding stage. How RD works: The authors argue that modern LLMs encode an enormous slice of human knowledge, but standard decoding (top-k, nucleus, etc.) only ever pulls from the peak of the conditional distribution. The long tail — heterodox, contrarian, non-Western, weird-but-relevant — sits unused. RD diverts the model off its modal path by: 1) Prepending a random ""priming phrase"" (e.g., **Related to FOOD:**, **Related to SKY:**) 2) Injecting a random 3-letter ""diverting stem"" (Pas, Tib, Mon, …) at the start of each new sentence For example, ""Brainstorm a world history topic"" can now resolve to ""[Pas]ta and the silk road"" or ""[Tib]etan sky burials"" by absorbing the injected tokens of [Pas] and [Tib], instead of generating the dominant answer of ""Age of Enlightenment."" What they found across 50 brainstorm topics + 500 prompts from 5 public datasets that relevance stays around 0.99 but diversity grows almost linearly out to 1,000 runs. They also found that the stronger the LLM (Gemini-3 > GPT-5.1 > GPT-3.5 > DeepSeek-3), the larger RD's lead — because more capable models have more peaked distributions and thus more hidden tail knowledge. Why it matters: The authors frame this as the ""search quest"" problem — picking a wedding dress, a research topic, a startup name, a school for a kid. The goal isn't the correct answer; it's learning the space. Current LLMs are anti-optimized for that, which the paper argues is quietly driving collective homogenization (they cite a striking incident where students using ChatGPT to outline essays turned in nearly identical arguments without ever talking to each other). 📄 Paper: [https://arxiv.org/abs/2603.19519](https://arxiv.org/abs/2603.19519)

Similar Articles

Faster LLM Inference via Sequential Monte Carlo

arXiv cs.CL

This paper proposes Sequential Monte Carlo Speculative Decoding (SMC-SD), a method that accelerates LLM inference by replacing token-level rejection in speculative decoding with importance-weighted resampling over draft particles, achieving a 2.36× speedup over standard speculative decoding and 5.2× over autoregressive decoding while keeping accuracy loss within 3%.
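The core substitution can be illustrated with a toy importance-resampling step. This is a generic Sequential Monte Carlo sketch under made-up distributions, not the paper's system: the particles and the `p_target`/`p_draft` functions are hypothetical stand-ins for the target and draft model likelihoods.

```python
import random

# Toy sketch of the resampling step that replaces token-level rejection:
# instead of accepting/rejecting each draft token, keep whole draft
# particles in proportion to their importance weights.

def importance_weights(particles, p_target, p_draft):
    """w_i = p_target(x_i) / p_draft(x_i) for each draft particle x_i."""
    return [p_target(x) / p_draft(x) for x in particles]

def resample(particles, weights, rng, k=None):
    """Multinomial resampling: draft continuations the target model
    assigns high probability survive with high frequency."""
    k = k or len(particles)
    return rng.choices(particles, weights=weights, k=k)
```

The design point is that a particle is only down-weighted, never hard-rejected, so useful partial drafts are recycled rather than discarded, which is where the speedup over rejection-based speculative decoding comes from.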