This blog post analyzes the PivCo-Huffman paper, which introduces 'merge' operations for parallel Huffman decoding, enabling efficient vectorized and GPU-friendly decoding without interleaving overhead.
Speculative decoding, inspired by 1990s CPU branch prediction, is now used by Anthropic, Google, and Meta to speed up LLM inference 2-3x. A small draft model guesses several future tokens, and the large model verifies them all in a single parallel pass, putting to work GPU compute that would otherwise sit idle during token-by-token decoding.
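The guess-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of the greedy variant only, not the implementation used by any of the companies mentioned: the names `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical, the "models" are plain Python functions that return the next token, and the verification calls run one after another here, whereas a real system would batch them into a single forward pass of the large model.

```python
from typing import Callable, List

Token = int
# Assumption: a "model" is any function mapping a token context to its greedy next token.
Model = Callable[[List[Token]], Token]


def speculative_decode(target: Model, draft: Model, prompt: List[Token],
                       max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: output matches plain greedy decoding with `target`."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target model checks every proposed position.
        #    These k + 1 calls are independent of each other, so a real
        #    system would evaluate them in one batched forward pass.
        for i in range(k + 1):
            expected = target(out + proposal[:i])
            out.append(expected)
            # Stop at the first mismatch (expected is the correction), or
            # after all k guesses were accepted (expected is a bonus token).
            if i == k or proposal[i] != expected:
                break
    return out[:goal]


# Toy models (hypothetical): the target counts upward modulo 10,
# and the draft agrees except when the last token is 3 or 7.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] % 4 == 3 else (ctx[-1] + 1) % 10


result = speculative_decode(toy_target, toy_draft, prompt=[0], max_new=8, k=4)
print(result)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

In this toy run each round accepts three drafted tokens plus one correction from the target, so eight new tokens take two verification rounds instead of eight sequential target steps. Production systems that sample rather than decode greedily replace the equality check with a probabilistic accept/reject rule, which this sketch does not cover.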