Tag
This paper proposes that the KV cache in transformers acts as a notebook of memoized conclusions, enabling surgical editing and composition without full recomputation. The method achieves significant latency reductions while preserving decision equivalence across model scales.
Improves prefill speeds for k-quants and refactors matrix multiplication for Q4/Q5/Q8 and k-quants in llama.cpp's WebGPU backend.
Chamath explains the two key phases of AI inference compute: prefill, which is compute-bound and favors highly parallel GPUs such as Nvidia's, and decode, which is memory-bandwidth-bound because producing each new token requires scanning the previously generated tokens.
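
The prefill/decode contrast can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not taken from any of the items above; it models a single square weight matrix applied to a batch of tokens, and the function names, the fp16 element size, and the accelerator numbers (100 TFLOP/s peak, 1 TB/s bandwidth) are all hypothetical values chosen for illustration. It also ignores attention and KV-cache traffic, which would add further memory reads during decode.

```python
def arithmetic_intensity(n_tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one (n_tokens x d_model) @ (d_model x d_model) matmul."""
    # Each output element needs d_model multiply-adds, i.e. 2 * d_model FLOPs.
    flops = 2 * n_tokens * d_model * d_model
    # Read the weights once, read the input activations, write the output activations.
    bytes_moved = bytes_per_elem * (d_model * d_model + 2 * n_tokens * d_model)
    return flops / bytes_moved


def classify(intensity: float, peak_flops: float, bandwidth: float) -> str:
    """Compare against the roofline ridge point (peak FLOP/s divided by bytes/s)."""
    ridge = peak_flops / bandwidth
    return "compute-bound" if intensity > ridge else "memory-bound"


# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
BANDWIDTH = 1e12

prefill = arithmetic_intensity(n_tokens=2048, d_model=4096)  # whole prompt at once
decode = arithmetic_intensity(n_tokens=1, d_model=4096)      # one token per step

print("prefill:", prefill, classify(prefill, PEAK_FLOPS, BANDWIDTH))
print("decode: ", decode, classify(decode, PEAK_FLOPS, BANDWIDTH))
```

Under these assumptions, prefill reuses each weight across thousands of tokens and lands well above the ridge point, while decode touches every weight to produce a single token and falls far below it.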