Tag
Improves prefill speeds for k-quants and refactors matrix multiplication for Q4/Q5/Q8 and k-quants in llama.cpp's WebGPU backend.
Chamath explains the two key phases of AI compute: prefill, which is compute-bound and favors parallel GPUs like Nvidia's, and decode, which is memory-bandwidth bound and depends on scanning previously generated tokens.