Improves prefill speed for k-quants and refactors matrix multiplication for Q4/Q5/Q8 and k-quants in llama.cpp's WebGPU backend.
CODA reparameterizes memory-bound operations in LLM training to fuse them into the matmul epilogue, achieving near state-of-the-art performance with LLM-generated kernels.
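To make the idea of "fusing into the matmul epilogue" concrete, here is a minimal pure-Python sketch. It is not CODA's code or its reparameterization; the function names, the tile size, and the choice of bias plus ReLU as the memory-bound operation are all assumptions made for illustration. The unfused version writes the full product and then re-reads it in a second pass; the fused version applies the same elementwise work to each accumulator just before the single store.

```python
# Illustrative sketch only; names and the bias+ReLU epilogue are hypothetical,
# not taken from CODA.

def relu(x):
    return x if x > 0.0 else 0.0

def matmul_then_bias_relu(a, b, bias):
    """Unfused: materialize the product, then run a second elementwise pass."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
         for i in range(m)]
    # Second pass re-reads every output element (the memory-bound part).
    return [[relu(c[i][j] + bias[j]) for j in range(n)] for i in range(m)]

def matmul_fused_epilogue(a, b, bias, tile=2):
    """Fused: apply bias + ReLU per output tile, before the only store."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for i in range(i0, min(i0 + tile, m)):
                for j in range(j0, min(j0 + tile, n)):
                    acc = 0.0
                    for p in range(k):
                        acc += a[i][p] * b[p][j]
                    # Epilogue: the accumulator is still local here.
                    out[i][j] = relu(acc + bias[j])
    return out
```

Both functions return the same matrix; the difference a real GPU kernel would care about is that the fused form never makes a second trip through memory for the output.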