Tag
A technique to accelerate matrix multiplication when M and N are small but K is large, as encountered in MoE routers and small-batch decoding, by decomposing K and running partial GEMMs in parallel. The approach beats PyTorch Inductor on most shapes using a custom Triton kernel.