Tag
A technique to accelerate matrix multiplication when M and N are small but K is large, as encountered in MoE routers and small-batch decoding: split the K dimension into chunks, run the partial GEMMs in parallel, and sum the partial results. Implemented as a custom Triton kernel, the approach beats PyTorch Inductor on most shapes.
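The decomposition described above is commonly called split-K. As a rough illustration of the arithmetic only, here is a minimal NumPy sketch. It is not the post's Triton kernel: the function name `split_k_matmul`, the `num_splits` parameter, and the example shapes are all assumptions made for this sketch, and the partial products run sequentially here rather than in parallel.

```python
import numpy as np

def split_k_matmul(a, b, num_splits=4):
    """Compute a @ b by splitting the shared K dimension into chunks.

    Hypothetical helper for illustration; a GPU kernel would assign each
    chunk to a separate program and reduce the partial results afterwards.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    # array_split tolerates K not being divisible by num_splits and
    # produces matching chunk sizes for both operands.
    a_chunks = np.array_split(a, num_splits, axis=1)  # each (M, K_i)
    b_chunks = np.array_split(b, num_splits, axis=0)  # each (K_i, N)
    # Each partial GEMM is independent of the others.
    partials = [ac @ bc for ac, bc in zip(a_chunks, b_chunks)]
    # Reduction step: sum the (M, N) partial results.
    return np.sum(partials, axis=0)

# Small M and N, large K (shapes chosen only for the example).
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 4096))
b = rng.standard_normal((4096, 16))
out = split_k_matmul(a, b)
```

With small M and N, a conventional tiling over the output produces very few tiles, so most of the GPU sits idle. Splitting K creates more independent units of work, at the cost of the extra reduction step.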
This blog explores using LLM-guided autotuning to accelerate kernel configuration search in PyTorch's Helion DSL, replacing the slower Likelihood-Free Bayesian Optimization approach.