autotuning

#autotuning

@shreyansh_26: How do you make a matmul fast when M and N are tiny but K is enormous? (MoE routers, small-batch decode.) Decompose-K: …

X AI KOLs Timeline ↗ · 3d ago Cached

Decompose-K is a technique for accelerating matrix multiplication when M and N are small but K is large, as in MoE routers and small-batch decoding. It splits the K dimension into chunks, runs the partial GEMMs in parallel, and sums the partial results. Implemented as a custom Triton kernel, the approach beats PyTorch Inductor on most shapes.
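The algebra behind the idea can be sketched in a few lines of plain Python. This is only an illustration of splitting K and summing the partial products, not the Triton kernel from the post: the function names `matmul` and `decompose_k_matmul` and the `num_splits` parameter are hypothetical, the loop over chunks runs sequentially here (on a GPU each chunk would be an independent partial GEMM), and K is assumed to divide evenly by the number of splits.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference product of an (M x K) matrix and a (K x N) matrix."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def decompose_k_matmul(a: Matrix, b: Matrix, num_splits: int) -> Matrix:
    """Compute a @ b by splitting K into num_splits equal chunks.

    Hypothetical helper for illustration: each chunk yields an (M x N)
    partial product, and the partial products are summed at the end.
    """
    k = len(b)
    if num_splits <= 0 or k % num_splits != 0:
        raise ValueError("this sketch assumes K is divisible by num_splits")
    chunk = k // num_splits
    m, n = len(a), len(b[0])

    out = [[0] * n for _ in range(m)]
    for s in range(num_splits):
        lo, hi = s * chunk, (s + 1) * chunk
        a_chunk = [row[lo:hi] for row in a]   # (M x chunk) slice of A
        b_chunk = b[lo:hi]                    # (chunk x N) slice of B
        partial = matmul(a_chunk, b_chunk)    # independent partial GEMM
        for i in range(m):                    # reduction over the splits
            for j in range(n):
                out[i][j] += partial[i][j]
    return out
```

With tiny M and N, a single GEMM exposes few independent output tiles, so the split along K is what creates enough parallel work; the cost is the extra reduction over the partial results.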


#autotuning

@PyTorch: Autotuning is the backbone of Helion, PyTorch's DSL for performance portable ML kernels. Currently Helion searches util…

X AI KOLs Following ↗ · 2026-06-18 Cached

This blog post explores LLM-guided autotuning as a way to speed up kernel configuration search in Helion, PyTorch's DSL, replacing the slower Likelihood-Free Bayesian Optimization approach.


