small-batch-decode

#small-batch-decode

@shreyansh_26: How do you make a matmul fast when M and N are tiny but K is enormous? (MoE routers, small-batch decode.) Decompose-K: …

X AI KOLs Timeline ↗ · 3d ago Cached

A technique to accelerate matrix multiplication when M and N are small but K is large, as encountered in MoE routers and small-batch decoding, by decomposing K and running partial GEMMs in parallel. The approach beats PyTorch Inductor on most shapes using a custom Triton kernel.

0 favorites 0 likes

small-batch-decode

@shreyansh_26: How do you make a matmul fast when M and N are tiny but K is enormous? (MoE routers, small-batch decode.) Decompose-K: …

Submit Feedback