@shreyansh_26: How do you make a matmul fast when M and N are tiny but K is enormous? (MoE routers, small-batch decode.) Decompose-K: …

X AI KOLs Timeline 06/22/26, 06:31 PM Tools

matmul-optimization moe-routers small-batch-decode decompose-k triton-kernel torch-compile autotuning

Summary

Decompose-K accelerates matrix multiplication when M and N are small but K is large, as in MoE routers and small-batch decoding, by splitting the K dimension, running the partial GEMMs in parallel, and summing the results. A hand-written Triton kernel built on this approach beats PyTorch Inductor on 26 of 28 shapes.

How do you make a matmul fast when M and N are tiny but K is enormous? (MoE routers, small-batch decode.)

Decompose-K: split K, run S partial GEMMs in parallel, sum them, and fold the epilogue into the reduction store.

New post: from torch.compile → custom-op autotuning → a hand-written Triton kernel that beats Inductor on 26/28 shapes.

Based on the PyTorch Conf talk by @pz_ai1 & Elias Ellison.
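The steps named in the post (split K, run S partial GEMMs, sum them, apply the epilogue) can be sketched in a few lines. This is a minimal illustration, not the post's Triton kernel: it uses NumPy on the CPU as a stand-in, and the function name `decompose_k_matmul`, the `num_splits` parameter (the post's S), and the `epilogue` callback are all hypothetical names chosen for this example. It also assumes S divides K evenly.

```python
import numpy as np


def decompose_k_matmul(a, b, num_splits, epilogue=None):
    """Compute epilogue(a @ b) by splitting the shared K dimension.

    a: (M, K), b: (K, N). num_splits is S, the number of K chunks.
    Hypothetical helper for illustration only.
    """
    m, k = a.shape
    k_b, n = b.shape
    if k != k_b:
        raise ValueError("inner dimensions must match")
    if k % num_splits != 0:
        raise ValueError("this sketch assumes num_splits divides K evenly")
    chunk = k // num_splits

    # (M, K) -> (M, S, K/S) -> (S, M, K/S): one skinny left operand per chunk.
    a_parts = a.reshape(m, num_splits, chunk).transpose(1, 0, 2)
    # (K, N) -> (S, K/S, N): the matching rows of b for each chunk.
    b_parts = b.reshape(num_splits, chunk, n)

    # S independent partial GEMMs expressed as one batched matmul: (S, M, N).
    partials = np.matmul(a_parts, b_parts)

    # Reduce over the split axis to recover the full product.
    out = partials.sum(axis=0)

    # The epilogue runs once on the reduced result. A nonlinear epilogue
    # (for example ReLU) must not be applied to the individual partials.
    if epilogue is not None:
        out = epilogue(out)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((8, 4096))   # tiny M, large K
    b = rng.standard_normal((4096, 16))  # large K, tiny N
    out = decompose_k_matmul(a, b, num_splits=8)
    print(np.allclose(out, a @ b))
```

On a GPU, the point of the split is that the batch axis of size S exposes S times more independent work than a single tiny M×N output would. In a fused kernel the epilogue would be applied as the reduced value is written out, which is what applying it after the sum mimics here.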


Cached at: 06/23/26, 04:12 PM

Similar Articles

@shreyansh_26: https://x.com/shreyansh_26/status/2069125463860302212

X AI KOLs Timeline

This post explains the Decompose-K technique for accelerating skinny, large-K matrix multiplications by splitting the K dimension into chunks, running batched matmuls, and summing the partials. It provides a PyTorch implementation and benchmarks showing significant speedups over standard torch.compile for poorly shaped matmuls.

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Reddit r/LocalLLaMA

A developer demonstrates running MoE models like Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8GB VRAM) with 128k context using llama.cpp with MoE offloading and TurboQuant KV cache quantization, revealing optimization tricks for Gemma's MTP speculative decoding.

Optimizing Models to Be Fast at Codegen (8 minute read)

TLDR AI

Morph LLC describes three key techniques—training a speculator on coding output, auto-searching kernels on cheap GPUs, and writing a custom interconnect—to dramatically speed up open models like Qwen and DeepSeek for coding agent workloads, achieving up to 3x speculative decoding speedup and 97-162 tok/s on a $7K GPU.

@jun_song: If we ever figure out how to load ONLY the active params of an MoE into the GPU instead of the full weights, it's game …

X AI KOLs Following

The author speculates that loading only active parameters of MoE models onto GPUs could drastically improve efficiency and allow running large models like Kimi locally, though acknowledges this is currently impractical.

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50). Playing with a hypothesis like speculative decoding, but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory: small quants don't use all the available compute.