@shreyansh_26: How do you make a matmul fast when M and N are tiny but K is enormous? (MoE routers, small-batch decode.) Decompose-K: …


Summary

A technique for accelerating matrix multiplication when M and N are small but K is large, as in MoE routers and small-batch decoding: split K into S chunks, run the S partial GEMMs in parallel, and sum the results, folding the epilogue into the reduction store. A hand-written Triton kernel built on this approach beats PyTorch Inductor on 26 of 28 shapes.


Cached at: 06/23/26, 04:12 PM

How do you make a matmul fast when M and N are tiny but K is enormous? (MoE routers, small-batch decode.)

Decompose-K: split K, run S partial GEMMs in parallel, sum them, and fold the epilogue into the reduction store.

New post: from torch.compile → custom-op autotuning → a hand-written Triton kernel that beats Inductor on 26/28 shapes.

Based on the PyTorch Conf talk by @pz_ai1 & Elias Ellison.
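
The arithmetic behind Decompose-K can be sketched in a few lines of plain Python. This is only a reference illustration of "split K, run S partial GEMMs, sum them, apply the epilogue as the reduced value is stored"; it is not the post's Triton kernel or its torch.compile path, and it makes no attempt to be fast. The names `matmul`, `decompose_k_matmul`, `num_splits`, and `epilogue` are hypothetical, and the sketch assumes S divides K evenly.

```python
from typing import Callable, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference GEMM: (M x K) @ (K x N) using nested Python lists."""
    k, n = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(k)) for j in range(n)] for row in a]


def decompose_k_matmul(
    a: Matrix,
    b: Matrix,
    num_splits: int,
    epilogue: Callable[[float], float] = lambda x: x,
) -> Matrix:
    """Compute epilogue(a @ b) by splitting K into num_splits chunks.

    Assumption for this sketch: num_splits divides K evenly.
    """
    m, k, n = len(a), len(b), len(b[0])
    if k % num_splits != 0:
        raise ValueError("this sketch assumes num_splits divides K evenly")
    chunk = k // num_splits

    # S independent partial GEMMs, each (M x chunk) @ (chunk x N).
    # On a GPU these would run in parallel; here they run in a loop.
    partials = []
    for s in range(num_splits):
        lo, hi = s * chunk, (s + 1) * chunk
        a_s = [row[lo:hi] for row in a]  # columns lo..hi of A
        b_s = b[lo:hi]                   # rows lo..hi of B
        partials.append(matmul(a_s, b_s))

    # Reduce over the S partials and apply the epilogue at the moment each
    # output element is written, instead of in a separate pass afterwards.
    return [
        [epilogue(sum(p[i][j] for p in partials)) for j in range(n)]
        for i in range(m)
    ]


# Tiny M and N, larger K; integer-valued floats keep the comparison exact.
a = [[float(i + t) for t in range(8)] for i in range(2)]
b = [[float((t * 3 + j) % 5 - 2) for j in range(3)] for t in range(8)]
relu = lambda x: max(x, 0.0)
assert decompose_k_matmul(a, b, num_splits=4) == matmul(a, b)
assert decompose_k_matmul(a, b, 2, relu) == [[relu(x) for x in row] for row in matmul(a, b)]
```

The split exposes S times more independent work than a single GEMM over tiny M and N would, at the cost of an extra reduction; folding the epilogue into that reduction avoids one more pass over the output.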

Similar Articles

@shreyansh_26: https://x.com/shreyansh_26/status/2069125463860302212

X AI KOLs Timeline

This post explains the Decompose-K technique for accelerating skinny, large-K matrix multiplications: split the K dimension into chunks, run the chunks as a batched matmul, and sum the partial results. It provides a PyTorch implementation and benchmarks showing significant speedups over standard torch.compile on poorly shaped matmuls.
