This post explains the Decompose-K technique for accelerating skinny, large-K matrix multiplications, where M and N are small and the shared K dimension is very large. The idea is to split the K dimension into chunks, run the chunks as one batched matmul, and sum the partial products. The post provides a PyTorch implementation and benchmarks showing significant speedups over standard torch.compile on these poorly shaped matmuls.
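The split, batched-matmul, and sum steps described above can be sketched in a few lines of PyTorch. This is a minimal illustration and not necessarily the post's own implementation: the function name `decompose_k_matmul` and the `k_splits` parameter are hypothetical, and the sketch assumes 2-D inputs whose K dimension divides evenly by `k_splits`.

```python
import torch


def decompose_k_matmul(a: torch.Tensor, b: torch.Tensor, k_splits: int) -> torch.Tensor:
    """Compute a @ b by splitting the shared K dimension into k_splits chunks.

    Hypothetical helper for illustration: a is (M, K), b is (K, N),
    and K is assumed to be divisible by k_splits.
    """
    m, k = a.shape
    k_b, n = b.shape
    if k != k_b:
        raise ValueError(f"inner dimensions differ: {k} vs {k_b}")
    if k % k_splits != 0:
        raise ValueError(f"K={k} is not divisible by k_splits={k_splits}")
    chunk = k // k_splits

    # (M, K) -> (M, S, C) -> (S, M, C): one (M, C) slice of a per K-chunk.
    a_chunks = a.reshape(m, k_splits, chunk).permute(1, 0, 2)
    # (K, N) -> (S, C, N): one (C, N) slice of b per K-chunk.
    b_chunks = b.reshape(k_splits, chunk, n)

    # Batched matmul yields S partial products of shape (M, N).
    partials = torch.bmm(a_chunks, b_chunks)
    # Summing over the split dimension recovers the full product.
    return partials.sum(dim=0)
```

Calling `decompose_k_matmul(a, b, k_splits=8)` on an `(8, 1024)` by `(1024, 16)` pair should match `a @ b` up to floating-point rounding. Any speedup depends on running the function under a compiler or on a GPU where the extra batch dimension exposes more parallelism; the eager sketch here only demonstrates that the decomposition is numerically equivalent.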