@charles_irl: A tl;dr for folks who don't care how many warpgroups FA4 devotes to softmax vs MMA loads. Inference is different from t…

Summary

Explains that inference kernels differ from training kernels, and names two main classes of improvement: changing what work is done in parallel (for example, across KV) and supporting small, irregular loads.


Cached at: 06/12/26, 10:58 AM

A tl;dr for folks who don’t care how many warpgroups FA4 devotes to softmax vs MMA loads.

Inference is different from training, so kernels look different.

Two main classes of improvement:

  • change what work is done in parallel (eg across KV)
  • support small, irregular loads https://t.co/5qrPZ3Yv4L
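
To make the first bullet concrete, here is a minimal pure-Python sketch of parallelizing attention across KV for a single decode-time query. It is not taken from the post or from FA4; it only illustrates the general split-KV idea under stated assumptions: each KV chunk is processed independently (the part that could run in parallel), and the partial results are merged with a max-rescaled softmax. All function names (`attend`, `partial_attend`, `split_kv_attend`) and the chunk size are hypothetical.

```python
import math

def _scores(q, keys):
    # Scaled dot products between one query vector and each key.
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def attend(q, keys, values):
    """Reference: softmax(q K^T / sqrt(d)) V for a single query vector."""
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def partial_attend(q, keys, values):
    """One KV chunk: return (local max, sum of exps, unnormalized output)."""
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return m, sum(weights), acc

def split_kv_attend(q, keys, values, chunk):
    """Split KV into chunks, process each independently, then merge.

    The per-chunk calls have no dependencies on each other, so a real
    kernel could assign them to different thread blocks. Here they run
    sequentially for clarity. The last chunk may be shorter than `chunk`.
    """
    parts = [partial_attend(q, keys[i:i + chunk], values[i:i + chunk])
             for i in range(0, len(keys), chunk)]
    m_all = max(m for m, _, _ in parts)
    total = 0.0
    out = [0.0] * len(values[0])
    for m, s, acc in parts:
        r = math.exp(m - m_all)  # rescale this chunk to the global max
        total += r * s
        for j, a in enumerate(acc):
            out[j] += r * a
    return [o / total for o in out]
```

With one query and a long KV cache there is little parallelism across queries, which is why splitting along KV can help at inference time. The merge step is the price: each chunk must carry its local max and exp-sum so the results combine to the same answer as the unsplit computation.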
