Moonshot open-sourced FlashKDA, CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20

Reddit r/LocalLLaMA 04/22/26, 12:15 AM Tools

Summary

MoonshotAI released FlashKDA, open-source CUTLASS kernels for Kimi Delta Attention that deliver up to 2.22x speedup over Triton on H20 GPUs.

[github.com/MoonshotAI/FlashKDA](http://github.com/MoonshotAI/FlashKDA) Been comparing how different routing layers handle K2.6 this week, OpenRouter, Together, Orq, and while digging around I came across FlashKDA which Moonshot dropped alongside the K2.6 activity. Seems to be flying under the radar, sharing here because the kernel work is genuinely interesting on its own, separate from the model release. What it is. A CUTLASS C++ implementation of the forward kernel for Kimi Delta Attention, the linear attention variant from the Kimi Linear paper. It plugs into flash-linear-attention as a backend through FLA pull request #852, so anyone already using FLA for KDA based models can route through FlashKDA at the backend layer. Numbers from their H20 benchmark, measured against FLA's existing Triton path: At T=8192, H=96, D=128, fixed length sequences, 1.72x. Variable length with mixed seq\_lens, 1.95x. Variable length with uniform 1024x8, 2.22x. Why this matters. Linear attention architectures like KDA promise linear scaling with sequence length, but the promise only holds if the kernel implementation is actually hardware efficient. FLA's Triton path is the reference and it works, but CUTLASS tuned for Hopper memory access patterns is how you close the gap between the theoretical cost model and what you see on a real GPU. Requirements are SM90 and above, CUDA 12.9 and above, PyTorch 2.4 and above. MIT licensed. One honest limitation worth flagging, the benchmark is forward pass only and all numbers are on H20. H20 is the China specific Hopper variant so absolute numbers on H100 or Blackwell will differ. The relative speedup should be directionally similar but nobody has posted those numbers yet. Curious whether anyone on here has tested it on H100, or has thoughts on when a backward pass kernel might land. The forward only story limits the training use case right now.

Original Article

Moonshot open-sourced FlashKDA, CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20

Similar Articles

@Kimi_Moonshot: We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achie…

@HotAisle: Kimi K2.6 + DFlash: 508 tok/s on 8x MI300X 5.6x throughput improvement over baseline autoregressive serving 90 tok/s → …

@AdinaYakup: Kimi 2.6 is now available on @huggingface https://huggingface.co/moonshotai/Kimi-K2.6… 1T MoE / 32B active / 256K conte…

@QuixiAI: @Kimi_Moonshot K2.6 running on my mi300x, 56 tps (single request). I will run a throughput test

@gnotuy: We open sourced Kimi K2.6. The next frontier in test-time compute isn't bigger models. It's better organizations of int…

Submit Feedback

Similar Articles

@Kimi_Moonshot: We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achie…

@HotAisle: Kimi K2.6 + DFlash: 508 tok/s on 8x MI300X 5.6x throughput improvement over baseline autoregressive serving 90 tok/s → …

@AdinaYakup: Kimi 2.6 is now available on @huggingface https://huggingface.co/moonshotai/Kimi-K2.6… 1T MoE / 32B active / 256K conte…

@QuixiAI: @Kimi_Moonshot K2.6 running on my mi300x, 56 tps (single request). I will run a throughput test
Kimi K2.6 achieves 56 tokens per second on a single MI300X GPU; user plans further throughput benchmarking.

@gnotuy: We open sourced Kimi K2.6. The next frontier in test-time compute isn't bigger models. It's better organizations of int…