@Kimi_Moonshot: We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achie…

X AI KOLs Following 04/21/26, 03:12 PM Tools

Summary

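Moonshot AI releases FlashKDA, an open-source CUTLASS-based implementation of Kimi Delta Attention kernels that delivers a 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20 GPUs and works as a drop-in backend for flash-linear-attention.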

We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on github:
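
To put the quoted range in terms of latency: a speedup factor s cuts run time to 1/s of the baseline, so 1.72×–2.22× corresponds to roughly 42% to 55% less prefill time. The sketch below only does that arithmetic; the 100 ms baseline latency is a hypothetical figure chosen for illustration and does not come from the announcement.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of baseline run time eliminated by a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Speedup range quoted in the announcement (prefill, H20).
LOW_SPEEDUP, HIGH_SPEEDUP = 1.72, 2.22

# Hypothetical baseline prefill latency, for illustration only.
BASELINE_MS = 100.0

for s in (LOW_SPEEDUP, HIGH_SPEEDUP):
    new_ms = BASELINE_MS / s
    print(f"{s:.2f}x: {new_ms:.1f} ms, {time_saved_fraction(s):.1%} less time")
```

Actual gains will depend on sequence length, batch shape, and hardware, so the quoted range should be read as the measured envelope on H20 rather than a guarantee.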


Similar Articles

Moonshot open-sourced FlashKDA, CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20

Reddit r/LocalLLaMA

Moonshot AI released FlashKDA, open-source CUTLASS kernels for Kimi Delta Attention that deliver up to a 2.22x speedup over the Triton baseline on H20 GPUs.

@HotAisle: Kimi K2.6 + DFlash: 508 tok/s on 8x MI300X 5.6x throughput improvement over baseline autoregressive serving 90 tok/s → …

X AI KOLs Following

Kimi K2.6, paired with the DFlash inference system, achieves 508 tokens/s on 8× AMD MI300X, a 5.6× throughput jump from a 90 tokens/s autoregressive baseline with zero quality loss.

@Andy_ShuoYang: Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for f…

X AI KOLs Following

The Flash-KMeans team releases FlashLib, a GPU library for classical ML operators that achieves up to 208x speedups over cuML on Hopper GPUs, with a focus on fast, predictable performance for agentic AI workloads.

@hamzaelshafie: New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels" This post…

X AI KOLs Following

A detailed blog post dissecting ThunderKittens, a compact DSL for high-performance AI kernels, including a bottom-up analysis of its abstractions and a benchmark implementing a non-causal attention prefill kernel that outperforms FlashAttention-2 by ~1.55x and matches FlashAttention-3.

@AdinaYakup: Kimi 2.6 is now available on @huggingface https://huggingface.co/moonshotai/Kimi-K2.6… 1T MoE / 32B active / 256K conte…

X AI KOLs Following

Moonshot AI released Kimi K2.6, a 1T-parameter MoE model with 32B active parameters and a 256K context length, featuring a 300-sub-agent swarm capable of 4,000-step reasoning.
