Tag
Moonshot AI released FlashKDA, open-source CUTLASS kernels for Kimi Delta Attention that deliver up to a 2.22× speedup over Triton on H20 GPUs.
Moonshot AI releases FlashKDA, an open-source CUTLASS-based implementation of Kimi Delta Attention kernels that delivers 1.72×–2.22× prefill speedup on H20 GPUs.
A discussion of the shift in GPU kernel engineering from C++ CuTe/CUTLASS to NVIDIA's Python-based CuTeDSL, and of whether new engineers should learn the legacy C++ templates or prioritize the emerging stack for LLM inference work.