cutlass

#cutlass

Moonshot open-sourced FlashKDA, CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20

Reddit r/LocalLLaMA ↗ · 2026-04-22

MoonshotAI released FlashKDA, open-source CUTLASS kernels for Kimi Delta Attention that deliver up to 2.22x speedup over Triton on H20 GPUs.

#cutlass

@Kimi_Moonshot: We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achie…

X AI KOLs Following ↗ · 2026-04-21

Moonshot AI releases FlashKDA, an open-source CUTLASS-based implementation of Kimi Delta Attention kernels that delivers 1.72×–2.22× prefill speedup on H20 GPUs.

#cutlass

C++ CuTe / CUTLASS vs CuTeDSL (Python) in 2026: what should new GPU kernel / LLM inference engineers actually learn? [D]

Reddit r/MachineLearning ↗ · 2026-04-20

A discussion of the shift in GPU kernel engineering from C++ CuTe/CUTLASS to NVIDIA's Python-based CuTeDSL, asking whether new engineers should learn the legacy C++ templates or prioritize the emerging stack for LLM inference work.
