@HotAisle: Kimi K2.6 + DFlash: 508 tok/s on 8x MI300X 5.6x throughput improvement over baseline autoregressive serving 90 tok/s → …

X AI KOLs Following 04/21/26, 04:01 PM Models

Summary

Kimi K2.6 paired with the DFlash inference system reaches 508 tokens/s on 8× AMD MI300X, a 5.6× throughput gain over the 90 tokens/s autoregressive baseline on the same hardware and model, with zero reported quality loss.

Kimi K2.6 + DFlash: 508 tok/s on 8x MI300X
5.6x throughput improvement over baseline autoregressive serving
90 tok/s → 508 tok/s on the same hardware, same model, zero quality loss
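
The quoted speedup can be checked from the two throughput figures in the post. The sketch below is only arithmetic on those numbers; the helper names (speedup, seconds_for_tokens) and the 1,000-token example are illustrative assumptions, not part of the original benchmark.

```python
# Throughput figures quoted in the post (tokens per second on 8x MI300X).
BASELINE_TOK_S = 90.0   # baseline autoregressive serving
DFLASH_TOK_S = 508.0    # Kimi K2.6 + DFlash


def speedup(new_tok_s: float, old_tok_s: float) -> float:
    """Return the throughput ratio new/old."""
    return new_tok_s / old_tok_s


def seconds_for_tokens(n_tokens: int, tok_s: float) -> float:
    """Time to emit n_tokens at a constant rate (hypothetical helper)."""
    return n_tokens / tok_s


ratio = speedup(DFLASH_TOK_S, BASELINE_TOK_S)
print(f"speedup: {ratio:.2f}x")  # 508 / 90, which the post rounds to 5.6x

# Illustrative only: time to generate 1,000 tokens at each rate.
print(f"baseline: {seconds_for_tokens(1000, BASELINE_TOK_S):.2f} s")
print(f"DFlash:   {seconds_for_tokens(1000, DFLASH_TOK_S):.2f} s")
```

The ratio comes out slightly above 5.6, consistent with the rounded figure in the post.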

Similar Articles

@QuixiAI: @Kimi_Moonshot K2.6 running on my mi300x, 56 tps (single request). I will run a throughput test
Kimi K2.6 runs at 56 tokens per second for a single request on the author's MI300X system; the author plans a follow-up throughput test.

@jun_song: Working on fitting Kimi-K2.6 (1T) on 128GB Mac. Trying to get 40tok/s, and minimize the quality loss.
A developer is optimizing the Kimi-K2.6 (1T) model to run efficiently on a 128GB Mac, targeting 40 tokens per second while minimizing quality loss.

@Kimi_Moonshot: We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achie…

@iotcoi: Qwen3.6-27B-FP8 + Dflash + DDTree, 256k context, 10 agents ~200 tokens/sec max decode 136t/s average on a single tiny G…

Moonshot open-sourced FlashKDA, CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20
MoonshotAI released FlashKDA, open-source CUTLASS kernels for Kimi Delta Attention that deliver up to 2.22x speedup over Triton on H20 GPUs.