@HotAisle: Kimi K2.6 + DFlash: 508 tok/s on 8x MI300X 5.6x throughput improvement over baseline autoregressive serving 90 tok/s → …
Summary
Kimi K2.6 paired with the DFlash inference system reaches 508 tokens/s on 8× AMD MI300X GPUs, a 5.6× throughput improvement over the 90 tokens/s baseline of autoregressive serving, with zero quality loss.
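As a quick sanity check on the figures quoted above, the speedup is just the ratio of the two throughput numbers. The sketch below uses only the 508 tok/s and 90 tok/s values from the post; the function name `speedup` is a hypothetical helper for illustration, not part of any DFlash or Kimi tooling.

```python
def speedup(optimized_tps: float, baseline_tps: float) -> float:
    """Return the throughput ratio of an optimized run over a baseline run."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return optimized_tps / baseline_tps


# Figures quoted in the post: 508 tok/s with DFlash vs 90 tok/s baseline.
ratio = speedup(508.0, 90.0)
print(f"{ratio:.1f}x")  # prints "5.6x"
```

The unrounded ratio is about 5.64, so the reported 5.6× is consistent with the two throughput figures.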
Similar Articles
@QuixiAI: @Kimi_Moonshot K2.6 running on my mi300x, 56 tps (single request). I will run a throughput test
Kimi K2.6 reaches 56 tokens per second for a single request on MI300X hardware; the user plans to follow up with a throughput test.
@jun_song: Working on fitting Kimi-K2.6 (1T) on 128GB Mac. Trying to get 40tok/s, and minimize the quality loss.
A developer is optimizing the Kimi-K2.6 (1T) model to run efficiently on a 128GB Mac, targeting 40 tokens per second while minimizing quality loss.
@Kimi_Moonshot: We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achie…
Moonshot AI releases FlashKDA, an open-source CUTLASS-based implementation of Kimi Delta Attention kernels that delivers 1.72×–2.22× prefill speedup on H20 GPUs.
@iotcoi: Qwen3.6-27B-FP8 + Dflash + DDTree, 256k context, 10 agents ~200 tokens/sec max decode 136t/s average on a single tiny G…
FP8-quantized Qwen3.6-27B reaches roughly 200 tok/s peak decode (136 tok/s average) with 256k context and 10 agents on a single 49W GB10 GPU using DFlash and DDTree optimizations.
Moonshot open-sourced FlashKDA, CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20
Moonshot AI released FlashKDA, open-source CUTLASS kernels for Kimi Delta Attention that deliver up to 2.22x speedup over the Triton baseline on H20 GPUs.