@jun_song: Working on fitting Kimi-K2.6 (1T) on 128GB Mac. Trying to get 40tok/s, and minimize the quality loss.
Summary
A developer is optimizing the Kimi-K2.6 (1T) model to run efficiently on a 128GB Mac, targeting 40 tokens per second while minimizing quality loss.
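Rough arithmetic shows why this is hard: 1T weights in 128 GB leaves roughly one bit per weight on average before the OS and KV cache take their share. A minimal sketch, assuming a hypothetical 16 GB reserve:

```python
# Back-of-envelope: average bits per weight available when fitting
# 1T parameters into 128 GB of unified memory. The 16 GB reserve
# for the OS and KV cache is an assumption for illustration.

PARAMS = 1.0e12              # Kimi-K2.6 total parameter count
BUDGET_GIB = 128             # Mac unified memory
RESERVE_GIB = 16             # assumed OS + KV-cache headroom

usable_bits = (BUDGET_GIB - RESERVE_GIB) * 1024**3 * 8
print(f"budget: {usable_bits / PARAMS:.2f} bits/weight on average")
# -> budget: 0.96 bits/weight on average

# Weight-only footprint at common quantization levels:
for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2), ("Q1", 1)]:
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{name:>4}: {gib:7.1f} GiB")
# Even 1-bit weights (~116 GiB) nearly exhaust the machine, which is
# why sub-1-bit average schemes or SSD offload enter the picture.
```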
Similar Articles
@QuixiAI: @Kimi_Moonshot K2.6 running on my mi300x, 56 tps (single request). I will run a throughput test
Kimi K2.6 achieves 56 tokens per second on a single MI300X GPU (single request); the user plans a follow-up throughput benchmark.
@HotAisle: Kimi K2.6 + DFlash: 508 tok/s on 8x MI300X; 5.6x throughput improvement over baseline autoregressive serving; 90 tok/s → …
Kimi K2.6 paired with DFlash inference system achieves 508 tokens/s on 8×AMD MI300X, a 5.6× throughput jump from 90 tokens/s baseline with zero quality loss.
@AdinaYakup: Kimi 2.6 is now available on @huggingface https://huggingface.co/moonshotai/Kimi-K2.6… 1T MoE / 32B active / 256K conte…
Moonshot AI released Kimi K2.6, a 1T-parameter MoE model with 32B active parameters and a 256K context length, featuring a 300-sub-agent swarm capable of 4,000-step reasoning.
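The 1T-total / 32B-active split is what makes targets like the 40 tok/s above plausible: decode is typically memory-bandwidth-bound, and only the active parameters are read per token. A rough ceiling estimate in the same spirit; the quantization level and bandwidth figure are assumptions, not from the post:

```python
# Rough decode-speed ceiling for a bandwidth-bound MoE: only the
# active parameters must be streamed from memory for each token.
# The quantization level and bandwidth below are assumed figures.

ACTIVE_PARAMS = 32e9        # active parameters per token (model card)
BITS_PER_WEIGHT = 4         # assumed quantization level
BANDWIDTH = 800e9           # assumed unified-memory bandwidth, bytes/s

bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8   # 16 GB
print(f"ceiling ≈ {BANDWIDTH / bytes_per_token:.0f} tok/s")
# -> ceiling ≈ 50 tok/s, comfortably above a 40 tok/s target
```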
@sanbuphy: K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times…
K2.6 downloaded and deployed Qwen3.5-0.8B locally on a Mac, implementing and optimizing inference in the niche Zig language; over 4,000+ tool calls and 12+ hours of continuous operation it iterated 14 times, boosting throughput from ~15 tokens/s to ~193 tokens/s, ultimately achieving 20% faster inference than LM Studio.
@tom_doerr: Runs 35B models on 16GB RAM Macs https://github.com/walter-grace/mac-code…
A tool that enables running large language models like Qwen3.5-35B on 16GB Macs by streaming model weights from SSD, achieving up to 30 tok/s with an optimal configuration.
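The SSD-streaming idea is presumably memory-mapping: map the weight file instead of loading it, and let the OS page each layer in from disk on first touch. A toy sketch of that mechanism, not the linked tool's actual implementation (file name and shapes are made up):

```python
# Toy illustration of weight streaming via mmap: the OS pages each
# layer in from SSD the first time the forward pass touches it,
# so resident RAM stays far below the full model size.
import os
import numpy as np

PATH = "weights.bin"                        # hypothetical weight file
N_LAYERS, HIDDEN = 8, 1024                  # toy shapes, not a real model

# Create a dummy weight file once so the sketch is self-contained.
if not os.path.exists(PATH):
    np.memmap(PATH, dtype=np.float16, mode="w+",
              shape=(N_LAYERS, HIDDEN, HIDDEN)).flush()

# Map the file without reading it into RAM.
weights = np.memmap(PATH, dtype=np.float16, mode="r",
                    shape=(N_LAYERS, HIDDEN, HIDDEN))

x = np.ones(HIDDEN, dtype=np.float16)
for i in range(N_LAYERS):
    x = x @ weights[i]                      # layer i is faulted in on demand
```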