@iotcoi: Qwen3.6-27B-FP8 + Dflash + DDTree, 256k context, 10 agents ~200 tokens/sec max decode 136t/s average on a single tiny G…

X AI KOLs Timeline 04/22/26, 01:54 PM Models

Summary

Quantized 27B Qwen3.6 model achieves 200 tok/s peak (136 avg) with 256k context and 10 agents on a single 49W GB10 GPU using Dflash+DDTree optimizations.

Qwen3.6-27B-FP8 + Dflash + DDTree, 256k context, 10 agents ~200 tokens/sec max decode 136t/s average on a single tiny GB10 GPU at 49W power

Original Article

View Cached Full Text

Cached at: 04/22/26, 05:51 PM

Qwen3.6-27B-FP8 + Dflash + DDTree, 256k context, 10 agents ~200 tokens/sec max decode 136t/s average on a single tiny GB10 GPU at 49W power

Similar Articles

@iotcoi: Ran Google’s cookbook with 10 agents on my tiny GB10 GPU. 436 tok/s / 43.6 per agent Qwen3.6-35B + Dflash + DDTree on v…

X AI KOLs Timeline

A developer ran 10 concurrent agents of the 35B-parameter Qwen3.6 model on a single 74W GB10 GPU at 436 tok/s total using vLLM, demonstrating high-efficiency edge deployment.

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM

Reddit r/LocalLLaMA

A quantized version of Qwen3.6 27B using a pure Q4_K_M method fits entirely in 16 GB VRAM, achieving up to 40 tok/s token generation speed with MTP, and significantly reducing model size compared to other GGUF variants.

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Reddit r/LocalLLaMA

The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.

Qwen 3.6 27B Speculative Decoding Bench: Pushing ~100 TPS on a single RTX 3090

Reddit r/LocalLLaMA

A detailed benchmark comparing speculative decoding engines for Qwen 3.6 27B on a single RTX 3090, showing ik_llama achieving ~100 tokens per second in code generation. Results include decode TPS, TTFT, VRAM usage, and context degradation across 5 engine variants.

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Reddit r/LocalLLaMA

A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.

Similar Articles

@iotcoi: Ran Google’s cookbook with 10 agents on my tiny GB10 GPU. 436 tok/s / 43.6 per agent Qwen3.6-35B + Dflash + DDTree on v…

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Qwen 3.6 27B Speculative Decoding Bench: Pushing ~100 TPS on a single RTX 3090

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Submit Feedback