@davideciffa: If you have an Nvidia RTX 4090 --ddtree-budget 36 is the best configuration that buys you 2.5x speed up during decoding…
Summary
A tweet recommending --ddtree-budget 36 as the best configuration for the Nvidia RTX 4090, claiming a 2.5x decoding speedup for Qwen3.6_27B and crediting a linked benchmark.
Cached at: 05/24/26, 04:35 PM
If you have an Nvidia RTX 4090 --ddtree-budget 36 is the best configuration that buys you 2.5x speed up during decoding for Qwen3.6_27B. Thanks for the benchmark https://t.co/bs8xGnAl76 🙌 https://t.co/mO82mEWH7S
Similar Articles
[Benchmark] DFlash Speculative Decoding + KV Cache Compression on RTX 5090 — 3.26x Speedup
Benchmarks of DFlash speculative decoding combined with KV cache compression on RTX 5090 show up to 3.26x speedup on Qwen3.6-27B with minimal perplexity degradation, with q4_0/turbo4 providing the best balance.
Best Settings for 48GB VRAM + Qwen 3.6 27B
A user shares optimized settings for running Qwen3.6 27B (Q8_0) on a dual GPU setup (RTX 4090 + RTX 3090) with llama.cpp, achieving 75-100 t/s and 1500 pp with 250k context.
Cheapest hardware for Qwen 3.6: both 27B and 35B-A3B
Discusses the cheapest hardware options for running Qwen 3.6 models, comparing RTX 3090 and Tesla V100 GPUs, and provides a detailed cost breakdown for a system at around $2000.
@DeepTechTR: Qwen 3.6 27B is incredibly fast with 16 GB VRAM! The impact of Pure Quant The era of the 27B model that runs seamlessly…
Qwen 3.6 27B runs fast on 16 GB of VRAM thanks to 'Pure Quant' quantization, reaching 40 tokens/s with MTP and supporting a 64k context, which enables local AI on consumer GPUs like the RTX 4060 Ti.
Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result
A detailed account of running the Qwen3.6-35B-A3B MoE model on an 8GB laptop GPU, covering effective optimizations like --no-mmap and VRAM headroom, unexpected findings where speculative decoding improved speed by 26% contrary to benchmarks, and pitfalls with Windows and CPU bottlenecks.