40+ tok/s - optimized recipe for Qwen 3.5 122B Int4 on a single DGX Spark with vLLM
Summary
A user shares an optimized vLLM recipe for running Qwen 3.5 122B Int4 on a single DGX Spark at over 40 tokens per second, and invites others to try it and optimize it further.
Similar Articles
80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP
A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar
A user reports achieving 125 tokens per second running Qwen3.6 q4xl on two RTX 4060 Ti GPUs, highlighting excellent performance per dollar and wondering if further optimization can reach 150 tok/s.
Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM
A version of Qwen3.6 27B quantized with a pure Q4_K_M method fits entirely in 16 GB VRAM, reaches up to 40 tok/s generation speed with MTP, and is significantly smaller than other GGUF variants.
Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post
A Reddit user demonstrates llama.cpp speculative decoding boosting Qwen-3.6-27B generation speed from 13.6 to 136.75 t/s, and shares the exact commands and hardware setup.
Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context
The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.