1000 tps generation on Qwen3.6 27B with V100s
Summary
Achieved 1000 tokens per second generation on Qwen3.6 27B using V100 GPUs with 128 concurrent requests, and 80 t/s for single user.
Similar Articles
125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar
A user reports achieving 125 tokens per second running Qwen3.6 q4xl on two RTX 4060 Ti GPUs, highlighting excellent performance per dollar and wondering if further optimization can reach 150 tok/s.
Qwen 3.6 benchmarks on 2x RTX PRO 6000
Benchmarks for Qwen 3.6 27B and 35B models on dual RTX PRO 6000 GPUs using VLLM, showing generation throughput up to 3500 tokens per second.
What speed is everyone getting on Qwen3.6 27b?
User benchmarks Qwen3.6-27B-Q8_0 at ~13 tokens/sec on 3 mixed GPUs with 128k context via llama.cpp, asking if performance is typical.
MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant)
Benchmark results for running Qwen 3.6 27B on AMD MI50 GPUs using a custom vllm fork, achieving 52.8 tokens/s TG and 1569 tokens/s PP without quantization or MTP, demonstrating usability for agentic tasks on 2018 hardware.
@ItsmeAjayKV: Achievement Unlocked: Running Qwen3.6-27b dense Thanks to the RTX 3090, now I can do this. Running @Alibaba_Qwen Qwen 3…
User benchmarks Qwen3.6-27B on an RTX 3090 using llama.cpp, achieving 35 tok/s generation and 1247 tok/s prompt processing.