1000 tps generation on Qwen3.6 27B with V100s

Reddit r/LocalLLaMA Models

Summary

Achieved 1000 tokens per second of generation throughput on Qwen3.6 27B using V100 GPUs with 128 concurrent requests, and about 80 t/s for a single user.

I wanted to see what the absolute best-case scenario for generation on this setup was, and I was not disappointed. 128 concurrent requests is far more than I need, but it’s fun to see a big number. For a single user (batch size 1, not 128), generation is around 80 t/s with 3000 t/s prompt processing, and that is with no MTP!!
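
To put the two figures side by side, here is a rough back-of-the-envelope sketch in Python. It assumes the 1000 t/s number is the aggregate across all 128 requests (the post does not say so explicitly, but that is how batched throughput is usually reported). The function and variable names are made up for illustration.

```python
# Rough comparison of the batched and single-user figures from the post.
# Assumption: 1000 t/s is the total across all 128 concurrent requests.

def per_request_tps(aggregate_tps: float, concurrency: int) -> float:
    """Average generation speed each request sees under batching."""
    return aggregate_tps / concurrency

aggregate_tps = 1000.0   # reported at 128 concurrent requests
concurrency = 128
single_user_tps = 80.0   # reported at batch size 1

per_stream = per_request_tps(aggregate_tps, concurrency)
batching_gain = aggregate_tps / single_user_tps

print(f"per-request rate at batch {concurrency}: {per_stream:.1f} t/s")
print(f"aggregate gain over batch 1: {batching_gain:.1f}x")
```

Under that assumption each of the 128 requests would see roughly 7.8 t/s, while total throughput is about 12.5x the single-user rate, which is why the big number is a best case for the hardware and not a single-user experience.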
