MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant)

Reddit r/LocalLLaMA News

Summary

Benchmark results for running Qwen 3.6 27B on AMD MI50 GPUs using a custom vllm fork, achieving 52.8 tokens/s TG and 1569 tokens/s PP without quantization or MTP, demonstrating usability for agentic tasks on 2018 hardware.

**TL;DR** Results from the title are for single inference with 2 prompt of 1k and 15k tokens. So no MTP (as it’s slower for big prompt), no DFlash (working too but slower for big prompt), no quant used (full precision wanted) and the results are pretty good for a 2018 card. (Bench has been done with TP8, but the model not quantized fits also with TP2 and works pretty fast too, around 34 tps TG) **IMO, fully usable with Claude Code or Hermes or any other agentic harness.** I think there’s still room to go higher (by updating the software & hardware stacks, eg. use of pcie switch with lower latency, more optimized dflash/mtp without overhead for rocm/gfx906, etc) **Inference engine used (vllm fork v0.20.1 with rocm7.2.1)**: [https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main](https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main) **Huggingface Quants used:** *Qwen/Qwen3.6-27B* **Main commands to run**: docker run -it --name vllm-gfx906-mobydick -v /llm:/llm --network host --device=/dev/kfd --device=/dev/dri --group-add video --group-add $(getent group render | cut -d: -f3) --ipc=host aiinfos/ vllm-gfx906-mobydick:v0.20.1rc0.x-rocm7.2.1-pytorch2.11.0 FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG vllm serve \ /llm/models/Qwen3.6-27B \ --served-model-name Qwen3.6-27B \ --dtype float16 \ --max-model-len auto \ --max-num-batched-tokens 8192 \ --block-size 64 \ --gpu-memory-utilization 0.98 \ --enable-auto-tool-choice \ --tool-call-parser qwen3_coder \ --reasoning-parser qwen3 \ --mm-processor-cache-gb 1 \ --limit-mm-per-prompt.image 1 --limit-mm-per-prompt.video 1 --skip-mm-profiling \ --default-chat-template-kwargs '{"min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}' \ --tensor-parallel-size 8 \ --host 0.0.0.0 \ --port 8000 2>&1 | tee log.txt FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG vllm bench serve \ --dataset-name random \ --random-input-len 10000 \ --random-output-len 1000 \ --num-prompts 4 \ --request-rate 10000 \ --ignore-eos 2>&1 | tee logb.txt **RESULTS:** ============ Serving Benchmark Result ============ Successful requests: 4 Failed requests: 0 Request rate configured (RPS): 10000.00 Benchmark duration (s): 121.54 Total input tokens: 40000 Total generated tokens: 4000 Request throughput (req/s): 0.03 Output token throughput (tok/s): 32.91 Peak output token throughput (tok/s): 56.00 Peak concurrent requests: 4.00 Total token throughput (tok/s): 362.03 ---------------Time to First Token---------------- Mean TTFT (ms): 32874.56 Median TTFT (ms): 35622.63 P99 TTFT (ms): 47843.84 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 88.66 Median TPOT (ms): 85.94 P99 TPOT (ms): 108.67 ---------------Inter-token Latency---------------- Mean ITL (ms): 88.66 Median ITL (ms): 73.61 P99 ITL (ms): 74.26 ==================================================
Original Article

Similar Articles

More Qwen3.6-27B MTP success but on dual Mi50s

Reddit r/LocalLLaMA

The article benchmarks the Qwen3.6-27B model using Multi-Token Prediction (MTP) and tensor parallelism on dual Mi50 GPUs, demonstrating significant speedups via llama.cpp.

Qwen3.5-122B-Q5-MTP - Qwen3.5-122B-Q6-MTP

Reddit r/LocalLLaMA

Benchmark comparison of Qwen3.5-122B Q5 and Q6 quantized models using llama.cpp with multi-token prediction on Strix Halo, showing throughput of 20.24 t/s and 17.17 t/s respectively.

Qwen 3.6 benchmarks on 2x RTX PRO 6000

Reddit r/LocalLLaMA

Benchmarks for Qwen 3.6 27B and 35B models on dual RTX PRO 6000 GPUs using VLLM, showing generation throughput up to 3500 tokens per second.