MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant)

Reddit r/LocalLLaMA 05/13/26, 07:08 PM News

mi50 qwen-3-6 benchmark inference amd-gpu vllm performance

Summary

Benchmark results for running Qwen 3.6 27B on AMD MI50 GPUs using a custom vllm fork, achieving 52.8 tokens/s TG and 1569 tokens/s PP without quantization or MTP, demonstrating usability for agentic tasks on 2018 hardware.

**TL;DR** Results from the title are for single inference with 2 prompt of 1k and 15k tokens. So no MTP (as it’s slower for big prompt), no DFlash (working too but slower for big prompt), no quant used (full precision wanted) and the results are pretty good for a 2018 card. (Bench has been done with TP8, but the model not quantized fits also with TP2 and works pretty fast too, around 34 tps TG) **IMO, fully usable with Claude Code or Hermes or any other agentic harness.** I think there’s still room to go higher (by updating the software & hardware stacks, eg. use of pcie switch with lower latency, more optimized dflash/mtp without overhead for rocm/gfx906, etc) **Inference engine used (vllm fork v0.20.1 with rocm7.2.1)**: [https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main](https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main) **Huggingface Quants used:** *Qwen/Qwen3.6-27B* **Main commands to run**: docker run -it --name vllm-gfx906-mobydick -v /llm:/llm --network host --device=/dev/kfd --device=/dev/dri --group-add video --group-add $(getent group render | cut -d: -f3) --ipc=host aiinfos/ vllm-gfx906-mobydick:v0.20.1rc0.x-rocm7.2.1-pytorch2.11.0 FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG vllm serve \ /llm/models/Qwen3.6-27B \ --served-model-name Qwen3.6-27B \ --dtype float16 \ --max-model-len auto \ --max-num-batched-tokens 8192 \ --block-size 64 \ --gpu-memory-utilization 0.98 \ --enable-auto-tool-choice \ --tool-call-parser qwen3_coder \ --reasoning-parser qwen3 \ --mm-processor-cache-gb 1 \ --limit-mm-per-prompt.image 1 --limit-mm-per-prompt.video 1 --skip-mm-profiling \ --default-chat-template-kwargs '{"min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}' \ --tensor-parallel-size 8 \ --host 0.0.0.0 \ --port 8000 2>&1 | tee log.txt FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG vllm bench serve \ --dataset-name random \ --random-input-len 10000 \ --random-output-len 1000 \ --num-prompts 4 \ --request-rate 10000 \ --ignore-eos 2>&1 | tee logb.txt **RESULTS:** ============ Serving Benchmark Result ============ Successful requests: 4 Failed requests: 0 Request rate configured (RPS): 10000.00 Benchmark duration (s): 121.54 Total input tokens: 40000 Total generated tokens: 4000 Request throughput (req/s): 0.03 Output token throughput (tok/s): 32.91 Peak output token throughput (tok/s): 56.00 Peak concurrent requests: 4.00 Total token throughput (tok/s): 362.03 ---------------Time to First Token---------------- Mean TTFT (ms): 32874.56 Median TTFT (ms): 35622.63 P99 TTFT (ms): 47843.84 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 88.66 Median TPOT (ms): 85.94 P99 TPOT (ms): 108.67 ---------------Inter-token Latency---------------- Mean ITL (ms): 88.66 Median ITL (ms): 73.61 P99 ITL (ms): 74.26 ==================================================

Original Article

MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant)

Similar Articles

More Qwen3.6-27B MTP success but on dual Mi50s

8-16 MI50s Minimax M3 @19 tps TG (peak)

Qwen3.5-122B-Q5-MTP - Qwen3.5-122B-Q6-MTP

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Qwen 3.6 benchmarks on 2x RTX PRO 6000

Submit Feedback

Similar Articles

More Qwen3.6-27B MTP success but on dual Mi50s

8-16 MI50s Minimax M3 @19 tps TG (peak)
Reports a peak throughput of 19 tokens per second for the Minimax M3 model running on 8-16 MI50 GPUs.

Qwen3.5-122B-Q5-MTP - Qwen3.5-122B-Q6-MTP

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Qwen 3.6 benchmarks on 2x RTX PRO 6000
Benchmarks for Qwen 3.6 27B and 35B models on dual RTX PRO 6000 GPUs using VLLM, showing generation throughput up to 3500 tokens per second.