$1800 in GPU cost: running Qwen/Qwen3.6-27B-FP8 with P2P, 262K context, and BF16 KV cache at 55 tok/s

Reddit r/LocalLLaMA News

Summary

A user shares a configuration of 4x RTX 5060 Ti 16GB with P2P to run Qwen3.6-27B-FP8 at 55 tok/s with 262K context, highlighting the low cost of about $1800 for single-user inference.

Hey peeps, wanted to share what is possible for folks with an inference-only, single-user use case with $1700 in GPU cost.

Setup: 4x 5060 Ti (16GB) with P2P

If you are in the US and keep an eye on Facebook Marketplace and places like Slickdeals, you can find some 5060 Ti 16GB models for $425 to $475 used. A giant caveat: this type of configuration is only viable if you're interested strictly in inference.

The vLLM Command Used:

    export VLLM_SLEEP_WHEN_IDLE=1
    export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
    export VLLM_WORKER_MULTIPROC_METHOD=spawn
    export SAFETENSORS_FAST_GPU=1
    export NCCL_P2P_DISABLE=0
    export NCCL_CUMEM_ENABLE=1
    export CUDA_DEVICE_ORDER=PCI_BUS_ID
    export TORCH_FLOAT32_MATMUL_PRECISION=high
    export PYTORCH_ALLOC_CONF=expandable_segments:True
    # dropped: VLLM_USE_FLASHINFER_MOE_FP8 (dense model), VLLM_TEST_FORCE_FP8_MARLIN (test native FP8 first)

    vllm serve /data/models/Qwen/Qwen3.6-27B-FP8 \
      --host 0.0.0.0 --port 8080 \
      --tensor-parallel-size 4 \
      --performance-mode interactivity \
      --trust-remote-code \
      --language-model-only \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --reasoning-parser qwen3 \
      --max-model-len 262144 \
      --kv-cache-dtype bfloat16 \
      --max-num-seqs 4 \
      --gpu-memory-utilization 0.92 \
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":3}' \
      --compilation-config '{"max_cudagraph_capture_size":16,"mode":"VLLM_COMPILE"}' \
      --async-scheduling \
      --attention-backend flashinfer \
      --enable-prefix-caching

Benchmark Command:

    vllm bench serve --backend vllm \
      --base-url http://localhost:8080 \
      --endpoint /v1/completions \
      --model /data/models/Qwen/Qwen3.6-27B-FP8 \
      --dataset-name random \
      --random-input-len 4096 --random-output-len 1024 \
      --num-prompts 40 --max-concurrency 1 --num-warmups 5 \
      --ignore-eos --seed 1234 \
      --percentile-metrics ttft,tpot,itl,e2el \
      --save-result --result-filename qwen36_c1_4k.json

    ============ Serving Benchmark Result ============
    Successful requests:                     40
    Failed requests:                         0
    Maximum request concurrency:             1
    Benchmark duration (s):                  735.75
    Total input tokens:                      163840
    Total generated tokens:                  40960
    Request throughput (req/s):              0.05
    Output token throughput (tok/s):         55.67
    Peak output token throughput (tok/s):    25.00
    Peak concurrent requests:                2.00
    Total token throughput (tok/s):          278.36
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          4226.91
    Median TTFT (ms):                        4315.47
    P99 TTFT (ms):                           4320.32
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          13.85
    Median TPOT (ms):                        13.44
    P99 TPOT (ms):                           25.61
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           40.91
    Median ITL (ms):                         40.84
    P99 ITL (ms):                            41.59
    ----------------End-to-end Latency----------------
    Mean E2EL (ms):                          18393.49
    Median E2EL (ms):                        17991.18
    P99 E2EL (ms):                           30508.70
    ---------------Speculative Decoding---------------
    Acceptance rate (%):                     65.25
    Acceptance length:                       2.96
    Drafts:                                  13853
    Draft tokens:                            41559
    Accepted tokens:                         27116
    Per-position acceptance (%):
      Position 0:                            78.29
      Position 1:                            64.14
      Position 2:                            53.31
    ==================================================

note: I forgot I had --max-num-seqs at 4, but I benchmarked with a concurrency of 1.
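
For anyone wanting to sanity-check how the reported numbers relate to each other, here is a minimal Python sketch that recomputes the derived metrics from the raw counts in the result block. The variable names are made up for illustration. The acceptance-length formula (one target-model token plus the accepted draft tokens per draft step) and the ITL-to-TPOT relation are my reading of the figures, not something the post states.

```python
# Raw figures copied from the "Serving Benchmark Result" block in the post.
duration_s = 735.75
input_tokens = 163840
generated_tokens = 40960
drafts = 13853
draft_tokens = 41559
accepted_tokens = 27116
mean_itl_ms = 40.91

# Throughput is tokens divided by wall-clock benchmark duration.
output_tps = generated_tokens / duration_s
total_tps = (input_tokens + generated_tokens) / duration_s

# Share of proposed draft tokens that the target model accepted.
acceptance_rate_pct = 100 * accepted_tokens / draft_tokens

# Assumption: each draft step yields the accepted draft tokens plus one
# token produced by the target model itself.
acceptance_length = 1 + accepted_tokens / drafts

# Assumption: with speculative decoding, tokens arrive in bursts, so the
# per-token time is roughly the inter-burst latency divided by burst size.
tpot_estimate_ms = mean_itl_ms / acceptance_length

print(f"output tok/s:      {output_tps:.2f}")
print(f"total tok/s:       {total_tps:.2f}")
print(f"acceptance rate %: {acceptance_rate_pct:.2f}")
print(f"acceptance length: {acceptance_length:.2f}")
print(f"estimated TPOT ms: {tpot_estimate_ms:.2f}")
```

Under these assumptions the recomputed throughput, acceptance rate, and acceptance length line up with the reported values, and the estimated TPOT lands close to the reported mean of 13.85 ms. The draft token count is also exactly three times the number of drafts, which matches num_speculative_tokens set to 3 in the serve command.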
