A user shares a configuration of 4x RTX 5060 Ti 16GB with P2P to run Qwen3.6-27B-FP8 at 55 tok/s with 262K context, highlighting the low cost of about $1800 for single-user inference.
Hey peeps, wanted to share what is possible for folks with an inference-only, single-user use case with about $1700 in GPU cost.

Setup: 4x 5060 Ti (16GB) with P2P

If you are in the US and keep an eye on Facebook Marketplace and places like Slickdeals, you can find some 5060 Ti 16GB models for $425 to $475 used. A giant caveat: this type of configuration is only viable if you're interested strictly in inference.

The vLLM command used:

    export VLLM_SLEEP_WHEN_IDLE=1
    export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
    export VLLM_WORKER_MULTIPROC_METHOD=spawn
    export SAFETENSORS_FAST_GPU=1
    export NCCL_P2P_DISABLE=0
    export NCCL_CUMEM_ENABLE=1
    export CUDA_DEVICE_ORDER=PCI_BUS_ID
    export TORCH_FLOAT32_MATMUL_PRECISION=high
    export PYTORCH_ALLOC_CONF=expandable_segments:True
    # dropped: VLLM_USE_FLASHINFER_MOE_FP8 (dense model), VLLM_TEST_FORCE_FP8_MARLIN (test native FP8 first)

    vllm serve /data/models/Qwen/Qwen3.6-27B-FP8 \
      --host 0.0.0.0 --port 8080 \
      --tensor-parallel-size 4 \
      --performance-mode interactivity \
      --trust-remote-code \
      --language-model-only \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --reasoning-parser qwen3 \
      --max-model-len 262144 \
      --kv-cache-dtype bfloat16 \
      --max-num-seqs 4 \
      --gpu-memory-utilization 0.92 \
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":3}' \
      --compilation-config '{"max_cudagraph_capture_size":16,"mode":"VLLM_COMPILE"}' \
      --async-scheduling \
      --attention-backend flashinfer \
      --enable-prefix-caching

Benchmark command:

    vllm bench serve \
      --backend vllm \
      --base-url http://localhost:8080 \
      --endpoint /v1/completions \
      --model /data/models/Qwen/Qwen3.6-27B-FP8 \
      --dataset-name random \
      --random-input-len 4096 \
      --random-output-len 1024 \
      --num-prompts 40 \
      --max-concurrency 1 \
      --num-warmups 5 \
      --ignore-eos \
      --seed 1234 \
      --percentile-metrics ttft,tpot,itl,e2el \
      --save-result \
      --result-filename qwen36_c1_4k.json

    ============ Serving Benchmark Result ============
    Successful requests:                     40
    Failed requests:                         0
    Maximum request concurrency:             1
    Benchmark duration (s):                  735.75
    Total input tokens:                      163840
    Total generated tokens:                  40960
    Request throughput (req/s):              0.05
    Output token throughput (tok/s):         55.67
    Peak output token throughput (tok/s):    25.00
    Peak concurrent requests:                2.00
    Total token throughput (tok/s):          278.36
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          4226.91
    Median TTFT (ms):                        4315.47
    P99 TTFT (ms):                           4320.32
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          13.85
    Median TPOT (ms):                        13.44
    P99 TPOT (ms):                           25.61
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           40.91
    Median ITL (ms):                         40.84
    P99 ITL (ms):                            41.59
    ----------------End-to-end Latency----------------
    Mean E2EL (ms):                          18393.49
    Median E2EL (ms):                        17991.18
    P99 E2EL (ms):                           30508.70
    ---------------Speculative Decoding---------------
    Acceptance rate (%):                     65.25
    Acceptance length:                       2.96
    Drafts:                                  13853
    Draft tokens:                            41559
    Accepted tokens:                         27116
    Per-position acceptance (%):
      Position 0:                            78.29
      Position 1:                            64.14
      Position 2:                            53.31
    ==================================================

Note: I forgot I had --max-num-seqs set to 4, but I benchmarked at a concurrency of 1.
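A quick sanity check on the numbers above: the sketch below recomputes the derived metrics from the raw counts in the benchmark output. The formulas are assumptions about how these figures relate (acceptance length taken as one verified token per draft step plus the accepted draft tokens, and per-token time taken as mean ITL divided by acceptance length); they are not copied from vLLM's source, and the function name `derived_metrics` is made up for illustration.

```python
# Raw values copied from the benchmark output above.
DURATION_S = 735.75
INPUT_TOKENS = 163840
OUTPUT_TOKENS = 40960
DRAFTS = 13853
DRAFT_TOKENS = 41559
ACCEPTED_TOKENS = 27116
PER_POSITION_ACCEPTANCE_PCT = [78.29, 64.14, 53.31]
MEAN_ITL_MS = 40.91


def derived_metrics() -> dict:
    """Recompute the derived benchmark figures from the raw counts."""
    # Assumption: each draft step emits 1 verified token plus however many
    # of the 3 speculative tokens were accepted.
    acceptance_length = 1 + ACCEPTED_TOKENS / DRAFTS
    return {
        "output_tps": OUTPUT_TOKENS / DURATION_S,
        "total_tps": (INPUT_TOKENS + OUTPUT_TOKENS) / DURATION_S,
        "acceptance_rate_pct": 100 * ACCEPTED_TOKENS / DRAFT_TOKENS,
        "acceptance_length": acceptance_length,
        # Same quantity estimated from the per-position acceptance rates.
        "acceptance_length_from_positions": 1 + sum(PER_POSITION_ACCEPTANCE_PCT) / 100,
        # Assumption: one ITL interval delivers `acceptance_length` tokens,
        # so the effective per-token time is ITL / acceptance_length.
        "ms_per_token": MEAN_ITL_MS / acceptance_length,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the recomputed values line up with the reported 55.67 tok/s, 278.36 tok/s, 65.25% acceptance rate and 2.96 acceptance length. The per-token estimate of roughly 13.8 ms is close to the reported mean TPOT of 13.85 ms, which would explain why the 40.91 ms ITL is about three times the TPOT: each decode step returns several tokens at once.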
A user shares their setup using two modded RTX 2080 Ti GPUs with 22GB VRAM each to run Qwen 3.6 27B at 38 tokens/s with llama.cpp, including tips on power limiting, tensor split mode, and KV cache settings.
A user demonstrates successful local inference of a 27B parameter Qwen model across three GTX 1080 Ti GPUs, achieving approximately 28-30 tokens per second using TurboQuant optimization.
The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.
A user reports achieving 125 tokens per second running Qwen3.6 q4xl on two RTX 4060 Ti GPUs, highlighting excellent performance per dollar and wondering if further optimization can reach 150 tok/s.
A developer achieved 80+ t/s inference on Qwen3.6-27B with 262K context on a single RTX 4090 by combining MTP (Multi-Token Prediction) with TurboQuant's lossless KV cache compression, and shares their implementation fork and technical details.