$1,800 (GPU cost, running Qwen/Qwen3.6-27b-FP8 with P2P, 262K context, BF16 KV cache, 55 tok/s)

Reddit r/LocalLLaMA News

Summary

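A user shares a setup that runs Qwen3.6-27B-FP8 on four RTX 5060 Ti 16GB cards (with P2P support), reaching about 55 tok/s with a 262K context window configured, and stresses that the GPU cost for single-user inference is only about $1,800.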

Hey everyone, I wanted to share what you can do with $1,700 in GPU cost for an inference-only, single-user setup.

Setup: 4x 5060 Ti (16GB) with P2P support.

If you're in the US, keep an eye on Facebook Marketplace and sites like Slickdeals; you can find used 5060 Ti 16GB cards for $425 to $475. One important caveat: this setup is only suitable for pure inference workloads.

The VLLM Command Used:

    export VLLM_SLEEP_WHEN_IDLE=1
    export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
    export VLLM_WORKER_MULTIPROC_METHOD=spawn
    export SAFETENSORS_FAST_GPU=1
    export NCCL_P2P_DISABLE=0
    export NCCL_CUMEM_ENABLE=1
    export CUDA_DEVICE_ORDER=PCI_BUS_ID
    export TORCH_FLOAT32_MATMUL_PRECISION=high
    export PYTORCH_ALLOC_CONF=expandable_segments:True
    # dropped: VLLM_USE_FLASHINFER_MOE_FP8 (dense model), VLLM_TEST_FORCE_FP8_MARLIN (test native FP8 first)

    vllm serve /data/models/Qwen/Qwen3.6-27B-FP8 \
      --host 0.0.0.0 --port 8080 \
      --tensor-parallel-size 4 \
      --performance-mode interactivity \
      --trust-remote-code \
      --language-model-only \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --reasoning-parser qwen3 \
      --max-model-len 262144 \
      --kv-cache-dtype bfloat16 \
      --max-num-seqs 4 \
      --gpu-memory-utilization 0.92 \
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":3}' \
      --compilation-config '{"max_cudagraph_capture_size":16,"mode":"VLLM_COMPILE"}' \
      --async-scheduling \
      --attention-backend flashinfer \
      --enable-prefix-caching

Benchmark Command:

    vllm bench serve \
      --backend vllm \
      --base-url http://localhost:8080 \
      --endpoint /v1/completions \
      --model /data/models/Qwen/Qwen3.6-27B-FP8 \
      --dataset-name random \
      --random-input-len 4096 \
      --random-output-len 1024 \
      --num-prompts 40 \
      --max-concurrency 1 \
      --num-warmups 5 \
      --ignore-eos \
      --seed 1234 \
      --percentile-metrics ttft,tpot,itl,e2el \
      --save-result \
      --result-filename qwen36_c1_4k.json

    ============ Serving Benchmark Result ============
    Successful requests:                     40
    Failed requests:                         0
    Maximum request concurrency:             1
    Benchmark duration (s):                  735.75
    Total input tokens:                      163840
    Total generated tokens:                  40960
    Request throughput (req/s):              0.05
    Output token throughput (tok/s):         55.67
    Peak output token throughput (tok/s):    25.00
    Peak concurrent requests:                2.00
    Total token throughput (tok/s):          278.36
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          4226.91
    Median TTFT (ms):                        4315.47
    P99 TTFT (ms):                           4320.32
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          13.85
    Median TPOT (ms):                        13.44
    P99 TPOT (ms):                           25.61
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           40.91
    Median ITL (ms):                         40.84
    P99 ITL (ms):                            41.59
    ----------------End-to-end Latency----------------
    Mean E2EL (ms):                          18393.49
    Median E2EL (ms):                        17991.18
    P99 E2EL (ms):                           30508.70
    ---------------Speculative Decoding---------------
    Acceptance rate (%):                     65.25
    Acceptance length:                       2.96
    Drafts:                                  13853
    Draft tokens:                            41559
    Accepted tokens:                         27116
    Per-position acceptance (%):
      Position 0:                            78.29
      Position 1:                            64.14
      Position 2:                            53.31
    ==================================================

Note: I forgot to set --max-num-seqs to 4, but I ran the benchmark with a concurrency of 1.
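As a sanity check, the headline figures in the benchmark output can be recomputed from its raw counters. The Python sketch below is not part of the original post. The function name `derive_metrics` is hypothetical, and two of the formulas are assumptions rather than confirmed vLLM definitions: that acceptance length is one target-model token per draft step plus the accepted draft tokens, and that end-to-end latency is roughly TTFT plus TPOT for each remaining output token.

```python
# Recompute the reported benchmark figures from the raw counters.
# All input values are copied from the "Serving Benchmark Result" output.

def derive_metrics(duration_s, input_tokens, output_tokens,
                   drafts, draft_tokens, accepted_tokens,
                   mean_ttft_ms, mean_tpot_ms, output_len):
    """Return derived throughput, speculative-decoding, and latency figures."""
    return {
        # Generated tokens divided by wall-clock benchmark duration.
        "output_tok_s": output_tokens / duration_s,
        # Prompt plus generated tokens divided by the same duration.
        "total_tok_s": (input_tokens + output_tokens) / duration_s,
        # Share of proposed draft tokens that the target model accepted.
        "acceptance_rate_pct": 100.0 * accepted_tokens / draft_tokens,
        # Assumption: each draft step yields 1 target token plus accepted drafts.
        "acceptance_length": 1.0 + accepted_tokens / drafts,
        # Assumption: E2EL is about TTFT + (output_len - 1) * TPOT.
        "e2el_ms_estimate": mean_ttft_ms + (output_len - 1) * mean_tpot_ms,
        # Decode-only rate implied by TPOT, ignoring prefill time.
        "decode_only_tok_s": 1000.0 / mean_tpot_ms,
    }

metrics = derive_metrics(
    duration_s=735.75,
    input_tokens=163840,
    output_tokens=40960,
    drafts=13853,
    draft_tokens=41559,
    accepted_tokens=27116,
    mean_ttft_ms=4226.91,
    mean_tpot_ms=13.85,
    output_len=1024,
)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the recomputed throughput, acceptance rate, and acceptance length match the reported 55.67 tok/s, 278.36 tok/s, 65.25%, and 2.96, and the latency estimate lands within a few milliseconds of the reported mean E2EL. The 55 tok/s figure also includes the roughly 4.2 s of prefill per request; the decode-only rate implied by the mean TPOT is closer to 72 tok/s.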