两块旧款RTX 2080 Ti，每块22GB显存，运行Qwen3.6 27B，使用f16 KV缓存达到38 token/s

Reddit r/LocalLLaMA 2026/05/15 11:48 工具

rtx-2080-ti llama.cpp qwen-3.6 llm-inference self-hosting gpu-vram-mod docker

摘要

一位用户分享其配置：使用两块改装版RTX 2080 Ti GPU（每块22GB显存）通过llama.cpp以38 token/s运行Qwen 3.6 27B，并包含关于功耗限制、张量分割模式和KV缓存设置的技巧。

请记住，我的两张显卡都限制功耗为150W（我讨厌噪音） \------- 只是想分享一下我当前的设置，可能对某些用户有帮助... services: llama-server: image: ghcr.io/ggml-org/llama.cpp:full-cuda12-b9128 container_name: llama-server restart: unless-stopped ports: - "16384:8080" volumes: - ./models:/models:ro command: > --server --model /models/Qwen3.6-27B-IQ4_XS-uc.gguf --alias "Qwen3.6 27B" --temp 0.6 --top-p 0.95 --min-p 0.00 --top-k 20 --port 8080 --host 0.0.0.0 --cache-type-k f16 --cache-type-v f16 --fit on --presence-penalty 1.32 --repeat-penalty 1.0 --jinja --chat-template-file /models/Qwen3.6.jinja --mmproj /models/Qwen3.6-27B-mmproj-BF16.gguf --webui --spec-default --chat-template-kwargs '{"preserve_thinking": true}' --reasoning-budget 8192 --reasoning-budget-message "... thinking budget exceeded, let's answer now.\n" --split-mode tensor user: "1000:1000" deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] environment: - NVIDIA_VISIBLE_DEVICES=all 这是我确切的配置，我的两块非常老的2080Ti显卡在中国升级为每块22GB显存... 我在eBay上买了一个NVLINK桥（我不建议购买，因为没有任何可测量的差异）我运行的量化是IQ4\_XS。如果我将KV缓存改为q8\_0，在长时间的编码会话中有时会出现模型循环问题，这就是我运行kv-cache@f16的原因，自那以后再也没有出现过这个问题。我使用的是hauhaucs的qwen3.6无审查模型，采用IQ4矩阵量化。你还可以忽略MTP，因为这些显卡是计算密集型而非带宽密集型。最大的提升来自于--split-mode tensor，这使速度从14 token/s提高到38 token/s。我认为如果没有功耗限制，我们应该能达到45 token/s。另外我从未考虑过的是--fit on... 我一直手动声明上下文长度，效果很好，但看起来始终使用95%显存并不是个好主意。--fit on也略微提升了token生成速度。顺便说一下，这套配置不到1000美元，峰值功耗400瓦，与hermes和opencode配合得很好。我使用的jinja模板是这个： [https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates](https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates) (在这个设置中模板11，我还没有测试更新的模板) https://preview.redd.it/gasb8yo8ga1h1.png?width=476&format=png&auto=webp&s=0450efcae279b0bcbd33f9d6d4f7241d8e3581d4

查看原文

两块旧款RTX 2080 Ti，每块22GB显存，运行Qwen3.6 27B，使用f16 KV缓存达到38 token/s

相似文章

@rumgewieselt：现在变得疯狂了……三块 1080 Ti（Pascal架构，33GB VRAM）Qwen 3.6 27B MTP 搭配 196K TurboQuant，持续 ~28-30 t/s

在 12GB 显存下，使用 Qwen3.6 35B A3B 与 llama.cpp MTP 实现 80 tok/sec 的速度和 128K 上下文

RTX Pro 4500 Blackwell - Qwen 3.6 27B？

8GB 显存跑 Qwen3.6 35B MoE 的 llama-server 配置 + 我踩的 max_tokens / thinking 陷阱

大家在 Qwen3.6 27b 上跑出来的速度是多少？

提交意见反馈