A user shares their setup using two modded RTX 2080 Ti GPUs with 22GB VRAM each to run Qwen 3.6 27B at 38 tokens/s with llama.cpp, including tips on power limiting, tensor split mode, and KV cache settings.
PLEASE KEEP IN MIND BOTH OF MY CARDS ARE POWER LIMITED TO 150W (i hate noise)

---

Just wanted to share my current setup, it might help some users out there...

```yaml
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:full-cuda12-b9128
    container_name: llama-server
    restart: unless-stopped
    ports:
      - "16384:8080"
    volumes:
      - ./models:/models:ro
    command: >
      --server
      --model /models/Qwen3.6-27B-IQ4_XS-uc.gguf
      --alias "Qwen3.6 27B"
      --temp 0.6
      --top-p 0.95
      --min-p 0.00
      --top-k 20
      --port 8080
      --host 0.0.0.0
      --cache-type-k f16
      --cache-type-v f16
      --fit on
      --presence-penalty 1.32
      --repeat-penalty 1.0
      --jinja
      --chat-template-file /models/Qwen3.6.jinja
      --mmproj /models/Qwen3.6-27B-mmproj-BF16.gguf
      --webui
      --spec-default
      --chat-template-kwargs '{"preserve_thinking": true}'
      --reasoning-budget 8192
      --reasoning-budget-message "... thinking budget exceeded, let's answer now.\n"
      --split-mode tensor
    user: "1000:1000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
```

This is my exact config. My two extremely old 2080 Ti GPUs were upgraded in China to 22GB of VRAM each, and on eBay I bought an NVLink bridge (I do not recommend buying one, as it makes no measurable difference).

The quantisation I run is IQ4_XS. If I set the KV cache to q8_0, the model sometimes starts looping during long coding sessions, so I run the KV cache at f16 and have never had that problem since. I use hauhaucs' uncensored Qwen3.6 model in IQ4 imatrix quants. You can also forget about MTP, as with these cards you are compute bound, not bandwidth bound.

The absolute biggest boost came from --split-mode tensor: it took me from 14 tokens/s to 38 tokens/s. I think without the power limit we would get 45 tokens/s.

What I also never thought about is --fit on... I always declared the context length manually, which worked fine, but it looks like it's not a good idea to always run at 95% VRAM consumption. --fit on also improved token generation a little.

Btw, this is a <1k USD setup drawing 400W peak at the wall, and it works great with Hermes and opencode.

The jinja template I use is this one: [https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates](https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates) (in this setup, template 11; I have not yet tested the newer templates).
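For reference on the 150W cap: this is the kind of limit you set with nvidia-smi. A minimal sketch (the GPU indices 0 and 1 are an assumption about this particular box):

```bash
# Keep the driver loaded so the limit sticks until reboot
sudo nvidia-smi -pm 1
# Cap each card (indices 0 and 1 assumed) at 150W
sudo nvidia-smi -i 0 -pl 150
sudo nvidia-smi -i 1 -pl 150
```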
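With the compose file above mapping the server to host port 16384, a quick sanity check is llama.cpp's OpenAI-compatible chat endpoint; a minimal sketch (the prompt is just a placeholder):

```bash
curl http://localhost:16384/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6 27B",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "temperature": 0.6
  }'
```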
A user demonstrates successful local inference of a 27B parameter Qwen model across three GTX 1080 Ti GPUs, achieving approximately 28-30 tokens per second using TurboQuant optimization.
A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
A developer shares local inference benchmarks and systemd configurations for running the Qwen3.6-27B model on an NVIDIA RTX Pro 4500 Blackwell GPU using llama.cpp. The post requests optimization tips for throughput and explores potential use cases for larger models.
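The benchmarks and exact unit file are in the post itself; for readers who have not run llama-server under systemd before, a unit along these lines is the usual shape (binary path, model path, flags, and user here are illustrative assumptions, not the poster's actual config):

```ini
# /etc/systemd/system/llama-server.service -- hypothetical paths and flags
[Unit]
Description=llama.cpp server (Qwen3.6-27B)
After=network-online.target

[Service]
# Binary and model locations are assumptions for illustration
ExecStart=/usr/local/bin/llama-server \
    --model /opt/models/Qwen3.6-27B.gguf \
    --host 0.0.0.0 --port 8080
Restart=on-failure
User=llama

[Install]
WantedBy=multi-user.target
```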
Author shares a working llama-server config to run the 35B-MoE Qwen3.6 model on an 8GB RTX 4060, highlighting a max_tokens trap caused by unconstrained internal reasoning and the fix using per-request thinking_budget_tokens.
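For context, the per-request fix described there would look roughly like this against the server's chat endpoint (the port and exact request shape are assumptions; thinking_budget_tokens is the parameter named in the post):

```bash
# Hypothetical request: caps internal reasoning so it cannot consume
# the whole max_tokens budget before the visible answer begins
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-MoE",
    "messages": [{"role": "user", "content": "Summarize this diff."}],
    "max_tokens": 1024,
    "thinking_budget_tokens": 512
  }'
```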