Help optimizing llama.cpp + Qwen 27B on RTX PRO 6000 Blackwell for coding agents

Reddit r/LocalLLaMA 06/26/26, 09:04 AM Tools

llama-cpp qwen optimization coding-agent rtx-pro-6000 troubleshooting windows

Summary

A user details their setup running Qwen 27B with llama.cpp on an RTX PRO 6000 Blackwell for local coding agents, compares performance to Claude models, and asks for help resolving frequent crashes and malformed response issues.

Our company recently acquired a workstation with an RTX PRO 6000 Blackwell, and we're experimenting with local LLMs to reduce part of our Claude token usage. Right now we’re running Qwen3.6 27B MTP Q8_K_XL with llama.cpp on Windows 11. I've been using both Claude Opus and Sonnet for a while, and my impression is that this model feels somewhat comparable to Sonnet, but a bit weaker and slower. It is definitely better than Haiku for our use case, but not quite at Sonnet level. Opus is still in another class. That said, considering the relatively small parameter count, the model is surprisingly good at reasoning and tool calling. Its main weakness seems to be lack of knowledge. For coding, I would strongly recommend giving it access to tools like Context7 and Serper, or otherwise allowing it to check documentation and search the web. Once we did that, it became much less likely to invent or guess class names, field names, APIs, and similar details. However, we're currently running into major stability issues during coding sessions. We use VS Code with the Copilot extension. Sometimes the agent randomly stops with: I tried debugging the issue, and my current guess is that the model sometimes produces a malformed response, possibly with the wrong thinking format or with the response sections in the wrong order. Copilot then seems to interpret the response as empty. This happens randomly, but quite frequently. Sometimes the llama.cpp executable also crashes outright and terminates mid-session. We're using the latest release, and we even set up a scheduled job to rebuild llama.cpp every morning so we can keep up with updates instead of doing it manually. We switched to the MTP version because it was around 15–20% faster, with quality roughly on par with the non-MTP version. This is our llama.cpp compile command: cmake .. -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_NATIVE=ON -DGGML_LTO=ON -DGGML_CUDA_GRAPHS=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES=120 cmake --build . --config Release --target llama-server llama-bench llama-fit-params llama-cli --parallel We run 4 parallel agents, each with full context. This is our llama.cpp startup command: llama-server.exe -m "D:\DATA\models\Qwen3.6-27B-UD-Q8_K_XL_MTP.gguf" -ngl 99 -lv 4 -fa on -c 1048576 -np 4 -ctk q8_0 -ctv q8_0 --spec-draft-type-k q8_0 --spec-draft-type-v q8_0 --spec-type draft-mtp --spec-draft-n-max 2 --metrics --port 5764 --host 0.0.0.0 -b 8192 -ub 2048 --cache-prompt --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 --presence-penalty 0.0 --repeat-penalty 1.0 --reasoning-format deepseek --chat-template-kwargs "{\"preserve_thinking\":true}" --reasoning on --reasoning-format deepseek --reasoning-budget 8192 Windows and other running programs use around 3 GB of VRAM. Total VRAM usage is roughly 83 GB out of 97 GB. The workstation also has 128 GB of DDR5. This is our custom endpoint configuration in Copilot: { "name": "llama-server", "vendor": "customendpoint", "apiType": "chat-completions", "models": [ { "id": "qwen3-6-27B", "name": "Qwen3.6 27B", "url": "http://192.168.1.1:5764/v1/chat/completions", "toolCalling": true, "vision": false, "streaming": true, "maxInputTokens": 230000, "maxOutputTokens": 16000 } ] } At this point, we're a bit at a loss. This may very well be a skill issue or a lack of understanding on our part about how to properly exploit this hardware. That's why I'm asking here: does anyone with more experience running local coding agents on high-end GPUs have suggestions for improving this setup, especially the stability issues? Thanks in advance to everyone. This sub has been an amazing place to learn and discover new things!

Original Article

Help optimizing llama.cpp + Qwen 27B on RTX PRO 6000 Blackwell for coding agents

Similar Articles

RTX Pro 4500 Blackwell - Qwen 3.6 27B?

Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM

@leopardracer: https://x.com/leopardracer/status/2055341758523883631

Best config for Qwen3.6 27b / llama.cpp / opencode

Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config

Submit Feedback

Similar Articles

RTX Pro 4500 Blackwell - Qwen 3.6 27B?

Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM

@leopardracer: https://x.com/leopardracer/status/2055341758523883631

Best config for Qwen3.6 27b / llama.cpp / opencode
Community thread sharing optimized llama.cpp launch commands for running the 27B Qwen3.6 GGUF model with long 100K-512K context on multi-GPU setups.

Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config