A user details their setup running Qwen 27B with llama.cpp on an RTX PRO 6000 Blackwell for local coding agents, compares performance to Claude models, and asks for help resolving frequent crashes and malformed response issues.
Our company recently acquired a workstation with an RTX PRO 6000 Blackwell, and we're experimenting with local LLMs to reduce part of our Claude token usage. Right now we’re running Qwen3.6 27B MTP Q8_K_XL with llama.cpp on Windows 11. I've been using both Claude Opus and Sonnet for a while, and my impression is that this model feels somewhat comparable to Sonnet, but a bit weaker and slower. It is definitely better than Haiku for our use case, but not quite at Sonnet level. Opus is still in another class. That said, considering the relatively small parameter count, the model is surprisingly good at reasoning and tool calling. Its main weakness seems to be lack of knowledge. For coding, I would strongly recommend giving it access to tools like Context7 and Serper, or otherwise allowing it to check documentation and search the web. Once we did that, it became much less likely to invent or guess class names, field names, APIs, and similar details. However, we're currently running into major stability issues during coding sessions. We use VS Code with the Copilot extension. Sometimes the agent randomly stops with: I tried debugging the issue, and my current guess is that the model sometimes produces a malformed response, possibly with the wrong thinking format or with the response sections in the wrong order. Copilot then seems to interpret the response as empty. This happens randomly, but quite frequently. Sometimes the llama.cpp executable also crashes outright and terminates mid-session. We're using the latest release, and we even set up a scheduled job to rebuild llama.cpp every morning so we can keep up with updates instead of doing it manually. We switched to the MTP version because it was around 15–20% faster, with quality roughly on par with the non-MTP version. This is our llama.cpp compile command: cmake .. -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_NATIVE=ON -DGGML_LTO=ON -DGGML_CUDA_GRAPHS=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES=120 cmake --build . --config Release --target llama-server llama-bench llama-fit-params llama-cli --parallel We run 4 parallel agents, each with full context. This is our llama.cpp startup command: llama-server.exe -m "D:\DATA\models\Qwen3.6-27B-UD-Q8_K_XL_MTP.gguf" -ngl 99 -lv 4 -fa on -c 1048576 -np 4 -ctk q8_0 -ctv q8_0 --spec-draft-type-k q8_0 --spec-draft-type-v q8_0 --spec-type draft-mtp --spec-draft-n-max 2 --metrics --port 5764 --host 0.0.0.0 -b 8192 -ub 2048 --cache-prompt --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 --presence-penalty 0.0 --repeat-penalty 1.0 --reasoning-format deepseek --chat-template-kwargs "{\"preserve_thinking\":true}" --reasoning on --reasoning-format deepseek --reasoning-budget 8192 Windows and other running programs use around 3 GB of VRAM. Total VRAM usage is roughly 83 GB out of 97 GB. The workstation also has 128 GB of DDR5. This is our custom endpoint configuration in Copilot: { "name": "llama-server", "vendor": "customendpoint", "apiType": "chat-completions", "models": [ { "id": "qwen3-6-27B", "name": "Qwen3.6 27B", "url": "http://192.168.1.1:5764/v1/chat/completions", "toolCalling": true, "vision": false, "streaming": true, "maxInputTokens": 230000, "maxOutputTokens": 16000 } ] } At this point, we're a bit at a loss. This may very well be a skill issue or a lack of understanding on our part about how to properly exploit this hardware. That's why I'm asking here: does anyone with more experience running local coding agents on high-end GPUs have suggestions for improving this setup, especially the stability issues? Thanks in advance to everyone. This sub has been an amazing place to learn and discover new things!
A developer shares local inference benchmarks and systemd configurations for running the Qwen3.6-27B model on an NVIDIA RTX Pro 4500 Blackwell GPU using llama.cpp. The post requests optimization tips for throughput and explores potential use cases for larger models.
A technical guide on setting up local LLM autocomplete (Qwen2.5-Coder-7B) and agentic coding (Qwen3.6-35B-A3B) on a single 16GB GPU with 64GB+ RAM using llama.cpp, including commands and performance benchmarks.
A user shares their experience setting up a dual-GPU local AI lab with RTX 4080 Super and 5060 Ti, running Qwen 3.6 models via llama.cpp and llama-swap to reduce API costs and enable unrestricted experimentation.
A detailed guide for running the 35B-parameter Qwen3.6 model locally on Apple Silicon with llama.cpp to power the pi coding agent, including optimized configuration flags and sampling parameters.