Hi all, I wanted to share a setup that’s working for me with **Qwen3.6-35B-A3B** on a laptop **RTX 4060 (8GB VRAM) + 96GB RAM**. This is **not** an interactive chat setup. I’m using it as a **coding subagent** inside an agentic pipeline, so some of the choices below are specific to that use case.

**TL;DR**

* Qwen3.6 35B A3B runs fine on 8GB VRAM + RAM as a coding subagent
* my real bug was not a crash: unlimited thinking consumed the whole `max_tokens` budget
* disabling thinking fixed it
* better fix: use per-request `thinking_budget_tokens`
* open question: best `n-cpu-moe` split on 8GB

# Hardware / runtime

* GPU: RTX 4060 Laptop, 8GB VRAM
* RAM: 96GB DDR5
* Runtime: llama-server
* Model: Qwen3.6-35B-A3B GGUF
* Use case: coding subagent / structured pipeline work

# Current server command

```
llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  --n-cpu-moe 99 \
  -c 50000 \
  -np 1 \
  -fa on \
  --cache-type-k q8_0 \
  --cache-type-v turbo2 \
  --no-mmap \
  --mlock \
  --ctx-checkpoints 1 \
  --cache-ram 0 \
  --jinja \
  --reasoning on \
  --reasoning-budget -1 \
  -b 2048 \
  -ub 2048
```

**PowerShell env:**

```powershell
$env:LLAMA_SET_ROWS = "1"
$env:LLAMA_CHAT_TEMPLATE_KWARGS = '{"preserve_thinking":true}'
```

# Notes on the non-obvious choices

* `--n-cpu-moe 99`: on 8GB VRAM, I’m currently pushing MoE layers to CPU. This is partly based on my own constraints and partly on community tuning discussions, not on official guidance.
* `-np 1`: this is a single-user / single-agent setup, so I don’t want extra slots wasting RAM.
* `-b 2048 -ub 2048`: in my tests this gave noticeably better prefill on prompts above ~2K tokens than lower defaults.
* `LLAMA_SET_ROWS=1`: community tip, easy to try, seems worth keeping.
* `preserve_thinking: true`: I’m using this because Qwen3.6 explicitly supports it, and for agent workflows it helps keep prior reasoning in cache instead of re-deriving everything every turn.

# Important distinction: official vs empirical

A few things here are **officially documented** for Qwen3.6:

* `enable_thinking`
* `preserve_thinking`
* thinking mode being on by default
* recommended sampling presets for coding / thinking / non-thinking use

Other parts of this config are just **my current best empirical setup** or **community-derived tuning**, especially around MoE placement, KV config, and batch / ubatch choices. So I’m posting this as **“working setup + observations”**, not as a universal best config.

# The trap I ran into: thinking can eat the whole output budget

What initially looked like a weird bug turned out to be a budgeting issue. I’m calling llama-server through the OpenAI-compatible API with `chat.completions.create`, and I was setting `max_tokens` per request.

With:

* `--reasoning on`
* `--reasoning-budget -1`
* moderately large prompts
* coding tasks that invite long internal reasoning

…the model could spend the entire output budget on thinking and return no useful visible answer. In practice I saw cases like this:

|max_tokens|thinking|finish_reason|visible code output|elapsed|
|:-|:-|:-|:-|:-|
|6000|ON|`length`|empty / unusable|~190s|
|10000|ON|`length`|empty / unusable|~330s|
|5000|OFF|`stop`|~3750 tokens of clean code|~126s|

So for some coding tasks, the model wasn’t “failing” in the classic sense. It was just burning the whole budget on reasoning.
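On the detection side, the signature is pretty clean: `finish_reason == "length"` together with empty visible content. Here’s a minimal sketch of that check (OpenAI Python SDK against llama-server’s OpenAI-compatible endpoint; the base URL, model name, and prompt are placeholders, not my actual pipeline code):

```python
# Sketch: detect responses where the whole max_tokens budget went to reasoning.
# Assumes the OpenAI Python SDK pointed at llama-server's OpenAI-compatible endpoint;
# base URL, model name, and prompt are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def budget_exhausted(resp) -> bool:
    """True if generation hit max_tokens and left no usable visible output."""
    choice = resp.choices[0]
    visible = (choice.message.content or "").strip()
    return choice.finish_reason == "length" and not visible

resp = client.chat.completions.create(
    model="qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "Refactor this function to remove the global state."}],
    max_tokens=6000,
)

if budget_exhausted(resp):
    # Retry with thinking disabled, or with a bounded thinking budget
    # (see the per-request fix in the next section).
    ...
```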
# The useful part: there is a per-request fix

I originally thought reasoning budget might only be controllable server-side. But llama-server supports a per-request field:

```json
{
  "thinking_budget_tokens": 1500
}
```

As I understand it, this works **if you did not already fix the reasoning budget via CLI**. So the cleaner approach for my use case is probably:

* don’t hardcode a global reasoning budget if I want request-level control
* disable thinking for straightforward refactors
* use bounded thinking for tasks that genuinely benefit from it

(There’s a sketch of the client-side call at the end of the post.)

# My current rule of thumb

Right now I’m leaning toward:

|Task type|Thinking|My current view|
|:-|:-|:-|
|Clear refactor from precise spec|OFF|better throughput, less token waste|
|Moderately ambiguous coding|ON, but bounded|probably best with request-level budget|
|Architecture / design tradeoffs|ON|worth the cost|
|Fixed-schema extraction / structured transforms|OFF|schema does most of the work|

# One more thing: soft switching thinking

For Qwen3.6, I would not rely on `/think` or `/nothink` style prompting as if it were the official control surface. The documented path is `chat_template_kwargs`, especially `enable_thinking: false` when you want non-thinking mode. So my current plan is to switch modes that way instead of prompt-hacking it.

# What I’d love feedback on

1. `--n-cpu-moe` **on 8GB VRAM**: has anyone found a better split than “just shove everything to CPU” on this class of hardware?
2. `-b` **/** `-ub` **tuning for very long prompts**: 2048 looks good for me so far, but I’d love data points from people pushing 50K+ context regularly.
3. **KV config with Qwen3.6 in practice**: I’m using `turbo2` right now based on community findings and testing. Curious what others ended up with.
4. **Thinking policy for agentic coding**: if you use Qwen3.6 locally as a coding worker, when do you keep thinking on vs force it off?

Happy to share more details if useful. This is part of a local knowledge-compiler / project-memory pipeline, so I care a lot more about reliable structured output than about chat UX.
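For concreteness, here’s roughly what the request side looks like for two rows of the rule-of-thumb table. This is a sketch with the OpenAI Python SDK: I’m assuming fields passed via `extra_body` are forwarded to llama-server unchanged, the base URL / model name / prompts are placeholders, and `thinking_budget_tokens` behaves as described above, which I haven’t verified beyond my own testing.

```python
# Sketch: per-request thinking control against llama-server's OpenAI-compatible API.
# Assumes fields passed via extra_body are forwarded to the server unchanged;
# base URL, model name, and prompts are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

# Moderately ambiguous coding task: keep thinking on, but bounded per request.
bounded = client.chat.completions.create(
    model="qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "Design a retry wrapper around this flaky API client."}],
    max_tokens=6000,
    extra_body={"thinking_budget_tokens": 1500},
)

# Clear refactor from a precise spec: thinking off via the documented template kwarg.
plain = client.chat.completions.create(
    model="qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "Rename config.load_all to config.load and update callers."}],
    max_tokens=5000,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```

The second call switches modes through `chat_template_kwargs` rather than `/nothink`, matching the soft-switching note above.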