Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config

Reddit r/LocalLLaMA 04/22/26, 03:20 AM Tools

Summary

A detailed guide for running the 35B-parameter Qwen3.6 model locally on Apple Silicon with llama.cpp to power the pi coding agent, including optimized configuration flags and sampling parameters.

# Hardware |Component|Details| |:-|:-| |**Machine**|MacBook Pro (Mac14,6)| |**Chip**|Apple M2 Max — 12-core CPU (8P + 4E)| |**Memory**|64 GB unified memory| |**Storage**|512 GB SSD| |**OS**|macOS 15.7 (Sequoia)| # AI Agent Setup I'm using the [**pi coding agent**](https://github.com/badlogic/pi-mono/tree/main/packages/coding-agent) as my primary development assistant. It's a local-first AI coding agent that connects to local models via llama.cpp. **Model:** `Qwen3.6-35B-A3B` (running via llama.cpp) # How pi Connects to llama-server The pi agent communicates with llama-server via the OpenAI-compatible API. Configuration lives in `~/.pi/agent/models.json`: { "providers": { "llama-cpp": { "baseUrl": "http://127.0.0.1:8080/v1", "api": "openai-completions", "apiKey": "ignored", "models": [{ "id": "Qwen3.6-35B-A3B", "contextWindow": 131072, "maxTokens": 32768 }] } } } # The Command llama-server \ -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q5_K_XL \ -c 131072 \ -n 32768 \ --no-context-shift \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --repeat-penalty 1.00 \ --presence-penalty 0.00 \ --chat-template-kwargs '{"preserve_thinking": true}' \ --batch-size 4096 \ --ubatch-size 4096 # Parameter Breakdown |Flag|Value|Why| |:-|:-|:-| |`-hf`|`unsloth/...:UD-Q5_K_XL`|HuggingFace model repo with unsloth's custom UD quantization — good quality/size tradeoff (\~19 GB)| |`-c 131072`|128K context|This model supports a massive context window — set it high for long documents or extended conversations| |`-n 32768`|32K output tokens|Allows long single-turn generations without hitting the generation limit| |`--no-context-shift`|Off|Prevents context shifting during generation — keeps long responses coherent| |`--chat-template-kwargs`|`preserve_thinking: true`|Keeps the model's reasoning/thinking blocks intact in the output| |`--batch-size 4096`|4096|Logical batch size — higher = faster prompt processing, needs more memory| |`--ubatch-size 4096`|4096|Physical batch size — kept equal to logical batch for consistency| # Sampling Parameters The sampling parameters (`--temp`, `--top-p`, `--top-k`, `--repeat-penalty`, `--presence-penalty`) are taken directly from [unsloth's recommended config for Qwen3.6](https://unsloth.ai/docs/models/qwen3.6). I use these as-is since they're the official recommendations from the model's creators and produce good results out of the box.

Original Article

Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config

Similar Articles

@port_dev: https://x.com/port_dev/status/2054259445732110408

Qwen 3.6 27B is the sweet spot for local development

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Help optimizing llama.cpp + Qwen 27B on RTX PRO 6000 Blackwell for coding agents

Running Qwen 3.6 35b MoE With Zoo Code On M1 Max is Amazing! Fully local, battery-powered coding powerhouse!

Submit Feedback

Similar Articles

@port_dev: https://x.com/port_dev/status/2054259445732110408

Qwen 3.6 27B is the sweet spot for local development

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Help optimizing llama.cpp + Qwen 27B on RTX PRO 6000 Blackwell for coding agents

Running Qwen 3.6 35b MoE With Zoo Code On M1 Max is Amazing! Fully local, battery-powered coding powerhouse!