A detailed guide for running the 35B-parameter Qwen3.6 model locally on Apple Silicon with llama.cpp to power the pi coding agent, including optimized configuration flags and sampling parameters.
# Hardware |Component|Details| |:-|:-| |**Machine**|MacBook Pro (Mac14,6)| |**Chip**|Apple M2 Max — 12-core CPU (8P + 4E)| |**Memory**|64 GB unified memory| |**Storage**|512 GB SSD| |**OS**|macOS 15.7 (Sequoia)| # AI Agent Setup I'm using the [**pi coding agent**](https://github.com/badlogic/pi-mono/tree/main/packages/coding-agent) as my primary development assistant. It's a local-first AI coding agent that connects to local models via llama.cpp. **Model:** `Qwen3.6-35B-A3B` (running via llama.cpp) # How pi Connects to llama-server The pi agent communicates with llama-server via the OpenAI-compatible API. Configuration lives in `~/.pi/agent/models.json`: { "providers": { "llama-cpp": { "baseUrl": "http://127.0.0.1:8080/v1", "api": "openai-completions", "apiKey": "ignored", "models": [{ "id": "Qwen3.6-35B-A3B", "contextWindow": 131072, "maxTokens": 32768 }] } } } # The Command llama-server \ -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q5_K_XL \ -c 131072 \ -n 32768 \ --no-context-shift \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --repeat-penalty 1.00 \ --presence-penalty 0.00 \ --chat-template-kwargs '{"preserve_thinking": true}' \ --batch-size 4096 \ --ubatch-size 4096 # Parameter Breakdown |Flag|Value|Why| |:-|:-|:-| |`-hf`|`unsloth/...:UD-Q5_K_XL`|HuggingFace model repo with unsloth's custom UD quantization — good quality/size tradeoff (\~19 GB)| |`-c 131072`|128K context|This model supports a massive context window — set it high for long documents or extended conversations| |`-n 32768`|32K output tokens|Allows long single-turn generations without hitting the generation limit| |`--no-context-shift`|Off|Prevents context shifting during generation — keeps long responses coherent| |`--chat-template-kwargs`|`preserve_thinking: true`|Keeps the model's reasoning/thinking blocks intact in the output| |`--batch-size 4096`|4096|Logical batch size — higher = faster prompt processing, needs more memory| |`--ubatch-size 4096`|4096|Physical batch size — kept equal to logical batch for consistency| # Sampling Parameters The sampling parameters (`--temp`, `--top-p`, `--top-k`, `--repeat-penalty`, `--presence-penalty`) are taken directly from [unsloth's recommended config for Qwen3.6](https://unsloth.ai/docs/models/qwen3.6). I use these as-is since they're the official recommendations from the model's creators and produce good results out of the box.
The article provides a detailed tutorial on setting up a local coding agent using Qwen3.6-27B via Unsloth Studio and the Pi coding harness. It highlights the benefits of using GGUF quantized models for efficient inference on consumer hardware like Apple Silicon Macs.
Qwen 3.6 27B is praised as a powerful local AI model that outperforms expectations for general intelligence, suitable for practical tasks like code generation, and runs easily with llama.cpp.
The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.
A user details their setup running Qwen 27B with llama.cpp on an RTX PRO 6000 Blackwell for local coding agents, compares performance to Claude models, and asks for help resolving frequent crashes and malformed response issues.
The article discusses running the Qwen 3.6 35b Mixture-of-Experts model locally on an Apple M1 Max Mac using Zoo Code, highlighting its capabilities as a battery-powered coding assistant.