@yibie: Using Local Models as Primary Coding Tools: A Practical Report from Mid-2026 There was a post on Hacker News with a straightforward title: "Is anyone using local models as their primary coding tool?" 197 comments, incredibly dense with information. A dozen real users discussed their daily configurations, pitfalls they encountered, and why they still choose local models even though they know they're not as good as...
Summary
This article summarizes practical experiences from a Hacker News discussion about using local models (mainly Qwen 3.6 35B-A3B) as primary coding tools, including configurations, effectiveness (approximately 50-75% of frontier models), key techniques (such as preserve_thinking), and different user positions.
Cached at: 06/18/26, 02:16 PM
Local Models as Primary Coding Tools: A Mid-2026 Field Report
There was a thread on Hacker News with a straightforward title: “Anyone using local models as their primary coding tool?”
197 comments, extremely high information density. A dozen real users discussing their daily configs, pitfalls they hit, and why—despite knowing local models aren’t as good as Claude—they still choose to use them.
I aggregated Vicki Boykis’s hands-on article on local models, llama.cpp author Georgi Gerganov’s HN comments, and the highest-quality replies from that discussion into a mid-2026 practical report on local agentic coding.
I. Consensus Setup
If you want to set up a local agentic coding environment today, the overwhelming consensus from the HN discussion is:
- Model: Qwen 3.6 35B-A3B. 35B total parameters, but MoE architecture activates only 3B. Everyone’s reviews are surprisingly unanimous—it’s the sweet spot. Fast enough (55 tok/s), coding ability sufficient.
- Inference Engine: llama.cpp. No debate.
- Agent Framework: Pi (http://pi.dev). Also no debate. Gerganov himself uses the minimal config pi -nc --offline; others run Pi inside Docker containers with restricted filesystem permissions. Downside: Pi doesn’t ship plan mode, subagents, or an MCP client out of the box, but nobody seems to complain.
- Quantization: Q8. Some tried more aggressive quantization (Q4); the feedback was “loop and edit errors increased significantly.” KV cache recommendation: F16 K + Q8 V.
- Hardware: Mac Studio 128GB or Strix Halo 128GB unified memory laptop. Or RTX 5090. Core constraint: VRAM / unified memory—needs to hold the model plus KV cache.
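The "model plus KV cache" constraint can be estimated with a little arithmetic. The sketch below is a rough budget, not a measurement: the bits-per-weight values are those of llama.cpp's Q8_0 and Q4_0 block formats, while the layer count, KV head count, head dimension, and context length are hypothetical placeholders rather than Qwen 3.6's actual architecture.

```python
# Rough memory budget for a quantized model plus its KV cache.
# Bits-per-weight follow llama.cpp's Q8_0 / Q4_0 block layouts.
# All architecture numbers used at the bottom are HYPOTHETICAL placeholders.

GIB = 1024 ** 3

BITS_PER_VALUE = {"q8_0": 8.5, "q4_0": 4.5, "f16": 16.0}


def weight_bytes(n_params: float, quant: str) -> float:
    """Approximate size of the weights at a given quantization."""
    return n_params * BITS_PER_VALUE[quant] / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int, n_ctx: int,
                   k_type: str = "f16", v_type: str = "q8_0") -> float:
    """K and V each hold n_layers * n_kv_heads * head_dim values per token."""
    values_per_token = n_layers * n_kv_heads * head_dim
    bits_per_position = BITS_PER_VALUE[k_type] + BITS_PER_VALUE[v_type]
    return values_per_token * n_ctx * bits_per_position / 8


# 35B total parameters comes from the article; the rest is assumed.
weights = weight_bytes(35e9, "q8_0")
cache = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=131072)
print(f"weights ~ {weights / GIB:.1f} GiB, KV cache ~ {cache / GIB:.1f} GiB")
```

Note that an MoE model still has to keep all 35B parameters resident even though only 3B are active per token, which is why memory capacity rather than compute is the binding constraint here.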
Boykis’s full config: Pi + LM Studio (inference server) + Docker container + Gemma-4-26b-a4b or gemma-4-12b-qat. She posted complete models.json, Dockerfile, and docker-compose.yml in her blog.
II. How Good Is It Really?
Everyone is honest. No one claims local models are better than Claude.
The clearest comparison:
“Think of Qwen 3.6 35B as a junior engineer who knows a bit about everything—you need to guide him. Claude Opus is a senior engineer who can think through architecture with you. If Opus gives you 15x speedup, local Qwen gives you 5x speedup. Considering it’s completely free, I still find that incredible.”
5x vs 15x. The gap is real, but 5x is not zero.
Another user’s quantitative metric: local models achieve about 75% of frontier models’ effectiveness for agentic coding. Six months ago that number was probably under 30%.
Gerganov says he’s been using it almost daily for the past month and a half—handling small tasks in the llama.cpp repo. Not mind-blowing, but “definitely a useful tool for a maintainer.” His setup is extremely minimal: Pi stripped of everything plus a short system prompt.
III. Key Pitfalls and Fixes
1. You need to write prompts more precisely than with Claude
“You really need to know what you’re asking and be precise. Any assumptions left to its own devices, it will take the easiest path to the goal—like shoving all CSS into HTML. It won’t think about architecture for you.”
2. Loop problem—quantization level is key
“I found that using better quantization, like Q8, even if it runs slightly slower, saves time overall—way fewer useless retries and editing errors.”
One user mentioned that the 27B version is slower but more accurate with fewer loops than the 35B. “Wall clock time is what I care about, not tokens/sec.”
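The "wall clock, not tokens/sec" point can be made concrete with a toy calculation. This is only a sketch: the 55 tok/s figure is from the discussion, but the token counts, attempt counts, and the slower model's speed are hypothetical.

```python
def wall_clock_seconds(tokens_per_attempt: int, attempts: int, tok_per_s: float) -> float:
    """Total generation time when a task needs several attempts to land."""
    return tokens_per_attempt * attempts / tok_per_s


# Hypothetical: a fast model that loops and retries vs. a slower, steadier one.
fast_but_loopy = wall_clock_seconds(4000, attempts=4, tok_per_s=55)
slow_but_steady = wall_clock_seconds(4000, attempts=1, tok_per_s=20)

print(f"fast but loopy:  {fast_but_loopy:.0f} s")
print(f"slow but steady: {slow_but_steady:.0f} s")
```

Under these assumed numbers the slower model finishes first, which is the trade-off the commenters describe for Q8 over Q4 and for the 27B over the 35B.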
3. preserve_thinking: Qwen 3.6’s killer feature
This is the most technically substantive point in the discussion.
A common performance issue when using local models for agentic coding is that each turn re-processes the entire context. The reason is that most models’ Jinja templates discard the previous turn’s reasoning at the end of each turn, so the rendered prompt no longer matches the tokens already in the KV cache and the cached prefix cannot be reused. Qwen 3.6 is the first model trained for both “preserve thinking” and “don’t preserve thinking” modes; you can enable the former by setting preserve_thinking: true.
chat-template-kwargs = {"preserve_thinking": true}
This setting keeps the KV cache valid across turns, avoiding recomputing the entire context each time. For long agentic workflows, this is a massive speedup.
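To see why dropping reasoning hurts cache reuse, here is a toy model of prefix caching. It is a simplification under stated assumptions: real servers compare token IDs, not words, and the tags and message contents below are invented for illustration.

```python
from itertools import takewhile


def shared_prefix_len(cached: list[str], new: list[str]) -> int:
    """Number of leading tokens the new prompt shares with the cached sequence."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(cached, new)))


# Invented toy "tokens" for one finished turn plus the next user message.
system = ["<sys>", "You", "are", "a", "coding", "agent"]
user_1 = ["<user>", "fix", "the", "bug"]
assistant_open = ["<assistant>"]
think_1 = ["<think>", "check", "the", "loop", "bounds", "</think>"]
answer_1 = ["patched", "off-by-one"]
tool_result = ["<tool>"] + ["line"] * 50
user_2 = ["<user>", "now", "add", "a", "test"]

# What the server has in its KV cache after turn 1.
cached = system + user_1 + assistant_open + think_1 + answer_1 + tool_result

# Next prompt when the template keeps the reasoning vs. strips it.
preserved = cached + user_2
stripped = system + user_1 + assistant_open + answer_1 + tool_result + user_2

for name, prompt in [("preserved", preserved), ("stripped", stripped)]:
    reused = shared_prefix_len(cached, prompt)
    print(f"{name}: reuse {reused} tokens, recompute {len(prompt) - reused}")
```

In the stripped case reuse stops at the first removed reasoning block, so everything after it, including the long tool output, has to be processed again.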
IV. Why Put Up With This?
Three positions emerged in the discussion:
- Privacy/Principle: A developer working for an EU organization said his org doesn’t have clear AI usage guidelines. He sees colleagues pasting source code directly into Claude, but he refuses. “I know that anything running in my local, offline Pi container sandbox can’t leave this machine, so it can’t cause a data leak. I do this for peace of mind.”
- Cost: Someone ran the numbers. “I could run Gemini 3 Flash for 8 years and spend less than a Mac Studio 128GB.” Others countered: “But you buy a Mac Studio not just to run LLMs. You were going to upgrade your computer anyway. And a Gemini subscription is just a subscription; the computer can do other things.”
- AI Skeptic: A self-described “AI skeptic” said he’s not refusing to use AI; he’s testing the boundaries of various models, probing their strengths and weaknesses. “I’m not rejecting the tool. I’m rejecting losing understanding.” This comment echoes Karpathy’s line, “You can outsource thinking, but not understanding.”
Another user summarized the debate with an analogy:
“Some people are rice cooker skeptics. Some like rice cookers, some don’t. It doesn’t mean one is right or wrong.”
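The cost argument and its rebuttal both reduce to a break-even calculation. The sketch below uses purely hypothetical prices (they are not quotes for any real machine or subscription); the llm_share parameter models the rebuttal that only part of the hardware cost should be attributed to running LLMs.

```python
def break_even_months(hardware_cost: float, monthly_api_spend: float,
                      llm_share: float = 1.0) -> float:
    """Months of API spend needed to equal the LLM-attributed hardware cost."""
    return hardware_cost * llm_share / monthly_api_spend


# HYPOTHETICAL numbers chosen only to illustrate the shape of the argument.
full_cost = break_even_months(hardware_cost=4800, monthly_api_spend=50)
half_cost = break_even_months(hardware_cost=4800, monthly_api_spend=50, llm_share=0.5)

print(f"all hardware cost counted: {full_cost / 12:.0f} years")
print(f"half attributed to LLMs:   {half_cost / 12:.0f} years")
```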
V. Actionable Starter Setup
If you want to try now, here’s the shortest path combining HN discussion + Boykis + Gerganov:
- Hardware: Mac with 64GB+ unified memory (M2/M3/M4 all work)
- Software stack: Install llama.cpp (or LM Studio as an alternative), download the Qwen 3.6 35B-A3B Q8 model, and configure the Pi agent. Gerganov’s minimal invocation is pi -nc --offline; alternatively, refer to Boykis’s Docker Compose config.
- Key settings: Add chat-template-kwargs = {"preserve_thinking": true} to the model config; for the KV cache, use F16 K + Q8 V; AMD GPU users should choose the Vulkan backend over ROCm.
- Expected performance: About 50–75% of frontier model coding ability, 5x development speedup, completely free, fully offline.
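The key settings above could be turned into a llama-server command line roughly as follows. This is a sketch under assumptions: the flag names are as I understand them from recent llama.cpp builds and should be checked against llama-server --help, the model path is hypothetical, and some builds may require flash attention to be enabled before the V cache can be quantized.

```python
import json
import shlex


def llama_server_argv(model_path: str, ctx_size: int = 65536, port: int = 8080) -> list[str]:
    """Build an argv list for llama-server with the article's recommended settings."""
    return [
        "llama-server",
        "-m", model_path,
        "--ctx-size", str(ctx_size),      # context length is an assumed default
        "--port", str(port),
        "--jinja",                        # use the model's Jinja chat template
        "--cache-type-k", "f16",          # F16 K, per the discussion
        "--cache-type-v", "q8_0",         # Q8 V, per the discussion
        "--chat-template-kwargs", json.dumps({"preserve_thinking": True}),
    ]


# Hypothetical file name; substitute the GGUF you actually downloaded.
argv = llama_server_argv("models/qwen3.6-35b-a3b-q8_0.gguf")
print(shlex.join(argv))
```

Building the list in code and printing it with shlex.join avoids hand-quoting the JSON argument in a shell.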
One Sentence
By mid-2026, local models have finally crossed the threshold of “usable.” They are not a replacement for Claude; they are a complement. When you don’t trust the cloud, when you want full control, when you’re working on something that’s not classified but you’d rather not send out, a Mac and a Qwen are enough.
And six months ago, that was still impossible.
Sources:
- HN discussion: https://news.ycombinator.com/item?id=48542100…
- Vicki Boykis: https://vickiboykis.com/2026/06/15/running-local-models-is-good-now/…
- Georgi Gerganov HN comment
#LocalModels #Qwen #agentic-coding #llamacpp
Pi Coding Agent
Source: https://pi.dev/
Why Pi?
Pi is a minimal agent harness. Adapt Pi to your workflows, not the other way around. Customize Pi with extensions (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#extensions), skills (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#skills), prompt templates (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#prompt-templates), and themes (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#themes). Bundle them as Pi packages (https://pi.dev/packages) and share via npm or git.
Pi ships with powerful defaults but skips features like sub-agents and plan mode. Ask Pi to build what you want, or install a package that does it your way.
Four modes: interactive, print/JSON, RPC, and SDK (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#programmatic-usage). See OpenClaw (https://github.com/OpenClaw/OpenClaw) for a real-world integration.
Read the docs (https://pi.dev/docs/latest)
Change the harness, not your workflow
Pi isn’t a sealed product. If you need a command, tool, provider, workflow, or UI tweak, just ask Pi to build it. It will customize itself on the fly.
Have Pi manipulate itself in place, hit /reload, and keep going. If you think others will find what you built useful, share it!
15+ providers, hundreds of models
Anthropic, OpenAI, Google, Azure, Bedrock, Mistral, Groq, Cerebras, xAI, Hugging Face, Kimi For Coding, MiniMax, OpenRouter, Ollama, and more. Authenticate via API keys or OAuth.
Switch models mid-session with /model or Ctrl+L. Cycle through your favorites with Ctrl+P.
Add custom providers and models via models.json (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/docs/models.md) or extensions (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/docs/custom-provider.md).
Tree-structured, shareable history
Sessions are stored as trees. Use /tree to navigate to any previous point and continue from there. All branches live in a single file. Filter by message type, label entries as bookmarks.
Export to HTML with /export, or upload to a GitHub gist with /share and get a shareable URL that renders it. Example session (https://pi.dev/session/#0ea51497613daf7e1de28ee99950b074).
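A tree-structured history can be modeled with parent pointers. The sketch below is illustrative only: it is not Pi's session file format, and the class and method names are invented for this example.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Entry:
    id: int
    parent: Optional[int]   # None for the first entry in the session
    text: str
    label: Optional[str] = None  # optional bookmark


class SessionTree:
    """Hypothetical in-memory session tree; every branch lives in one dict."""

    def __init__(self) -> None:
        self.entries = {}
        self.head = None
        self._ids = count(1)

    def append(self, text: str, label: Optional[str] = None) -> int:
        """Add an entry under the current head and move the head to it."""
        entry = Entry(next(self._ids), self.head, text, label)
        self.entries[entry.id] = entry
        self.head = entry.id
        return entry.id

    def goto(self, entry_id: int) -> None:
        """Jump to an earlier point; the next append starts a new branch."""
        if entry_id not in self.entries:
            raise KeyError(entry_id)
        self.head = entry_id

    def path(self) -> list:
        """Texts from the root to the current head, i.e. the active context."""
        out, node = [], self.head
        while node is not None:
            out.append(self.entries[node].text)
            node = self.entries[node].parent
        return out[::-1]


tree = SessionTree()
start = tree.append("user: add login", label="start")
tree.append("agent: implemented with server sessions")
tree.goto(start)                       # go back and try another approach
tree.append("agent: implemented with JWT")
print(tree.path())
```

Both attempts remain in tree.entries, so nothing is lost when branching; only the path from the root to the head is sent as context.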
Context engineering
Pi’s minimal system prompt (https://github.com/earendil-works/pi/blob/main/packages/coding-agent/src/core/system-prompt.ts) and extensibility let you do actual context engineering. Control what goes into the context window and how it’s managed.
AGENTS.md: Project instructions loaded at startup from ~/.pi/agent/, parent directories, and the current directory.
SYSTEM.md: Replace or append to the default system prompt per-project.
Compaction: Auto-summarizes older messages when approaching the context limit. Fully customizable via extensions (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/custom-compaction.ts): implement topic-based compaction, code-aware summaries, or use different summarization models.
Skills: Capability packages with instructions and tools, loaded on-demand. Progressive disclosure without busting the prompt cache. See skills (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#skills).
Prompt templates: Reusable prompts as Markdown files. Type /name to expand. See prompt templates (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#prompt-templates).
Dynamic context: Extensions (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#extensions) can inject messages before each turn, filter the message history, implement RAG, or build long-term memory.
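The compaction idea above can be sketched generically. This is not Pi's extension API (Pi extensions are TypeScript modules); it is a minimal Python illustration with an invented function name, a whitespace word count standing in for a tokenizer, and a stub summarizer.

```python
from typing import Callable


def compact(messages: list[str], limit: int, keep_recent: int,
            summarize: Callable[[list[str]], str],
            count_tokens: Callable[[str], int] = lambda m: len(m.split())) -> list[str]:
    """Replace older messages with one summary once the history exceeds the limit."""
    total = sum(count_tokens(m) for m in messages)
    if total <= limit or len(messages) <= keep_recent:
        return messages
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    return [summarize(older)] + recent


history = [
    "read file a.py",
    "ran tests: 3 failed",
    "edited a.py line 10",
    "ran tests: all passed",
]
# Stub summarizer; a real one would call a (possibly different) model.
compacted = compact(history, limit=8, keep_recent=2,
                    summarize=lambda old: f"[summary of {len(old)} messages]")
print(compacted)
```

Passing the summarizer as a callback mirrors the customization points the text mentions: a topic-based or code-aware strategy would only swap that function.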
Steer or follow up
Submit messages while the agent works. Enter sends a steering message (delivered after current tool, interrupts remaining tools). Alt+Enter sends a follow-up (waits until the agent finishes).
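The difference between the two message types can be replayed as an event log. This is a toy simulation of the behavior described above, not Pi's implementation; the function name and log format are invented.

```python
from typing import Optional


def run_turn(planned_tools: list[str], steer_at: Optional[int] = None,
             steering: Optional[str] = None, follow_up: Optional[str] = None) -> list[str]:
    """Replay one agent turn.

    steer_at is the index of the tool that is running when the steering
    message arrives: that tool finishes, the remaining tools are skipped.
    A follow-up is only delivered once the turn has finished.
    """
    log = []
    for i, tool in enumerate(planned_tools):
        log.append(f"run {tool}")
        if steering is not None and steer_at == i:
            log.extend(f"skip {t}" for t in planned_tools[i + 1:])
            log.append(f"deliver steering: {steering}")
            break
    if follow_up is not None:
        log.append(f"deliver follow-up: {follow_up}")
    return log


print(run_turn(["read", "edit", "test"], steer_at=0, steering="use pytest"))
print(run_turn(["read", "edit"], follow_up="commit the change"))
```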
Four modes
Interactive: The full TUI experience.
Print/JSON: pi -p "query" for scripts, --mode json for event streams.
RPC: JSON protocol over stdin/stdout for non-Node integrations. See docs/rpc.md (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/docs/rpc.md).
SDK: Embed Pi in your apps. See OpenClaw (https://github.com/OpenClaw/OpenClaw) for a real-world example.
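For scripting, print mode can be driven from another program. The sketch below assumes that -p and --mode json can be combined in one invocation, which the text above does not confirm; the helper names are invented, and the JSON event format is deliberately left unparsed because it is not described here.

```python
import shutil
import subprocess
from typing import Optional


def pi_print_argv(query: str, json_events: bool = False) -> list[str]:
    """Build an argv list for a one-shot, non-interactive Pi run."""
    argv = ["pi", "-p", query]
    if json_events:
        argv += ["--mode", "json"]  # ASSUMPTION: combinable with -p
    return argv


def run_pi(query: str) -> Optional[str]:
    """Run Pi once and return its stdout, or None if pi is not installed."""
    if shutil.which("pi") is None:
        return None
    done = subprocess.run(pi_print_argv(query), capture_output=True, text=True, check=False)
    return done.stdout


print(pi_print_argv("summarize the failing tests"))
```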
Primitives, not features
Features that other agents bake in, you can build yourself. Extensions are TypeScript modules with access to tools, commands, keyboard shortcuts, events, and the full TUI.
Sub-agents (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/subagent/), plan mode (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/plan-mode/), permission gates (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/permission-gate.ts), path protection (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/protected-paths.ts), SSH execution (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/ssh.ts), sandboxing (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/sandbox/), MCP integration, custom editors, status bars, overlays.
Don’t want to build it? Ask Pi to build it for you. Or install a package (https://pi.dev/packages) that does it your way. See the 50+ examples (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/).
Bundle extensions, skills, prompts, and themes as packages. Install from npm or git:
$ pi install npm:@foo/pi-tools
$ pi install git:github.com/badlogic/pi-doom
Similar Articles
@zhixianio: After receiving the new machine, I began an 'ascetic' practice of forcing myself to use local models for common tasks. I thought it would be painful, but both speed and quality greatly exceeded my expectations: Model: Qwen3.6-35B-A3B-oQ6-fp16-mtp, Running: oMLX, with N…
The author uses the Qwen3.6-35B-A3B model and oMLX tool on the new local machine for daily tasks, finding that both speed and quality far exceed expectations, even outperforming remote LLMs in PA and coding scenarios, demonstrating a significant improvement in on-device AI capabilities.
@Michaelzsguo: https://x.com/Michaelzsguo/status/2053217839729791221
This article is a guide for local large model deployment, covering hardware selection, memory calculations, Runtime tool comparisons, and model quantization options, helping users from getting started to optimizing their local inference experience.
@intheworldofai: Qwen 3.7-Max is genuinely one of the most impressive agentic coding models I’ve tested in a while. I had it generate a …
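Alibaba has released Qwen (Tongyi Qianwen) 3.7 Max, a flagship coding model designed for the agent era. The model stands out in long-horizon autonomous execution, front-end generation, and 3D scene construction, matching or even surpassing top closed-source models on several benchmarks, which makes it a Chinese model close to the frontier.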
@davis7: @0xSero helped me setup local models properly and I uh, had no idea these things had gotten this good Are they frontier…
The author highlights the impressive capabilities of the open-source Qwen 3.6-27B model running locally on an RTX 5090, noting its strong performance on programming tasks and comparing it favorably to commercial models, despite the complexity of local deployment.
@sanbuphy: K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times…
K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times, boosting throughput from ~15 tokens/s to ~193 tokens/s, ultimately achieving 20% faster inference than LM Studio.