@yibie: Using Local Models as Primary Coding Tools: A Practical Report from Mid-2026 There was a post on Hacker News with a straightforward title: "Is anyone using local models as their primary coding tool?" 197 comments, incredibly dense with information. A dozen real users discussed their daily configurations, pitfalls they encountered, and why they still choose local models even though they know they're not as good as...

X AI KOLs Timeline News

Summary

This article summarizes practical experiences from a Hacker News discussion about using local models (mainly Qwen 3.6 35B-A3B) as primary coding tools, including configurations, effectiveness (approximately 50-75% of frontier models), key techniques (such as preserve_thinking), and different user positions.

Using Local Models as Primary Coding Tools: A Practical Report from Mid-2026 There was a post on Hacker News with a straightforward title: "Is anyone using local models as their primary coding tool?" 197 comments, incredibly dense with information. A dozen real users discussed their daily configurations, the pitfalls they encountered, and why they still choose to use local models even though they know local models are not as good as Claude. I combined Vicki Boykis's article on testing local models, comments from Georgi Gerganov (author of llama.cpp) on HN, and the highest-quality replies from this discussion to compile a practical report on local Agent coding from mid-2026. I. Consensus Configuration If you want to set up a local agentic coding environment now, the overwhelming consensus from the HN discussion is as follows: Model: Qwen 3.6 35B-A3B. 35B total parameters, but MoE architecture only activates 3B. Everyone's evaluation is surprisingly consistent — it's the 'sweet spot.' Fast enough (55 tok/s), coding capability sufficient. Inference engine: llama.cpp. No controversy. Agent framework: Pi (http://pi.dev). Also no controversy. Gerganov himself uses the minimal configuration pi -nc --offline; others run Pi in Docker containers, restricting filesystem permissions. The downside is that Pi doesn't support plan mode, subagent, MCP client — but no one seems to complain. Quantization: Q8. Some tried more aggressive quantization (Q4), with feedback that 'loops and edit errors increased significantly.' Recommended KV cache settings: F16 K + Q8 V. Hardware: Mac Studio 128GB or Strix Halo 128GB unified memory laptop. Or RTX 5090. The core constraint is VRAM/unified memory — must fit the model + KV cache. Boykis's full configuration: Pi + LM Studio (inference server) + Docker container + Gemma-4-26b-a4b or gemma-4-12b-qat. She posted the complete models.json, Dockerfile, and docker-compose.yml in her blog. II. How Good Is It Really? Everyone is honest. No one says local models are better than Claude. The clearest comparison is as follows: "Think of Qwen 3.6 35B as a junior engineer who knows a little bit about everything—you need to guide him. Claude Opus is a senior engineer who can think about architecture with you. If Opus gives you 15x speedup, local Qwen gives you 5x speedup. Considering it's completely free, that still blows my mind." 5x vs 15x speedup. The gap is real, but 5x is not zero. Another user's quantification: Local models achieve about 75% of the effectiveness of frontier models for agentic coding. Six months ago, that number might have been under 30%. Gerganov says he's used it almost every day for the past month and a half — handling small tasks in the llama.cpp repository. Not earth-shattering, but 'definitely a useful tool for a maintainer.' His configuration is extremely minimal: pi stripped of everything + a short system prompt. III. Key Pitfalls and Fixes 1. You need to write prompts more precisely than with Claude "You really need to know what you're asking and be precise. Any assumptions left to its own decisions, it will take the easiest path to the goal — like stuffing all CSS into HTML. It won't think ahead architecturally for you." 2. Loop problem — quantization level is key "I've found that with better quantization like Q8, even if it runs slightly slower, it saves time overall — significantly fewer useless retries and edit errors." One user mentioned that the 27B version is slower but more accurate, with fewer loops. 'Wall clock time is the metric that matters to me, not tokens/sec.' 3. preserve_thinking: Qwen 3.6's killer feature This is the most technically deep thread in the discussion. When using local models for agentic coding, a common performance issue is that each round of conversation requires reprocessing the entire context. The reason is that most models' Jinja templates discard the previous round's reasoning at the end of each turn. Qwen 3.6 is the first model trained with both 'preserve reasoning' and 'discard reasoning' modes — you can enable it by setting preserve_thinking: true. chat-template-kwargs = {"preserve_thinking": true} This setting keeps the KV cache effective across multiple turns, avoiding recalculating the entire context from scratch each time. For long agentic workflows, this is a huge speedup. IV. Why Put Up with This? Two stances emerged in the discussion: Privacy/principle faction: A developer working for an EU organization said his organization doesn't have clear AI usage guidelines yet. He sees colleagues pasting source code directly into Claude, but he insists on not doing so. 'I know that anything running in my local, offline Pi container sandbox cannot leave this machine, thus cannot cause data leaks. I do this for peace of mind.' Cost faction: Someone did the math. 'I could run Gemini 3 Flash for 8 years and spend less than one Mac Studio 128GB.' Someone countered: 'But you don't buy a Mac Studio just for running LLMs. You were going to upgrade your computer anyway. And Gemini subscription is just that—a subscription. The computer can do other things.' AI skeptics: A user who calls himself an 'AI skeptic' says he's not against using AI — he's testing the boundaries of various models, probing their strengths and weaknesses. 'I'm not rejecting tools. I'm rejecting the loss of understanding.' This statement echoes the one Karpathy once quoted: 'You can outsource thinking, but you cannot outsource understanding.' Another user summarized the debate with a metaphor: "Some people are rice cooker skeptics. Some like using rice cookers, some don't. That doesn't mean one is right and the other is wrong." V. Actionable Starter Configuration If you want to try it now, here is the shortest path synthesized from the HN discussion + Boykis + Gerganov: Hardware: Mac with 64GB+ unified memory (M2/M3/M4 all work) Software stack: Install llama.cpp (or use LM Studio as an alternative), download the Qwen 3.6 35B-A3B Q8 model, configure the Pi agent. Gerganov's minimal way is pi -nc --offline, or refer to Boykis's Docker Compose configuration. Key configuration: Add chat-template-kwargs = {"preserve_thinking": true} in the model config; KV cache uses F16 K + Q8 V; AMD GPU users select Vulkan inference backend instead of ROCm. Expected effect: Approximately 50-75% of frontier model coding ability, 5x development speedup, completely free, fully offline. In a word Local models in mid-2026 have finally crossed the threshold of 'usable.' They are not a replacement for Claude — they are complementary to Claude. When you don't trust the cloud, when you want full control, when you're working on something that isn't classified but you don't want to send it out — a Mac and a Qwen are enough. And six months ago, this was impossible. Sources: • HN Discussion: https://news.ycombinator.com/item?id=48542100… • Vicki Boykis: https://vickiboykis.com/2026/06/15/running-local-models-is-good-now/… • Georgi Gerganov HN Comments #LocalModels #Qwen #agentic-coding #llamacpp
Original Article
View Cached Full Text

Cached at: 06/18/26, 02:16 PM

Local Models as Primary Coding Tools: A Mid-2026 Field Report

There was a thread on Hacker News with a straightforward title: “Anyone using local models as their primary coding tool?”

197 comments, extremely high information density. A dozen real users discussing their daily configs, pitfalls they hit, and why—despite knowing local models aren’t as good as Claude—they still choose to use them.

I aggregated Vicki Boykis’s hands-on article on local models, llama.cpp author Georgi Gerganov’s HN comments, and the highest-quality replies from that discussion into a mid-2026 practical report on local agentic coding.

I. Consensus Setup

If you want to set up a local agentic coding environment today, the overwhelming consensus from the HN discussion is:

  • Model: Qwen 3.6 35B-A3B. 35B total parameters, but MoE architecture activates only 3B. Everyone’s reviews are surprisingly unanimous—it’s the sweet spot. Fast enough (55 tok/s), coding ability sufficient.
  • Inference Engine: llama.cpp. No debate.
  • Agent Framework: Pi (http://pi.dev). Also no debate. Gerganov himself uses the minimal config pi -nc --offline; others run Pi inside Docker containers with restricted filesystem permissions. Downside: Pi doesn’t support plan mode, subagent, or MCP client—but nobody seems to complain.
  • Quantization: Q8. Some tried more aggressive quantization (Q4), feedback was “loop and edit errors increased significantly.” KV cache recommendation: F16 K + Q8 V.
  • Hardware: Mac Studio 128GB or Strix Halo 128GB unified memory laptop. Or RTX 5090. Core constraint: VRAM / unified memory—needs to hold the model plus KV cache.

Boykis’s full config: Pi + LM Studio (inference server) + Docker container + Gemma-4-26b-a4b or gemma-4-12b-qat. She posted complete models.json, Dockerfile, and docker-compose.yml in her blog.

II. How Good Is It Really?

Everyone is honest. No one claims local models are better than Claude.

The clearest comparison:

“Think of Qwen 3.6 35B as a junior engineer who knows a bit about everything—you need to guide him. Claude Opus is a senior engineer who can think through architecture with you. If Opus gives you 15x speedup, local Qwen gives you 5x speedup. Considering it’s completely free, I still find that incredible.”

5x vs 15x. The gap is real, but 5x is not zero.

Another user’s quantitative metric: local models achieve about 75% of frontier models’ effectiveness for agentic coding. Six months ago that number was probably under 30%.

Gerganov says he’s been using it almost daily for the past month and a half—handling small tasks in the llama.cpp repo. Not mind-blowing, but “definitely a useful tool for a maintainer.” His setup is extremely minimal: Pi stripped of everything plus a short system prompt.

III. Key Pitfalls and Fixes

1. You need to write prompts more precisely than with Claude

“You really need to know what you’re asking and be precise. Any assumptions left to its own devices, it will take the easiest path to the goal—like shoving all CSS into HTML. It won’t think about architecture for you.”

2. Loop problem—quantization level is key

“I found that using better quantization, like Q8, even if it runs slightly slower, saves time overall—way fewer useless retries and editing errors.”

One user mentioned that the 27B version is slower but more accurate with fewer loops than the 35B. “Wall clock time is what I care about, not tokens/sec.”

3. preserve_thinking: Qwen 3.6’s killer feature

This is the most technically deep clue in the discussion.

A common performance issue when using local models for agentic coding is that each turn re-processes the entire context. The reason is that most models’ Jinja templates discard the previous turn’s reasoning at the end of each turn. Qwen 3.6 is the first model trained for both “preserve thinking” and “don’t preserve thinking” modes—you can enable it by setting preserve_thinking: true.

chat-template-kwargs = {"preserve_thinking": true}

This setting keeps the KV cache valid across turns, avoiding recomputing the entire context each time. For long agentic workflows, this is a massive speedup.

IV. Why Put Up With This?

Two camps emerged in the discussion:

  • Privacy/Principle: A developer working for an EU organization said his org doesn’t have clear AI usage guidelines. He sees colleagues pasting source code directly into Claude, but he refuses. “I know that anything running in my local, offline Pi container sandbox can’t leave this machine, so it can’t cause a data leak. I do this for peace of mind.”

  • Cost: Someone ran the numbers. “I could run Gemini 3 Flash for 8 years and spend less than a Mac Studio 128GB.”

Others countered: “But you buy a Mac Studio not just to run LLMs. You were going to upgrade your computer anyway. And a Gemini subscription is just a subscription; the computer can do other things.”

  • AI Skeptic: A self-described “AI skeptic” said he’s not refusing to use AI—he’s testing the boundaries of various models, probing their strengths and weaknesses. “I’m not rejecting the tool. I’m rejecting losing understanding.” This comment forms a curious dialogue with Karpathy’s quote, “You can outsource thinking, but not understanding.”

Another user summarized the debate with an analogy:

“Some people are rice cooker skeptics. Some like rice cookers, some don’t. It doesn’t mean one is right or wrong.”

V. Actionable Starter Setup

If you want to try now, here’s the shortest path combining HN discussion + Boykis + Gerganov:

  • Hardware: Mac with 64GB+ unified memory (M2/M3/M4 all work)
  • Software stack: Install llama.cpp (or LM Studio as an alternative), download Qwen 3.6 35B-A3B Q8 model, configure Pi agent. Gerganov’s minimal: pi -nc --offline, or refer to Boykis’s Docker Compose config.
  • Key settings: Add chat-template-kwargs = {"preserve_thinking": true} in model config; KV cache use F16 K + Q8 V; AMD GPU users choose Vulkan backend over ROCm.
  • Expected performance: About 50–75% of frontier model coding ability, 5x development speedup, completely free, fully offline.

One Sentence

By mid-2026, local models have finally crossed the threshold of “usable.” It’s not a replacement for Claude—it’s a complement. When you don’t trust the cloud, when you want full control, when you’re working on something that’s not classified but you’d rather not send out—a Mac and a Qwen are enough.

And six months ago, that was still impossible.

Sources:

  • HN discussion: https://news.ycombinator.com/item?id=48542100…
  • Vicki Boykis: https://vickiboykis.com/2026/06/15/running-local-models-is-good-now/…
  • Georgi Gerganov HN comment

#LocalModels #Qwen #agentic-coding #llamacpp


Pi Coding Agent

Source: https://pi.dev/

Why Pi?

Pi is a minimal agent harness. Adapt Pi to your workflows, not the other way around. Customize Pi with extensions (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#extensions), skills (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#skills), prompt templates (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#prompt-templates), and themes (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#themes). Bundle them as Pi packages (https://pi.dev/packages) and share via npm or git.

Pi ships with powerful defaults but skips features like sub-agents and plan mode. Ask Pi to build what you want, or install a package that does it your way.

Four modes: interactive, print/JSON, RPC, and SDK (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#programmatic-usage). See OpenClaw (https://github.com/OpenClaw/OpenClaw) for a real-world integration.

Read the docs (https://pi.dev/docs/latest)

Change the harness, not your workflow

Pi isn’t a sealed product. If you need a command, tool, provider, workflow, or UI tweak, just ask Pi to build it. It will customize itself on the fly.

Have Pi manipulate itself in place, hit /reload, and keep going. If you think others will find what you built useful, share it!

15+ providers, hundreds of models

Anthropic, OpenAI, Google, Azure, Bedrock, Mistral, Groq, Cerebras, xAI, Hugging Face, Kimi For Coding, MiniMax, OpenRouter, Ollama, and more. Authenticate via API keys or OAuth.

Switch models mid-session with /model or Ctrl+L. Cycle through your favorites with Ctrl+P.

Add custom providers and models via models.json (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/docs/models.md) or extensions (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/docs/custom-provider.md).

Tree-structured, shareable history

Sessions are stored as trees. Use /tree to navigate to any previous point and continue from there. All branches live in a single file. Filter by message type, label entries as bookmarks.

Export to HTML with /export, or upload to a GitHub gist with /share and get a shareable URL that renders it. Example session (https://pi.dev/session/#0ea51497613daf7e1de28ee99950b074).

Context engineering

Pi’s minimal system prompt (https://github.com/earendil-works/pi/blob/main/packages/coding-agent/src/core/system-prompt.ts) and extensibility let you do actual context engineering. Control what goes into the context window and how it’s managed.

AGENTS.md: Project instructions loaded at startup from ~/.pi/agent/, parent directories, and the current directory.

SYSTEM.md: Replace or append to the default system prompt per-project.

Compaction: Auto-summarizes older messages when approaching the context limit. Fully customizable via extensions (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/custom-compaction.ts): implement topic-based compaction, code-aware summaries, or use different summarization models.

Skills: Capability packages with instructions and tools, loaded on-demand. Progressive disclosure without busting the prompt cache. See skills (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#skills).

Prompt templates: Reusable prompts as Markdown files. Type /name to expand. See prompt templates (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#prompt-templates).

Dynamic context: Extensions (https://github.com/earendil-works/pi/tree/main/packages/coding-agent#extensions) can inject messages before each turn, filter the message history, implement RAG, or build long-term memory.

Steer or follow up

Submit messages while the agent works. Enter sends a steering message (delivered after current tool, interrupts remaining tools). Alt+Enter sends a follow-up (waits until the agent finishes).

Four modes

Interactive: The full TUI experience.

Print/JSON: pi -p "query" for scripts, --mode json for event streams.

RPC: JSON protocol over stdin/stdout for non-Node integrations. See docs/rpc.md (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/docs/rpc.md).

SDK: Embed Pi in your apps. See OpenClaw (https://github.com/OpenClaw/OpenClaw) for a real-world example.

Primitives, not features

Features that other agents bake in, you can build yourself. Extensions are TypeScript modules with access to tools, commands, keyboard shortcuts, events, and the full TUI.

Sub-agents (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/subagent/), plan mode (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/plan-mode/), permission gates (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/permission-gate.ts), path protection (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/protected-paths.ts), SSH execution (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/ssh.ts), sandboxing (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/sandbox/), MCP integration, custom editors, status bars, overlays.

Don’t want to build it? Ask Pi to build it for you. Or install a package (https://pi.dev/packages) that does it your way. See the 50+ examples (https://github.com/earendil-works/pi/tree/main/packages/coding-agent/examples/extensions/).

Bundle extensions, skills, prompts, and themes as packages. Install from npm or git:

$ pi install npm:@foo/pi-tools
$ pi install git:github.com/badlogic/pi-doom

Similar Articles

@zhixianio: After receiving the new machine, I began an 'ascetic' practice of forcing myself to use local models for common tasks. I thought it would be painful, but both speed and quality greatly exceeded my expectations: Model: Qwen3.6-35B-A3B-oQ6-fp16-mtp, Running: oMLX, with N…

X AI KOLs Timeline

The author uses the Qwen3.6-35B-A3B model and oMLX tool on the new local machine for daily tasks, finding that both speed and quality far exceed expectations, even outperforming remote LLMs in PA and coding scenarios, demonstrating a significant improvement in on-device AI capabilities.

@sanbuphy: K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times…

X AI KOLs Timeline

K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times, boosting throughput from ~15 tokens/s to ~193 tokens/s, ultimately achieving 20% faster inference than LM Studio.