@port_dev: https://x.com/port_dev/status/2054259445732110408

Summary

The article provides a detailed tutorial on setting up a local coding agent using Qwen3.6-27B via Unsloth Studio and the Pi coding harness. It highlights the benefits of using GGUF quantized models for efficient inference on consumer hardware like Apple Silicon Macs.

Cached at: 05/13/26, 02:20 PM

Setting up Qwen3.6-27B with Unsloth and Pi coding agent

I always wanted a local coding agent that could work inside a real repo without costing me 100 bucks a month. Here is what I landed on after a lot of trial and error:

  • Unsloth Studio as the local OpenAI-compatible server.

  • The Unsloth GGUF build of Qwen3.6-27B for laptop or workstation inference.

  • Pi Coding Agent as the terminal coding harness.

The short version: download the GGUF, point unsloth studio run at it, then point Pi at the Studio endpoint.

Or hand the setup to an agent.

If you already use Claude Code, Codex, or another coding agent, you can skip the rest of this article. Paste this prompt and let it do the work; it covers everything below, including the gotchas:

By the way, my computer is a MacBook Pro with an M4 Max and 36 GB of RAM.

Check out the amazing https://www.whatcani.run/ by @fiveoutofnine to see if you can run the model.

If you’d rather do it by hand (or want to understand each step), keep reading.

Why Qwen3.6-27B?

Qwen3.6-27B sits in a useful middle ground for coding agents. It is not a tiny autocomplete model, but it is also not a giant model that only makes sense behind a cloud API. The official Qwen model card describes it as a 27B model with a vision encoder, long context support, and a focus on agentic coding and repository-level reasoning.

For a local coding agent, that matters more than chat benchmark vibes. The agent needs to read files, reason about a codebase, make edits, recover from tool errors, and stay coherent across a long task.

For consumer hardware, I would start with Unsloth’s GGUF quant instead of the raw BF16 checkpoint. The BF16 model is useful if you have a serious multi-GPU box. The GGUF route is the practical path if you want this running on a developer machine.

What you need

On macOS, the quickest path is:

On Linux, install Pi the same way. The Unsloth install script works there too.

Confirm both are on PATH:

Download the model

The 27B GGUF at UD-Q4_K_XL is about 16 GB. Pull it once into a known path:

-C - resumes from the current byte offset, so if the connection drops you can re-run the same command and it picks up where it left off. Don’t run two copies of this command at the same time: they will both write to the file and corrupt it.

You can also let Studio download the model on first run, but the CLI can time out before a cold 16 GB download finishes. Pulling the GGUF yourself is more predictable.
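To make the resume behaviour concrete, here is a minimal Python sketch of the same idea: look at how many bytes are already on disk and ask the server for the rest with an HTTP Range header. The lock-file guard is my own addition to illustrate the "don’t run two copies" warning; curl does not do this for you, and the function names are hypothetical.

```python
import os
import tempfile


def resume_range_header(path):
    """Return the Range header needed to continue a partial download.

    An empty dict means "start from byte 0" (no partial file yet).
    """
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    return {"Range": f"bytes={offset}-"} if offset else {}


def acquire_download_lock(path):
    """Atomically create '<path>.lock'; return False if another run holds it."""
    try:
        fd = os.open(path + ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release_download_lock(path):
    """Remove the lock file so a later run can proceed."""
    os.remove(path + ".lock")


if __name__ == "__main__":
    # Demo against a throwaway file, not a real model download.
    target = os.path.join(tempfile.mkdtemp(), "model.gguf")
    with open(target, "wb") as f:
        f.write(b"\0" * 1024)  # pretend 1 KiB arrived before the drop
    print(resume_range_header(target))
```

The lock is only advisory: if a download crashes, the stale `.lock` file has to be removed by hand before the next attempt.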

Start the local model server

16384 is a reasonable default on a 36 GB Apple Silicon Mac; bump it up if you have more RAM (see Hardware notes below).

When the server is ready you’ll see something like:

Save the API key. Studio only stores a hash; it won’t show the plaintext key again.

Sanity check the endpoint:

You should get back a JSON list with one entry. The id is the full path you loaded. Keep that string handy.
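If you would rather script this check, the sketch below does the same thing from Python with only the standard library. It assumes the server listens on http://127.0.0.1:8888 (the address used later in this article) and that the response is either a bare list or the usual OpenAI-style envelope with a "data" array; the exact response shape is an assumption, and the function names are mine.

```python
import json
import urllib.request


def build_models_request(base_url, api_key):
    """Build a GET request for <base_url>/v1/models with a Bearer token."""
    url = base_url.rstrip("/") + "/v1/models"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )


def extract_model_ids(payload):
    """Pull model ids out of a bare list or a {"data": [...]} envelope."""
    entries = payload.get("data", []) if isinstance(payload, dict) else payload
    return [entry["id"] for entry in entries]


def list_models(base_url, api_key, timeout=10):
    """Query the running server and return the ids it reports."""
    request = build_models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_model_ids(json.load(response))
```

With the server running, `list_models("http://127.0.0.1:8888", key)` should return a single id, which is the string to copy into the Pi configuration.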

Configure Pi

Create or edit ~/.pi/agent/models.json. Substitute your own API key and the model path you used:

The compat flags matter. Local OpenAI-compatible servers are close to the OpenAI API but not identical, and Studio rejects fields like the developer role and reasoning_effort.
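To show what that kind of compatibility shim amounts to, here is a small Python illustration of rewriting a chat request so it avoids the two fields mentioned above. This is not Pi’s implementation, only a sketch of the idea; the function name is hypothetical, and mapping the developer role to system is my assumption about a sensible fallback.

```python
import copy


def sanitize_chat_request(body):
    """Return a copy of an OpenAI-style chat request without the fields
    a stricter local server may reject."""
    cleaned = copy.deepcopy(body)  # never mutate the caller's request
    cleaned.pop("reasoning_effort", None)
    for message in cleaned.get("messages", []):
        if message.get("role") == "developer":
            message["role"] = "system"  # assumed fallback role
    return cleaned
```

For example, a request whose first message has role "developer" and which sets reasoning_effort comes back with a "system" message and no reasoning_effort key, while the original dict is left untouched.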

You can confirm Pi sees the model:

Make pi use Qwen by default

Typing --provider unsloth-studio --model /Users/you/… every time gets old fast. Pi reads defaults from ~/.pi/agent/settings.json. Add these two keys (merge them with any existing content; Pi writes its own keys, such as lastChangelogVersion, there):

After that, plain pi in any directory uses the local model. You can still override per-invocation with --provider / --model when you want to compare against a hosted model.
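Merging by hand is easy to get wrong, so here is a minimal Python sketch of a merge that keeps whatever Pi has already written. The key names defaultProvider and defaultModel in the usage note are placeholders I made up for illustration; use the keys the Pi docs specify.

```python
import json
import os
import tempfile


def merge_settings(existing, updates):
    """Return a new dict: existing keys are kept, updated keys win."""
    return {**existing, **updates}


def update_settings_file(path, updates):
    """Merge updates into the JSON file at path without dropping other keys."""
    existing = {}
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            existing = json.load(f)
    merged = merge_settings(existing, updates)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, indent=2)
        f.write("\n")
    os.replace(tmp_path, path)  # swap in the new file in one step
    return merged


if __name__ == "__main__":
    # Demo on a throwaway file rather than the real settings.json.
    demo = os.path.join(tempfile.mkdtemp(), "settings.json")
    print(update_settings_file(demo, {"defaultProvider": "unsloth-studio"}))
```

Calling `update_settings_file(path, {"defaultProvider": "unsloth-studio", "defaultModel": "/Users/you/…"})` leaves lastChangelogVersion and any other existing keys in place.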

Run Pi in a repo

Try a small read-only task first:

Read the project structure and explain how the app starts. Do not edit files yet.

Then a contained edit:

Find one low-risk bug or cleanup in the codebase, make the change, and run the narrowest relevant test.

This is where the setup becomes useful. Pi is the coding harness. Qwen is the local reasoning engine. Unsloth Studio is the adapter between them.

Hardware notes

--max-seq-length is the big knob. A 27B model plus a large KV cache eats memory fast.

My starting rule on macOS:

  • 36 GB Apple Silicon: 16384 is the practical sweet spot. It is long enough for real coding tasks and short enough that per-token latency stays bearable. 32768 works, but generation gets noticeably slower as the context fills.

  • 64 GB or more: try 32768 or 65536. Use 131072 only if you have headroom to spare.

  • If you hit OOM, drop the sequence length before changing quants.

  • UD-Q4_K_XL is a good quality baseline. Use Q3_K_XL if you need more speed.

Whatever number you pick, set contextWindow in models.json to match; otherwise Pi will try to send prompts longer than the server can handle.
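A quick way to catch a mismatch is to scan the parsed config for every contextWindow value and compare it with the sequence length the server was started with. The sketch below walks the JSON generically, so it does not depend on the exact layout of models.json; the nested layout in the usage note is a made-up example, not the real schema.

```python
def find_context_windows(node, path="$"):
    """Yield (json_path, value) for every 'contextWindow' key in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}"
            if key == "contextWindow":
                yield child_path, value
            else:
                yield from find_context_windows(value, child_path)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_context_windows(item, f"{path}[{index}]")


def mismatched_context_windows(config, max_seq_length):
    """Return the contextWindow entries that differ from the server setting."""
    return [
        (path, value)
        for path, value in find_context_windows(config)
        if value != max_seq_length
    ]
```

For a hypothetical config such as `{"providers": {"unsloth-studio": {"models": [{"contextWindow": 32768}]}}}` and a server started at 16384, the function reports the one offending path; an empty list means the two settings agree.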

Qwen3.6 emits reasoning by default: completions include <think> blocks before the final answer, and those blocks stream through as part of Pi’s output. You can read along, or ignore them and skim to the answer. Generation runs around 8 tok/s on Apple Silicon at Q4 for 27B, so expect a short pause before each response.
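If you capture raw completions yourself and want only the final answer, a small post-processing step is enough. The sketch below assumes the reasoning is wrapped in <think>…</think> tags and that the blocks are complete (not cut off mid-stream); it also turns the 8 tok/s figure into a rough wait estimate. Both helper names are hypothetical.

```python
import re

# Non-greedy match so several reasoning blocks are captured separately.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (list_of_reasoning_blocks, final_answer) from a raw completion."""
    reasoning = [block.strip() for block in THINK_BLOCK.findall(text)]
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


def seconds_to_generate(tokens, tokens_per_second=8.0):
    """Rough generation time at a steady decode rate."""
    return tokens / tokens_per_second
```

At 8 tok/s, 400 tokens of reasoning take about 50 seconds before the answer starts, which is why trimming unnecessary context and keeping tasks small pays off.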

For long coding sessions, I care more about stability than maximum benchmark performance. A local model that reliably reads, edits, and tests small changes is more useful than a maxed-out setup that crashes halfway through a task.

Where Unsloth Studio’s UI fits

unsloth studio run is headless; it just serves the API. If you want a UI to browse models, test prompts, or chat with a loaded model:

Then open http://127.0.0.1:8888. The UI shares the same backend, so any model you load is also reachable on /v1. Useful for testing prompts before turning Pi loose with edit tools.

Troubleshooting

**Studio prints a TimeoutError on first run.**

The CLI calls its own server to load the model and gives up before a cold 16 GB download finishes. The server keeps downloading. Either wait for ~/.cache/huggingface/…/*.gguf to finish caching and re-run, or pre-download with the curl -C - step above (recommended).

**Pi cannot see the model.** Hit the endpoint directly: curl http://127.0.0.1:8888/v1/models -H "Authorization: Bearer $KEY". If that fails, the problem is Studio, not Pi.

**Pi errors on request fields.** Keep the compat block in models.json. Local OpenAI-compatible endpoints often reject newer fields like the developer role or reasoning_effort.

**Generation feels slow.** 8 tok/s is normal for 27B at Q4 on Apple Silicon. If you’re not using the full window, drop --max-seq-length; both load time and per-token latency improve.

**Studio’s sk-unsloth- key got lost.** Studio only shows it once. Stop the server, run again with --api-key-name something, and a fresh key prints on startup. The old hash is still on disk; you can prune it from the Studio settings UI.

Final take

This is the local coding-agent stack I would start with right now:

Qwen3.6-27B GGUF → Unsloth Studio on localhost → Pi Coding Agent → your repo

It is not going to replace every hosted frontier model workflow. But it is good enough to make local-first coding agents feel real: private repo context, zero per-token cost, and a model that is actually built for agentic coding instead of just casual chat.

References

Unsloth Studio docs: https://unsloth.ai/docs/new/studio

Unsloth Studio install: https://unsloth.ai/docs/new/studio/install

Unsloth Studio API: https://unsloth.ai/docs/basics/api

Qwen3.6-27B GGUF: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF

Qwen3.6-27B model card: https://huggingface.co/Qwen/Qwen3.6-27B

Pi docs: https://pi.dev/docs/latest

Pi custom models: https://pi.dev/docs/latest/models
