@Michaelzsguo: https://x.com/Michaelzsguo/status/2056842405815447684


Summary

A practical guide to organizing local LLM experiments by using a layered wrapper system and a consistent directory structure to avoid model location drift, flag amnesia, and harness coupling.

https://t.co/kANk3hpLU4

Cached at: 05/20/26, 06:27 AM

How to Stay Organized When You’re Running Multiple Local LLMs

§0 — TL;DR

You’re playing with Gemma 4 31B on MLX. It’s working. Then you want to compare it against Qwen3 30B-A3B — same task, different architecture. While you’re at it, why not try Gemma 4 in GGUF via llama.cpp, since you heard llama.cpp now supports MTP and you want to see if speculative decoding actually helps? You get that running. Then you want to wire up Codex CLI to it. That works, so now you’re curious whether Qwen Code or Pi gives you better tool calls on the same model. Maybe both, side by side.

Three models. Two runtimes. Two formats. Three harnesses. All in one afternoon.

The next morning: which port was Gemma on? What --prefill-step-size did you use for the MLX run that felt fast? Was the Qwen3 GGUF the Q4_K_M or the Q6_K? Which proxy was Codex pointed at?

None of it is written down. It’s all in your shell history and your head — until it isn’t.

What this article covers: one organizing rule, five layers, and a directory structure that makes every new model a five-minute addition instead of a thirty-minute archaeology session.

Two ways to read this:

  • New to local LLM setup → §1 through §4 in sequence. §3 has a directory structure you can copy directly.

  • Already running models, want the system → §2 (the rule) → §4 (full walkthrough) → §6 (adding a new model in three files).

§ 1 — The three failure modes

Before the solution, name the debt precisely. Local LLM experimentation fails in three specific ways.

Model location drift. Your GGUF files live in one place, your MLX weights somewhere else, LM Studio’s cache in a third. You know this implicitly, but it’s not written down. Six months in, you run ls and find four copies of Qwen 27B across three directories, and you’re not sure which one is current.

Flag amnesia. The flags that made a model perform well (--prefill-step-size 2048, --prompt-cache-bytes 50GB, --ctx-size 262144) live in your shell history. They’re not in a file. When you launch that model two weeks later, you’re starting from scratch. You’ll get it running, but it won’t be the same configuration that worked.

Harness coupling. Claude Code, Codex, and Qwen Code each have their own way of pointing at a model. Without a shared model map, every harness reinvents its own configuration. Add a new model and you’re updating three places. Remove a model and you’re hoping you remembered all three.

The concrete version of this: you’ve been using Qwen 27B for coding. Two weeks later you try to bring it back. Which runtime — MLX or GGUF? What port? What flags? Was the proxy running? You can recover this. It’ll take twenty minutes of archaeology. That’s twenty minutes you’ll spend every time, for every model, until you put the information somewhere durable.

§ 2 — The rule: one job per layer

The organizing principle is a layered wrapper system. Five layers, each with exactly one job.

That gives us a few nice properties:

  • Wrappers are stable names you can type without thinking.

  • Backend details are versioned, documented, and agent-readable.

  • MLX, llama.cpp, Claude Code, Codex, and Qwen Code can share the same model profiles instead of each inventing its own map.

  • “Serving the model” and “adapting an agent to the model” are separate layers.

  • When adding a new model, we add one server profile, then thin wrapper mappings on top.

Put it in a diagram to see the data flow:

[Figure: data-flow diagram of the five layers; the image is not preserved in this cached text.]

The focal layer is config. That’s where runtime flags live — model paths, ports, cache sizes, MLX tuning parameters. Everything above it is a name. Everything below it is infrastructure. Config is the only layer you edit during normal operation.
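
As a minimal sketch of what the config layer might hold, the function below maps a profile name to a model path, a listen address, and a flag string. The values are the ones listed for qwen27b in §4; the function names `load_server_profile` and `describe_server_profile` and the variable names are assumptions made for illustration, not the actual contents of mlx-servers.zsh. It is written in portable shell so it could be sourced from zsh or bash.

```shell
# Hypothetical server-profile lookup; all names here are illustrative assumptions.
load_server_profile() {
  case "$1" in
    qwen27b)
      MODEL_PATH="$HOME/.lmstudio/models/Brooooooklyn/Qwen3.6-27B-UD-Q6_K_XL-mlx"
      SERVER_HOST=0.0.0.0
      SERVER_PORT=8080
      MLX_FLAGS="--max-tokens 8192 --prefill-step-size 2048 --prompt-cache-bytes 50GB --pipeline --prompt-concurrency 1"
      ;;
    *)
      # Unknown names fail loudly instead of silently using stale values.
      echo "unknown server profile: $1" >&2
      return 1
      ;;
  esac
}

# Print the resolved settings so a launcher (or a human) can inspect them.
describe_server_profile() {
  load_server_profile "$1" || return 1
  printf 'model: %s\n' "$MODEL_PATH"
  printf 'listen: %s:%s\n' "$SERVER_HOST" "$SERVER_PORT"
  printf 'flags: %s\n' "$MLX_FLAGS"
}
```

Calling `describe_server_profile qwen27b` prints the three resolved settings. Under this design, tuning a model means editing one `case` branch and nothing else.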

§ 3 — The directory layout

/Users/michaelguo/local-llm
├── docs/
│   ├── local-coding-model-stack.md
│   └── mlx-serving.md
├── models/
│   ├── gguf/
│   └── mlx/
└── runtime/
    ├── bin/
    │   ├── mlx-serve
    │   ├── llama-serve
    │   ├── claude-local
    │   ├── codex-local
    │   ├── qwen-code-local
    │   ├── qwen-code-openai-proxy.py
    │   └── qwen-code-local-fetch.cjs
    ├── config/
    │   ├── mlx-servers.zsh
    │   ├── llama-servers.zsh
    │   ├── claude-wrappers.zsh
    │   ├── codex-wrappers.zsh
    │   ├── qwen-code-wrappers.zsh
    │   └── qwen-code-defaults.json
    └── logs/
        ├── mlx-qwen27b.log
        └── qwen-code-qwen-proxy.log

And outside the repo:

/Users/michaelguo/.local/bin/
├── qwen-code-qwen
├── qwen-code-qwen27b
├── qwen-code-qwen35b
├── codex-qwen
├── claude-qwen
└── LOCAL_LLM_WRAPPERS.md

Three decisions embedded in this layout deserve explanation.

Why runtime/config/ and not inline in the wrapper? Wrappers are stable. qwen-code-qwen is a three-line file that will never change. The flags it eventually invokes will change every time you tune a new model or update the runtime. Keeping them separate means the stable thing stays stable.

Why is models/ not inside runtime/? Model weights are data, not configuration. They’re large, they don’t version the same way, and they’re shared across runtimes — the same GGUF weights might be served by llama.cpp or LM Studio. Separating them keeps that boundary clean.

Why is docs/ a first-class directory? Because it’s the contract between sessions and between you and any agent you ask to interact with the system. More on that in §7.

The wrappers in ~/.local/bin/ sit outside the repo deliberately. They’re the stable surface — the commands you type without thinking. The repo under local-llm/ is what evolves.
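
For readers starting from nothing, the skeleton from this section can be created in one step. The following is a small sketch, assuming you want the same directory names as the layout above; the function name `scaffold_local_llm` is made up for this example, and it creates only the directories, not the scripts or config files.

```shell
# Create the directory skeleton under a root (default: ~/local-llm).
# Directory names follow the layout in this section; the function name is hypothetical.
scaffold_local_llm() {
  root="${1:-$HOME/local-llm}"
  for dir in docs models/gguf models/mlx runtime/bin runtime/config runtime/logs; do
    mkdir -p "$root/$dir" || return 1
  done
  # Echo the root so callers can capture or cd into it.
  echo "$root"
}
```

Because `mkdir -p` leaves existing directories alone, running `scaffold_local_llm` a second time is harmless.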

§ 4 — Walking a request end-to-end

Type qwen-code-qwen. Here’s exactly what happens at each hop.

In more detail, the same path looks like this:

/Users/michaelguo/.local/bin/qwen-code-qwen
-> /Users/michaelguo/local-llm/runtime/bin/qwen-code-local qwen
  -> /Users/michaelguo/local-llm/runtime/config/qwen-code-wrappers.zsh
    profile: qwen
    kind: mlx
    server_profile: qwen27b
    model_id: default_model
    proxy: 9211
    context_window: 262144

-> /Users/michaelguo/local-llm/runtime/bin/mlx-serve start qwen27b
  -> /Users/michaelguo/local-llm/runtime/config/mlx-servers.zsh
    model path:
    /Users/michaelguo/.lmstudio/models/Brooooooklyn/Qwen3.6-27B-UD-Q6_K_XL-mlx

    server:
    0.0.0.0:8080

    MLX flags:
    --max-tokens 8192
    --prefill-step-size 2048
    --prompt-cache-bytes 50GB
    --pipeline
    --prompt-concurrency 1

-> /Users/michaelguo/local-llm/runtime/bin/qwen-code-openai-proxy.py
  proxy:
  127.0.0.1:9211 -> 127.0.0.1:8080/v1

  injects:
  {"enable_thinking": false}

-> /opt/homebrew/bin/qwen
  env:
  OPENAI_BASE_URL=http://127.0.0.1:9211/v1
  OPENAI_MODEL=default_model
  QWEN_CODE_SYSTEM_DEFAULTS_PATH=/Users/michaelguo/local-llm/runtime/config/qwen-code-defaults.json

-> MLX model server
  -> Qwen3.6 27B MLX model
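
The middle of that chain, from wrapper profile to harness environment, can be sketched in a few lines of shell. This is an illustration under stated assumptions: the field values come from the walkthrough above, but the function names `load_wrapper_profile` and `harness_env` are hypothetical, and the real qwen-code-local script may be organized differently.

```shell
# Hypothetical wrapper-profile lookup mirroring the fields shown in the walkthrough.
load_wrapper_profile() {
  case "$1" in
    qwen)
      KIND=mlx
      SERVER_PROFILE=qwen27b
      MODEL_ID=default_model
      PROXY_PORT=9211
      CONTEXT_WINDOW=262144
      ;;
    *)
      echo "unknown wrapper profile: $1" >&2
      return 1
      ;;
  esac
}

# Emit the environment the harness would be started with.
# The harness talks to the proxy port, never to the model server directly.
harness_env() {
  load_wrapper_profile "$1" || return 1
  printf 'OPENAI_BASE_URL=http://127.0.0.1:%s/v1\n' "$PROXY_PORT"
  printf 'OPENAI_MODEL=%s\n' "$MODEL_ID"
}
```

A launcher could then start the harness with something like `env $(harness_env qwen) /opt/homebrew/bin/qwen`, which keeps the harness binary unaware of which model or runtime sits behind the proxy.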

§ 5 — The proxy layer: why it’s not optional

The proxy is the least intuitive layer. Most readers would try to collapse it into the harness config or the server profile. It exists as a separate layer because it does two things neither neighbor can do cleanly.

API translation. Qwen Code expects an OpenAI-compatible surface — /v1/chat/completions, standard request/response schema. The MLX server speaks its own dialect. The proxy bridges them. Without it, you’d need to patch the harness or patch the server, coupling two layers that should have no knowledge of each other.

Model-specific injection. {"enable_thinking": false} is a Qwen3-specific flag. It controls chain-of-thought behavior at the API level. If you put it in the harness config, every model the harness talks to gets it, including models where it doesn’t apply. If you put it in the server profile, the server has to know about harness-level semantics. The proxy is the right layer: it knows which model is being served, and it’s downstream of the harness, so the harness stays clean.

The general principle: the proxy is where model-specific weirdness goes to die. Everything above the proxy sees a clean OpenAI-compatible surface. Everything below it sees standard server requests. The proxy absorbs the impedance mismatch between them.
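
The proxy script itself is Python and is not reproduced in the article, so the sketch below illustrates only the underlying idea: model-specific request fields live in one lookup keyed by profile, and everything else gets nothing. The function name `extra_body_for_profile` and the `qwen*` pattern are assumptions made for this example.

```shell
# Hypothetical per-profile table of extra request fields a proxy could inject.
extra_body_for_profile() {
  case "$1" in
    qwen*) echo '{"enable_thinking": false}' ;;  # Qwen3-specific field from the text
    *)     echo '{}' ;;                          # other models: inject nothing
  esac
}
```

Keeping this table in one place means a non-Qwen model never receives a field it does not understand, and the harness config never needs to mention it.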

§ 6 — Extending the system: adding a new model

Adding qwen-code-qwen35b — a different Qwen3 variant — requires touching exactly three files.

Step 1: Add a server profile in mlx-servers.zsh

    qwen35b)
      MODEL_PATH="$HOME/.lmstudio/models/Vendor/Qwen3.5-35B-MLX"
      SERVER_PORT=8082
      MLX_FLAGS="--max-tokens 8192 --prefill-step-size 2048 --prompt-cache-bytes 40GB --pipeline"
      ;;

Step 2: Add a wrapper profile in qwen-code-wrappers.zsh

    qwen35b)
      KIND=mlx
      SERVER_PROFILE=qwen35b
      PROXY_PORT=9212
      CONTEXT_WINDOW=131072
      ;;

Step 3: Create the wrapper script at ~/.local/bin/qwen-code-qwen35b

    #!/bin/bash
    exec "$LOCAL_LLM_DIR/runtime/bin/qwen-code-local" qwen35b "$@"

No changes to the proxy script. No changes to the launcher. No changes to the harness config. The three new stanzas slot into an existing structure that already knows how to route them.

Compare that to the unstructured version: copying flags from shell history, editing three separate harness config files, setting a new OPENAI_BASE_URL manually, hoping you remembered which proxy port was already in use.
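
Once ports are written down in config files, the "which port was already in use" question can be checked mechanically. Here is a minimal sketch, assuming you can list your profiles as `name port` pairs, one per line; the function name `duplicate_ports` and the entry names in the usage comment are hypothetical, while the port numbers are the ones used in this article.

```shell
# Read "name port" pairs on stdin and print any port assigned more than once.
duplicate_ports() {
  awk 'NF >= 2 { print $2 }' | sort | uniq -d
}

# Usage sketch (entry names are illustrative):
#   printf 'qwen27b 8080\nqwen35b 8082\nqwen-proxy 9211\nqwen35b-proxy 9212\n' | duplicate_ports
# prints nothing because every port is unique.
```

This checks only what is declared in config, not what is actually listening on the machine, so it catches copy-paste collisions rather than stray processes.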

§ 7 — The docs layer: for you and for agents

docs/local-coding-model-stack.md has two readers.

The first is you, two weeks from now, trying to bring back a model you haven’t touched since a different project. The doc tells you what’s running, what port, what flags, and why those flags were chosen.

The second reader is an agent. When you ask Claude Code or Codex to “start my coding model,” a well-written doc gives the agent everything it needs — read the config, invoke the right launcher, verify the server is running — without you specifying the details. The doc is machine-readable by design: structured, concrete, no prose padding.

This is the shift worth naming. Documentation in most projects is human-facing and aspirational — written after the fact, describing what things should do. In this system, docs/ is a first-class operational layer. It describes what things actually do, in terms specific enough for an automated process to act on. The contract in docs/ is what makes the system agent-operable.

§ 8 — The discipline that makes experimentation sustainable

The system isn’t organized for its own sake.

When adding a new model takes five minutes instead of thirty, you try more models. When switching runtimes doesn’t break your harness, you switch runtimes. When the flags that made a session work are written down, you reproduce that session exactly. The structure enables velocity — it doesn’t slow it down.

The test for any new addition is simple: does it require touching only the right layer? A new model should only add a server profile and a wrapper profile. A new harness should only add a wrapper config. A new runtime should only add a launcher and a server profile template. If you find yourself editing a stable layer to add something new, the abstraction is wrong — fix the abstraction, not the file.

Every new model should be a curiosity satisfied, not a round of system maintenance.
