@tom_doerr: Runs 35B models on 16GB RAM Macs https://github.com/walter-grace/mac-code…

X AI KOLs Timeline 05/11/26, 10:16 AM Tools

open-source llm-inference mac memory-optimization flash-streaming qwen

Summary

A tool that enables running large language models like Qwen3.5-35B on 16GB Macs by streaming model weights from SSD, achieving up to 30 tok/s with an optimal configuration.

Runs 35B models on 16GB RAM Macs https://t.co/LXrWOHZthu https://t.co/OXxdMFStfo

Original Article

View Cached Full Text

Cached at: 05/11/26, 06:52 PM

Runs 35B models on 16GB RAM Macs

https://t.co/LXrWOHZthu https://t.co/OXxdMFStfo

walter-grace/mac-code

Source: https://github.com/walter-grace/mac-code

mac code

Run models that don’t fit in RAM on your Mac. $0/month.

Can I run this on my Mac?

Your Mac	RAM	What you can run	Speed
Any Mac	8 GB	Qwen3.5-9B (Q4_K_M, 5.3 GB), 4K context	16-20 tok/s
Any Mac	16 GB	Qwen3.5-9B (Q4_K_M, 5.3 GB), 64K context	16-20 tok/s
Mac mini M4	16 GB	Qwen3.5-35B-A3B (IQ2_M, 10.6 GB)	30 tok/s
Mac mini M4	16 GB	Qwen3-30B-A3B Q4 (17.2 GB) via Expert Sniper	4.3 tok/s
Mac mini M4	16 GB	Qwen3.5-35B-A3B Q4 (19.5 GB) via Expert Sniper	5.4 tok/s
Mac mini M4	16 GB	Qwen3.5-35B-A3B Q4_K_M (22 GB) via Flash Streaming	1.54 tok/s
Mac mini M4	16 GB	Qwen3.5-27B (16.1 GB) via Flash Streaming	0.18 tok/s
Mac mini M4 Pro	48 GB	35B at full Q4 in RAM	30+ tok/s

“I wanted to run the Qwen 27B on my M2 16GB but failed. That’s not possible, right?”

It is possible. We stream FFN weights from SSD — only 5.5 GB stays in RAM. The output is coherent, full 4-bit quality. It’s slow (0.18 tok/s on a Mac mini M4) but the method works on any 16 GB Apple Silicon Mac. No 2-bit compression, no mmap thrashing, no swap death. See how it works.

Quick Start

35B Agent (recommended — 30 tok/s on 16 GB)

The fastest option. Uses llama.cpp with a 2-bit quantization (IQ2_M) that fits entirely in RAM.

brew install llama.cpp
pip3 install rich ddgs --break-system-packages

# Download model (10.6 GB)
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('unsloth/Qwen3.5-35B-A3B-GGUF',
    'Qwen3.5-35B-A3B-UD-IQ2_M.gguf', local_dir='$HOME/models/')
"

# Start server + agent
llama-server \
    --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
    --port 8000 --host 127.0.0.1 \
    --flash-attn on --ctx-size 12288 \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --n-gpu-layers 99 --reasoning off -np 1 -t 4

python3 agent.py

9B with 64K Context

python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('unsloth/Qwen3.5-9B-GGUF',
    'Qwen3.5-9B-Q4_K_M.gguf', local_dir='$HOME/models/')
"

llama-server \
    --model ~/models/Qwen3.5-9B-Q4_K_M.gguf \
    --port 8000 --host 127.0.0.1 \
    --flash-attn on --ctx-size 65536 \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --n-gpu-layers 99 --reasoning off -t 4

python3 agent.py

Flash Streaming

Run models that genuinely don’t fit in RAM at full quality. No 2-bit compression. No mmap thrashing.

Measured Results

Every number below was measured on a 16 GB Mac mini M4. Nothing estimated.

Model	Total Size	RAM Used	Speed	Quality
Qwen3-32B (dense)	18.4 GB	4.5 GB	0.15 tok/s	Full 4-bit
Qwen3.5-27B (dense hybrid)	16.1 GB	5.5 GB	0.18 tok/s	Full 4-bit
Qwen3-30B-A3B (MoE)	17.2 GB	8.7 GB	4.3 tok/s	Full 4-bit
Qwen3.5-35B-A3B (MoE)	19.5 GB	8.7 GB	5.4 tok/s	Full 4-bit

All numbers measured on M4 Mac Mini 16 GB across 5 varied prompts from cold start. Quality verified (Canberra, 8.3066, correct Python, etc). Both models: cache-aware routing bias (1.0) + co-activation prefetch + right-sized LRU cache. bias=1.0 is the universal safe maximum — quality degrades at 1.5 on both models.

Disk Requirements

Model	Processed size	Free disk needed
30B / Coder-30B	17 GB	~20 GB
35B	19 GB	~25 GB
80B	43 GB	~50 GB
122B	~65 GB	~70 GB
235B	~130 GB	~140 GB

For larger models, use an external NVMe drive:

mlx-sniper download qwen3-next-80b -o /Volumes/MySSD/qwen3-next-80b
mlx-sniper run /Volumes/MySSD/qwen3-next-80b -p "hello" -v

How Flash Streaming Works

Split the model by access pattern:

Pinned in RAM (4-6 GB): Attention weights, embeddings, norms, KV cache. Loaded once, stays forever.

Streamed from SSD per token: FFN weights (the bulk of the model). Loaded layer-by-layer, used for one matmul, discarded. Memory never grows.

For each token:
  For each layer:
    1. Run attention (from RAM — instant)
    2. Load FFN weights from SSD (~165-221 MB)
    3. Run FFN matmul on GPU
    4. Discard FFN weights — memory stays flat

For MoE models, step 2 loads only the 8 active experts (~14 MB), not all 256. That’s why MoE is 10x faster.

Run the 35B MoE Agent (1.54 tok/s, 1.42 GB RAM)

Interactive agent with web search, shell commands, and chain-of-thought. The 22 GB model on a 16 GB Mac.

Requires pre-built stream files — see research/flash-streaming/ for the split/rebuild tools.

cd research/flash-streaming
python3 moe_agent.py

Run the 27B Dense on 16 GB (0.18 tok/s, 5.5 GB RAM)

cd research/flash-streaming
pip3 install mlx-lm transformers --break-system-packages

# One-time: download model (~16 GB) and split for streaming
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('mlx-community/Qwen3.5-27B-4bit', local_dir='$HOME/models/qwen35-27b-mlx-4bit')
"
python3 split_dense_27b.py

# Run
python3 flash_stream_27b.py

Batched Union-of-Experts (5.1 tok/s)

A research prototype that verifies 8 tokens in one forward pass. Instead of loading experts for each token separately, it computes the set union of active experts across all 8 tokens (~27 unique experts per layer instead of 64) and loads them once.

cd research/flash-streaming
python3 batched_moe.py

This is verification speed (checking draft tokens), not generation speed. Useful for speculative decoding.

Agent Commands

Command	Action
`/agent`	Agent mode (default) — search, shell, chat
`/raw`	Direct streaming, no tools
`/search <q>`	Quick web search
`/stats`	Session statistics
`/clear`	Reset conversation
`/quit`	Exit

Files

Top-level (daily use)

File	What it does
`agent.py`	Production agent — routes to search/shell/chat via llama.cpp
`chat.py`	Simple streaming chat with llama.cpp
`dashboard.py`	Real-time monitoring dashboard for llama.cpp
`setup.sh`	One-command install (llama.cpp + model download + config)
`config.example.json`	Example configuration
`web/`	Web UI (server.py + index.html)

MLX backend (`mlx/`)

File	What it does
`mlx_engine.py`	MLX inference server with 64K context and KV cache persistence
`kv_cache.py`	KV cache save/load for session persistence
`paged_inference.py`	Paged attention experiment
`turboquant.py`	TurboQuant quantization experiments
`benchmark.py`	Benchmarking tools

Flash Streaming research (`research/flash-streaming/`)

The research journey: we built, measured, and iterated. Each file represents a step.

File	What it does	Key discovery
`flash_stream.py`	v1: mmap-based streaming (0.12 tok/s)	Split-model architecture works
`flash_stream_v2.py`	v2: F_NOCACHE direct I/O (0.15 tok/s)	27% faster than mmap
`flash_stream_27b.py`	27B dense streaming (0.18 tok/s)	Method works on dense + hybrid SSM models
`flash_moe.py`	MoE expert-level streaming engine	Only load active experts from SSD
`moe_agent.py`	Working 35B agent (1.54 tok/s, 1.42 GB)	Coherent 22 GB model on 16 GB Mac
`batched_moe.py`	Batched Union-of-Experts (5.1 tok/s)	~27 unique experts/layer, not 64
`expert_io.py`	F_NOCACHE + pread expert reader (8 threads)	Saturate NVMe queue depth
`direct_io.py`	F_NOCACHE + pread for dense FFN layers	Bypass macOS Unified Buffer Cache
`split_mlx_model.py`	Split 35B MoE into pinned + per-layer experts	16KB alignment for DART IOMMU
`split_dense_27b.py`	Split 27B dense into pinned + per-layer FFN	Same technique, different architecture
`convert_split.py`	GGUF → split safetensors conversion	GGUF is column-major
`convert_aligned.py`	Safetensors → 16KB-aligned binary	Required for F_NOCACHE direct I/O
`dequant_gguf.py`	Custom Q4_K/Q6_K dequantization (numpy)	MLX can’t read GGUF Q4_K blocks
`rebuild_pinned.py`	Rebuild pinned weights from MLX golden model	Fix SSM weight dtype issues
`flash_agent.py`	32B dense streaming agent (early version)	Proof of concept
`flash_stream_batched.py`	Batched eval experiment	Proved eval sync isn’t the bottleneck
`README.md`	Detailed research writeup with all measurements	Full methodology and results

Key Discoveries

These are things we learned the hard way. Each links to the file where it was discovered/fixed.

GGUF is column-major — flat.reshape(ne[1], ne[0]), not .reshape(ne[0], ne[1]).T. The wrong reshape gives correct shapes but garbage output. (dequant_gguf.py, convert_split.py)
MLX 4-bit is 15% larger than expected — scales + biases at group_size=64 add 0.031 bytes/param. A 32B model is 18.4 GB, not 16 GB. This is why the model doesn’t fit in 16 GB RAM even at 4-bit. (research/flash-streaming/README.md)
nn.quantize() silently skips MoE experts — SwitchLinear is not a subclass of nn.Linear. You must pass a class_predicate that explicitly includes it. Without this, experts run in float16 and produce garbage. (moe_agent.py)
gather_qmm eliminates accumulator divergence — 8 separate quantized_matmul calls compound rounding errors across 40 layers. One batched gather_qmm call matches the reference model exactly. (batched_moe.py, flash_moe.py)
F_NOCACHE is 27% faster than mmap — macOS Unified Buffer Cache adds overhead for sequential streaming workloads. fcntl(F_NOCACHE) + os.pread() with 16KB alignment bypasses it entirely. (direct_io.py, expert_io.py)
setattr on nn.Module leaks memory — Injecting FFN weights into the model tree via setattr prevents garbage collection. Memory grew 3.6 GB per 16 layers. Fix: use mx.quantized_matmul directly on loaded arrays, never touch the model tree. (flash_stream.py)
Batching layers doesn’t help — We tested 8-layer batches (16 evals vs 128). Zero speedup. The bottleneck is SSD I/O, not GPU sync overhead. (flash_stream_batched.py)

Architecture

┌──────────────────────────────────────────────┐
│  agent.py — LLM-as-Router                    │
│  search / shell / chat                       │
├──────────┬───────────────────────────────────┤
│ llama.cpp│  MLX backend                      │
│ (fast)   │  + KV cache save/load             │
│          │  + Flash Streaming (out-of-core)   │
│          │  + MoE Expert Sniper (SSD)        │
├──────────┴───────────────────────────────────┤
│  Apple Silicon — Unified Memory + NVMe SSD   │
└──────────────────────────────────────────────┘

Credits

Qwen3.5 — the models
llama.cpp — inference engine
MLX — Apple’s ML framework
Unsloth — GGUF quantizations
mlx-community — pre-converted MLX models

License

MIT

@tom_doerr: Runs 35B models on 16GB RAM Macs https://github.com/walter-grace/mac-code…

walter-grace/mac-code

mac code

Can I run this on my Mac?

Quick Start

35B Agent (recommended — 30 tok/s on 16 GB)

9B with 64K Context

Flash Streaming

Measured Results

Disk Requirements

How Flash Streaming Works

Run the 35B MoE Agent (1.54 tok/s, 1.42 GB RAM)

Run the 27B Dense on 16 GB (0.18 tok/s, 5.5 GB RAM)

Batched Union-of-Experts (5.1 tok/s)

Agent Commands

Files

Top-level (daily use)

MLX backend (`mlx/`)

Flash Streaming research (`research/flash-streaming/`)

Key Discoveries

Architecture

Credits

License

Similar Articles

2x 512gb ram M3 Ultra mac studios

@remilouf: Following @julien_c’s tweet I bought a MacBook Pro with 128B unified memory, and started running Qwen3.6 as my daily dr…

Running local models on an M4 with 24GB memory

@tom_doerr: Runs 70B LLMs on single 4GB GPU https://github.com/lyogavin/airllm

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Submit Feedback

Similar Articles

2x 512gb ram M3 Ultra mac studios

@remilouf: Following @julien_c’s tweet I bought a MacBook Pro with 128B unified memory, and started running Qwen3.6 as my daily dr…

Running local models on an M4 with 24GB memory

@tom_doerr: Runs 70B LLMs on single 4GB GPU https://github.com/lyogavin/airllm

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

walter-grace/mac-code

mac code

Can I run this on my Mac?

Quick Start

35B Agent (recommended — 30 tok/s on 16 GB)

9B with 64K Context

Flash Streaming

Measured Results

Disk Requirements

How Flash Streaming Works

Run the 35B MoE Agent (1.54 tok/s, 1.42 GB RAM)

Run the 27B Dense on 16 GB (0.18 tok/s, 5.5 GB RAM)

Batched Union-of-Experts (5.1 tok/s)

Agent Commands

Files

Top-level (daily use)

MLX backend (mlx/)

Flash Streaming research (research/flash-streaming/)

Key Discoveries

Architecture

Credits

License

Similar Articles

2x 512gb ram M3 Ultra mac studios

@remilouf: Following @julien_c’s tweet I bought a MacBook Pro with 128B unified memory, and started running Qwen3.6 as my daily dr…

Running local models on an M4 with 24GB memory

@tom_doerr: Runs 70B LLMs on single 4GB GPU https://github.com/lyogavin/airllm

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Submit Feedback

MLX backend (`mlx/`)

Flash Streaming research (`research/flash-streaming/`)