@tom_doerr: Runs 35B models on 16GB RAM Macs https://github.com/walter-grace/mac-code…

X AI KOLs Timeline Tools

Summary

A tool that enables running large language models like Qwen3.5-35B on 16GB Macs by streaming model weights from SSD, achieving up to 30 tok/s with an optimal configuration.

Runs 35B models on 16GB RAM Macs https://t.co/LXrWOHZthu https://t.co/OXxdMFStfo
Original Article
View Cached Full Text

Cached at: 05/11/26, 06:52 PM

Runs 35B models on 16GB RAM Macs

https://t.co/LXrWOHZthu https://t.co/OXxdMFStfo


walter-grace/mac-code

Source: https://github.com/walter-grace/mac-code

mac code

Run models that don’t fit in RAM on your Mac. $0/month.

Can I run this on my Mac?

Your MacRAMWhat you can runSpeed
Any Mac8 GBQwen3.5-9B (Q4_K_M, 5.3 GB), 4K context16-20 tok/s
Any Mac16 GBQwen3.5-9B (Q4_K_M, 5.3 GB), 64K context16-20 tok/s
Mac mini M416 GBQwen3.5-35B-A3B (IQ2_M, 10.6 GB)30 tok/s
Mac mini M416 GBQwen3-30B-A3B Q4 (17.2 GB) via Expert Sniper4.3 tok/s
Mac mini M416 GBQwen3.5-35B-A3B Q4 (19.5 GB) via Expert Sniper5.4 tok/s
Mac mini M416 GBQwen3.5-35B-A3B Q4_K_M (22 GB) via Flash Streaming1.54 tok/s
Mac mini M416 GBQwen3.5-27B (16.1 GB) via Flash Streaming0.18 tok/s
Mac mini M4 Pro48 GB35B at full Q4 in RAM30+ tok/s

“I wanted to run the Qwen 27B on my M2 16GB but failed. That’s not possible, right?”

It is possible. We stream FFN weights from SSD — only 5.5 GB stays in RAM. The output is coherent, full 4-bit quality. It’s slow (0.18 tok/s on a Mac mini M4) but the method works on any 16 GB Apple Silicon Mac. No 2-bit compression, no mmap thrashing, no swap death. See how it works.


Quick Start

35B Agent (recommended — 30 tok/s on 16 GB)

The fastest option. Uses llama.cpp with a 2-bit quantization (IQ2_M) that fits entirely in RAM.

brew install llama.cpp
pip3 install rich ddgs --break-system-packages

# Download model (10.6 GB)
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('unsloth/Qwen3.5-35B-A3B-GGUF',
    'Qwen3.5-35B-A3B-UD-IQ2_M.gguf', local_dir='$HOME/models/')
"

# Start server + agent
llama-server \
    --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
    --port 8000 --host 127.0.0.1 \
    --flash-attn on --ctx-size 12288 \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --n-gpu-layers 99 --reasoning off -np 1 -t 4

python3 agent.py

9B with 64K Context

python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('unsloth/Qwen3.5-9B-GGUF',
    'Qwen3.5-9B-Q4_K_M.gguf', local_dir='$HOME/models/')
"

llama-server \
    --model ~/models/Qwen3.5-9B-Q4_K_M.gguf \
    --port 8000 --host 127.0.0.1 \
    --flash-attn on --ctx-size 65536 \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --n-gpu-layers 99 --reasoning off -t 4

python3 agent.py

Flash Streaming

Run models that genuinely don’t fit in RAM at full quality. No 2-bit compression. No mmap thrashing.

Measured Results

Every number below was measured on a 16 GB Mac mini M4. Nothing estimated.

ModelTotal SizeRAM UsedSpeedQuality
Qwen3-32B (dense)18.4 GB4.5 GB0.15 tok/sFull 4-bit
Qwen3.5-27B (dense hybrid)16.1 GB5.5 GB0.18 tok/sFull 4-bit
Qwen3-30B-A3B (MoE)17.2 GB8.7 GB4.3 tok/sFull 4-bit
Qwen3.5-35B-A3B (MoE)19.5 GB8.7 GB5.4 tok/sFull 4-bit

All numbers measured on M4 Mac Mini 16 GB across 5 varied prompts from cold start. Quality verified (Canberra, 8.3066, correct Python, etc). Both models: cache-aware routing bias (1.0) + co-activation prefetch + right-sized LRU cache. bias=1.0 is the universal safe maximum — quality degrades at 1.5 on both models.

Disk Requirements

ModelProcessed sizeFree disk needed
30B / Coder-30B17 GB~20 GB
35B19 GB~25 GB
80B43 GB~50 GB
122B~65 GB~70 GB
235B~130 GB~140 GB

For larger models, use an external NVMe drive:

mlx-sniper download qwen3-next-80b -o /Volumes/MySSD/qwen3-next-80b
mlx-sniper run /Volumes/MySSD/qwen3-next-80b -p "hello" -v

How Flash Streaming Works

Split the model by access pattern:

Pinned in RAM (4-6 GB): Attention weights, embeddings, norms, KV cache. Loaded once, stays forever.

Streamed from SSD per token: FFN weights (the bulk of the model). Loaded layer-by-layer, used for one matmul, discarded. Memory never grows.

For each token:
  For each layer:
    1. Run attention (from RAM — instant)
    2. Load FFN weights from SSD (~165-221 MB)
    3. Run FFN matmul on GPU
    4. Discard FFN weights — memory stays flat

For MoE models, step 2 loads only the 8 active experts (~14 MB), not all 256. That’s why MoE is 10x faster.

Run the 35B MoE Agent (1.54 tok/s, 1.42 GB RAM)

Interactive agent with web search, shell commands, and chain-of-thought. The 22 GB model on a 16 GB Mac.

Requires pre-built stream files — see research/flash-streaming/ for the split/rebuild tools.

cd research/flash-streaming
python3 moe_agent.py

Run the 27B Dense on 16 GB (0.18 tok/s, 5.5 GB RAM)

cd research/flash-streaming
pip3 install mlx-lm transformers --break-system-packages

# One-time: download model (~16 GB) and split for streaming
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('mlx-community/Qwen3.5-27B-4bit', local_dir='$HOME/models/qwen35-27b-mlx-4bit')
"
python3 split_dense_27b.py

# Run
python3 flash_stream_27b.py

Batched Union-of-Experts (5.1 tok/s)

A research prototype that verifies 8 tokens in one forward pass. Instead of loading experts for each token separately, it computes the set union of active experts across all 8 tokens (~27 unique experts per layer instead of 64) and loads them once.

cd research/flash-streaming
python3 batched_moe.py

This is verification speed (checking draft tokens), not generation speed. Useful for speculative decoding.


Agent Commands

CommandAction
/agentAgent mode (default) — search, shell, chat
/rawDirect streaming, no tools
/search <q>Quick web search
/statsSession statistics
/clearReset conversation
/quitExit

Files

Top-level (daily use)

FileWhat it does
agent.pyProduction agent — routes to search/shell/chat via llama.cpp
chat.pySimple streaming chat with llama.cpp
dashboard.pyReal-time monitoring dashboard for llama.cpp
setup.shOne-command install (llama.cpp + model download + config)
config.example.jsonExample configuration
web/Web UI (server.py + index.html)

MLX backend (mlx/)

FileWhat it does
mlx_engine.pyMLX inference server with 64K context and KV cache persistence
kv_cache.pyKV cache save/load for session persistence
paged_inference.pyPaged attention experiment
turboquant.pyTurboQuant quantization experiments
benchmark.pyBenchmarking tools

Flash Streaming research (research/flash-streaming/)

The research journey: we built, measured, and iterated. Each file represents a step.

FileWhat it doesKey discovery
flash_stream.pyv1: mmap-based streaming (0.12 tok/s)Split-model architecture works
flash_stream_v2.pyv2: F_NOCACHE direct I/O (0.15 tok/s)27% faster than mmap
flash_stream_27b.py27B dense streaming (0.18 tok/s)Method works on dense + hybrid SSM models
flash_moe.pyMoE expert-level streaming engineOnly load active experts from SSD
moe_agent.pyWorking 35B agent (1.54 tok/s, 1.42 GB)Coherent 22 GB model on 16 GB Mac
batched_moe.pyBatched Union-of-Experts (5.1 tok/s)~27 unique experts/layer, not 64
expert_io.pyF_NOCACHE + pread expert reader (8 threads)Saturate NVMe queue depth
direct_io.pyF_NOCACHE + pread for dense FFN layersBypass macOS Unified Buffer Cache
split_mlx_model.pySplit 35B MoE into pinned + per-layer experts16KB alignment for DART IOMMU
split_dense_27b.pySplit 27B dense into pinned + per-layer FFNSame technique, different architecture
convert_split.pyGGUF → split safetensors conversionGGUF is column-major
convert_aligned.pySafetensors → 16KB-aligned binaryRequired for F_NOCACHE direct I/O
dequant_gguf.pyCustom Q4_K/Q6_K dequantization (numpy)MLX can’t read GGUF Q4_K blocks
rebuild_pinned.pyRebuild pinned weights from MLX golden modelFix SSM weight dtype issues
flash_agent.py32B dense streaming agent (early version)Proof of concept
flash_stream_batched.pyBatched eval experimentProved eval sync isn’t the bottleneck
README.mdDetailed research writeup with all measurementsFull methodology and results

Key Discoveries

These are things we learned the hard way. Each links to the file where it was discovered/fixed.

  1. GGUF is column-majorflat.reshape(ne[1], ne[0]), not .reshape(ne[0], ne[1]).T. The wrong reshape gives correct shapes but garbage output. (dequant_gguf.py, convert_split.py)

  2. MLX 4-bit is 15% larger than expected — scales + biases at group_size=64 add 0.031 bytes/param. A 32B model is 18.4 GB, not 16 GB. This is why the model doesn’t fit in 16 GB RAM even at 4-bit. (research/flash-streaming/README.md)

  3. nn.quantize() silently skips MoE expertsSwitchLinear is not a subclass of nn.Linear. You must pass a class_predicate that explicitly includes it. Without this, experts run in float16 and produce garbage. (moe_agent.py)

  4. gather_qmm eliminates accumulator divergence — 8 separate quantized_matmul calls compound rounding errors across 40 layers. One batched gather_qmm call matches the reference model exactly. (batched_moe.py, flash_moe.py)

  5. F_NOCACHE is 27% faster than mmap — macOS Unified Buffer Cache adds overhead for sequential streaming workloads. fcntl(F_NOCACHE) + os.pread() with 16KB alignment bypasses it entirely. (direct_io.py, expert_io.py)

  6. setattr on nn.Module leaks memory — Injecting FFN weights into the model tree via setattr prevents garbage collection. Memory grew 3.6 GB per 16 layers. Fix: use mx.quantized_matmul directly on loaded arrays, never touch the model tree. (flash_stream.py)

  7. Batching layers doesn’t help — We tested 8-layer batches (16 evals vs 128). Zero speedup. The bottleneck is SSD I/O, not GPU sync overhead. (flash_stream_batched.py)


Architecture

┌──────────────────────────────────────────────┐
│  agent.py — LLM-as-Router                    │
│  search / shell / chat                       │
├──────────┬───────────────────────────────────┤
│ llama.cpp│  MLX backend                      │
│ (fast)   │  + KV cache save/load             │
│          │  + Flash Streaming (out-of-core)   │
│          │  + MoE Expert Sniper (SSD)        │
├──────────┴───────────────────────────────────┤
│  Apple Silicon — Unified Memory + NVMe SSD   │
└──────────────────────────────────────────────┘

Credits

License

MIT

Similar Articles

2x 512gb ram M3 Ultra mac studios

Reddit r/LocalLLaMA

A user shares their $25k hardware setup of two 512GB RAM M3 Ultra Mac Studios for running large language models locally, having tested DeepSeek V3 Q8 and GLM 5.1 Q4 via the exo distributed inference backend, while awaiting Kimi 2.6 MLX optimization.

Running local models on an M4 with 24GB memory

Hacker News Top

A guide on running local AI models like Qwen 3.5-9B on an M4 MacBook with 24GB RAM using tools like LM Studio, Ollama, and pi, including specific configuration tips for optimal performance.

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Reddit r/LocalLLaMA

The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.