A detailed account of running the Qwen3.6-35B-A3B MoE model on an 8GB laptop GPU, covering effective optimizations like --no-mmap and VRAM headroom, unexpected findings where speculative decoding improved speed by 26% contrary to benchmarks, and pitfalls with Windows and CPU bottlenecks.
TL;DR: I spent a long session tuning a 35B MoE on a tiny 8GB laptop GPU. Three things mattered a lot (--no-mmap, VRAM headroom, closing CPU-hungry apps). Several "obvious" optimizations did nothing because of this model's hybrid architecture (TurboQuant, Flash Attention, even i-quants made it worse). And speculative decoding gave me +26%, which contradicts the community benchmarks that found it net-negative. Looking for discussion + ideas. **The setup** \- GPU: RTX 4060 Laptop, 8GB VRAM \- CPU/RAM: i7-13620H, 32GB DDR5-5600 dual-channel \- OS: Windows 11 (llama.cpp b9484, CUDA build) \- Model: Qwen3.6-35B-A3B (MoE, 35B total / \~3B active), Q4\_K\_M (\~20GB) \- Key detail: this model is a hybrid — only 10 attention layers + 40 Gated Delta Net (recurrent) layers. That one fact explains most of my results. **Final config (the "default" profile)** \-ngl 999 --n-cpu-moe 34 -c 65536 --parallel 1 --no-mmap \--cache-type-k q4\_0 --cache-type-v q4\_0 \--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 \-md Qwen3.5-0.8B-Q4\_K\_M.gguf -ngld 99 --reasoning off All dense layers (attention/router/norms) on GPU, experts on CPU. \~39 tok/s gen on a good day, \~5.4GB VRAM, \~2.5GB headroom. **What actually helped** 1. --no-mmap is a big deal when experts are offloaded to CPU. With mmap, every token caused page faults on the expert tensors. Preloading them into RAM jumped generation speed dramatically (I measured \~11 → \~43 tok/s on an idle system). llama.cpp even prints a hint suggesting it when CPU tensor overrides are used. 2. VRAM headroom is critical on Windows. The NVIDIA driver's "System Memory Fallback" spills to system RAM instead of OOMing when VRAM is nearly full. With only \~740MB free, speed collapsed to \~7 tok/s. Keeping ≥1.5GB free fixed it. Counterintuitively, putting fewer experts on the GPU (higher --n-cpu-moe) was sometimes faster because it avoided the fallback. 3. The real bottleneck is the CPU, not the GPU. Experts run on CPU. Closing Discord + heavy browser tabs took me from \~6 to \~18 tok/s. GPU was at 59°C, never thermally throttling. **What I tested and rejected** 1. TurboQuant KV quant (turbo3/turbo4, via a fork): works, loads fine, but gave \~0 benefit. Reason: this model's KV cache for 64K context is only \~295 MiB (10 attention layers!). Compressing 295MB is pointless when 7GB of experts fill the VRAM. 2. Flash Attention: no help (same reason — almost no attention layers to accelerate). Actually slightly slower. 3. IQ4\_XS instead of Q4\_K\_M: \~35% slower (4.1 vs 6.3 tok/s same conditions). i-quants have expensive lookup-table decode that's slow on CPU; K-quants have optimized CPU kernels (REPACK=1). For CPU-offloaded experts, K-quant > i-quant even though the file is smaller. 4. \--mlock: causes CUDA error: out of memory when combined with --no-mmap (pinned host allocation), and needs a special privilege on Windows anyway. **The surprising one: speculative decoding** Community benchmarks (incl. a dedicated RTX 3090 repo) found spec-decode net-negative on Qwen3.6-35B-A3B. On my setup it gave +26% (31 → 39 tok/s) using a vocab-matched Qwen3.5-0.8B draft. My theory: with experts on CPU, generation is CPU-bound, and validating N draft tokens in one batched forward pass amortizes the expert compute better than N single-token passes. On a full-GPU 3090 the base model is already fast per token, so the draft overhead dominates. Has anyone else seen spec-decode help specifically in the CPU-offloaded-experts regime? **Bonus Windows gotchas** 1. Smart App Control silently blocked the Open WebUI desktop app's unsigned DLLs (win32job.pyd). Moved Open WebUI into WSL2 instead. 2. From WSL the Windows-host server IP changes on reboot — fixed with WSL mirrored networking so localhost:8081 is stable. **Open questions for the group** 1. Anyone else seeing spec-decode win on CPU-offloaded MoE (vs net-negative on full-GPU)? 2. For hybrid attention/recurrent models (Gated Delta Net), KV-cache optimizations seem irrelevant — what does move the needle? 3. Best way to disable thinking AND use a draft together? --chat-template-kwargs enable\_thinking:false and --reasoning-budget 0 both throw "invalid argument" when a draft is loaded (applied to the draft's template too). Only --reasoning off works. 4. Any better draft model choice than Qwen3.5-0.8B for this target? Happy to share more numbers / configs. Roast my setup.
The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.
The author shares detailed tuning tips for running the Qwen3.6-35B-A3B MoE model on an 8GB RTX 3070 Ti with up to 262k context using llama.cpp, achieving 30+ tps, and notes a 25% speed boost when switching from Windows to Ubuntu Server.
User reports surprisingly usable coding performance from Qwen3-27B-UD-Q6_K_XL.gguf running locally on RTX 5090 at ~50 tok/s with 200K context, marking a significant leap in local model quality.
A user reports successfully running Qwen3 8B locally on an older RTX 1070 GPU, demonstrating that modern LLMs can run on decade-old hardware with decent performance.
A user shares impressive results running a quantized Qwen 3.6:35b-a3b model on a used RTX 3090, achieving 160 tokens per second output after fitting the model into VRAM, and demonstrates vision capabilities with a 75-second video processing time.