Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result

Reddit r/LocalLLaMA 06/05/26, 08:25 PM News

qwen moe laptop rtx-4060 speculative-decoding optimization cpu-offload

Summary

A detailed account of running the Qwen3.6-35B-A3B MoE model on an 8GB laptop GPU, covering effective optimizations like --no-mmap and VRAM headroom, unexpected findings where speculative decoding improved speed by 26% contrary to benchmarks, and pitfalls with Windows and CPU bottlenecks.

TL;DR: I spent a long session tuning a 35B MoE on a tiny 8GB laptop GPU. Three things mattered a lot (--no-mmap, VRAM headroom, closing CPU-hungry apps). Several "obvious" optimizations did nothing because of this model's hybrid architecture (TurboQuant, Flash Attention, even i-quants made it worse). And speculative decoding gave me +26%, which contradicts the community benchmarks that found it net-negative. Looking for discussion + ideas. **The setup** \- GPU: RTX 4060 Laptop, 8GB VRAM \- CPU/RAM: i7-13620H, 32GB DDR5-5600 dual-channel \- OS: Windows 11 (llama.cpp b9484, CUDA build) \- Model: Qwen3.6-35B-A3B (MoE, 35B total / \~3B active), Q4\_K\_M (\~20GB) \- Key detail: this model is a hybrid — only 10 attention layers + 40 Gated Delta Net (recurrent) layers. That one fact explains most of my results. **Final config (the "default" profile)** \-ngl 999 --n-cpu-moe 34 -c 65536 --parallel 1 --no-mmap \--cache-type-k q4\_0 --cache-type-v q4\_0 \--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 \-md Qwen3.5-0.8B-Q4\_K\_M.gguf -ngld 99 --reasoning off All dense layers (attention/router/norms) on GPU, experts on CPU. \~39 tok/s gen on a good day, \~5.4GB VRAM, \~2.5GB headroom. **What actually helped** 1. --no-mmap is a big deal when experts are offloaded to CPU. With mmap, every token caused page faults on the expert tensors. Preloading them into RAM jumped generation speed dramatically (I measured \~11 → \~43 tok/s on an idle system). llama.cpp even prints a hint suggesting it when CPU tensor overrides are used. 2. VRAM headroom is critical on Windows. The NVIDIA driver's "System Memory Fallback" spills to system RAM instead of OOMing when VRAM is nearly full. With only \~740MB free, speed collapsed to \~7 tok/s. Keeping ≥1.5GB free fixed it. Counterintuitively, putting fewer experts on the GPU (higher --n-cpu-moe) was sometimes faster because it avoided the fallback. 3. The real bottleneck is the CPU, not the GPU. Experts run on CPU. Closing Discord + heavy browser tabs took me from \~6 to \~18 tok/s. GPU was at 59°C, never thermally throttling. **What I tested and rejected** 1. TurboQuant KV quant (turbo3/turbo4, via a fork): works, loads fine, but gave \~0 benefit. Reason: this model's KV cache for 64K context is only \~295 MiB (10 attention layers!). Compressing 295MB is pointless when 7GB of experts fill the VRAM. 2. Flash Attention: no help (same reason — almost no attention layers to accelerate). Actually slightly slower. 3. IQ4\_XS instead of Q4\_K\_M: \~35% slower (4.1 vs 6.3 tok/s same conditions). i-quants have expensive lookup-table decode that's slow on CPU; K-quants have optimized CPU kernels (REPACK=1). For CPU-offloaded experts, K-quant > i-quant even though the file is smaller. 4. \--mlock: causes CUDA error: out of memory when combined with --no-mmap (pinned host allocation), and needs a special privilege on Windows anyway. **The surprising one: speculative decoding** Community benchmarks (incl. a dedicated RTX 3090 repo) found spec-decode net-negative on Qwen3.6-35B-A3B. On my setup it gave +26% (31 → 39 tok/s) using a vocab-matched Qwen3.5-0.8B draft. My theory: with experts on CPU, generation is CPU-bound, and validating N draft tokens in one batched forward pass amortizes the expert compute better than N single-token passes. On a full-GPU 3090 the base model is already fast per token, so the draft overhead dominates. Has anyone else seen spec-decode help specifically in the CPU-offloaded-experts regime? **Bonus Windows gotchas** 1. Smart App Control silently blocked the Open WebUI desktop app's unsigned DLLs (win32job.pyd). Moved Open WebUI into WSL2 instead. 2. From WSL the Windows-host server IP changes on reboot — fixed with WSL mirrored networking so localhost:8081 is stable. **Open questions for the group** 1. Anyone else seeing spec-decode win on CPU-offloaded MoE (vs net-negative on full-GPU)? 2. For hybrid attention/recurrent models (Gated Delta Net), KV-cache optimizations seem irrelevant — what does move the needle? 3. Best way to disable thinking AND use a draft together? --chat-template-kwargs enable\_thinking:false and --reasoning-budget 0 both throw "invalid argument" when a draft is loaded (applied to the draft's template too). Only --reasoning off works. 4. Any better draft model choice than Qwen3.5-0.8B for this target? Happy to share more numbers / configs. Roast my setup.

Original Article

Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result

Similar Articles

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Tried Qwen3.6-27B-UD-Q6_K_XL.gguf with CloudeCode, well I can't believe but it is usable

@tunguz: After seeing these tweets, I decided to try it out on my own old Ubuntu computer with RTX 1070 GPU (the one that I just…

Wow! Qwen 3.6:35b-a3b on a 3090... pretty amazing.

Submit Feedback

Similar Articles

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Tried Qwen3.6-27B-UD-Q6_K_XL.gguf with CloudeCode, well I can't believe but it is usable

@tunguz: After seeing these tweets, I decided to try it out on my own old Ubuntu computer with RTX 1070 GPU (the one that I just…

Wow! Qwen 3.6:35b-a3b on a 3090... pretty amazing.