@sudoingX: anyone running a 16gb card, stop scrolling. @pupposandro and @davideciffa got qwen 35b-a3b down to 13.3gb, measured on …
Summary
A technique called Luce Spark lets the Qwen 35B-A3B MoE model run in 13.3 GB of VRAM (measured on an RTX 3090), which puts it within reach of a 16GB GPU. It learns which experts are used most, keeps those on the GPU, and streams the rest from system RAM, reaching around 100 tok/s without the usual offload slowdown.
Cached at: 06/11/26, 05:43 PM
anyone running a 16gb card, stop scrolling. @pupposandro and @davideciffa got qwen 35b-a3b down to 13.3gb, measured on a 3090 gpu.
which means a model you literally could not load before now fits, running around 100 tok/s, near what you’d get with every expert resident on a 24gb card.
the clever part is the thing everyone gets wrong about moe. it only touches ~3b of its 35b params per token, routes to about 8 of 256 experts, but you still pay full vram to keep all of them around in case they’re next.
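To make the numbers above concrete, here is a minimal sketch of top-k routing using the figures from the post (8 of 256 experts, ~3B of 35B params). The `route` function and the random scores are illustrative assumptions, not the model's actual router.

```python
import random

NUM_EXPERTS = 256      # experts per MoE layer, per the post
EXPERTS_PER_TOKEN = 8  # experts the router picks per token, per the post

def route(scores, k):
    """Return indices of the k highest-scoring experts (plain top-k routing)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

rng = random.Random(0)
scores = [rng.random() for _ in range(NUM_EXPERTS)]  # stand-in router scores
chosen = route(scores, EXPERTS_PER_TOKEN)

used_fraction = len(chosen) / NUM_EXPERTS
active_param_fraction = 3 / 35  # ~3B active of 35B total, per the post

print(f"experts used this token: {len(chosen)} of {NUM_EXPERTS}")
print(f"fraction of experts used: {used_fraction:.3%}")
print(f"fraction of params active: {active_param_fraction:.1%}")
```

Only a small slice of the experts does any work for a given token, yet a naive setup keeps all 256 resident in VRAM.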
luce spark learns which experts your traffic actually hits, pins those hot, and streams the rest from ram hidden under the matmuls so there’s no speed cliff. one flag, and it tunes itself warmer every restart.
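The idea of pinning hot experts can be sketched as a hit counter plus a placement plan. This is a toy model under stated assumptions, not Luce Spark's code: `HotExpertPlanner`, `vram_slots`, and the synthetic traffic are all hypothetical, and the sketch does not model overlapping RAM transfers with matmuls or persisting counts across restarts.

```python
from collections import Counter

class HotExpertPlanner:
    """Toy hit-count-based expert placement (hypothetical, not Luce Spark's API)."""

    def __init__(self, vram_slots):
        self.vram_slots = vram_slots  # how many experts fit on the GPU (assumed)
        self.hits = Counter()         # expert id -> times routed to
        self.hot = set()              # experts currently pinned in VRAM

    def record(self, expert_ids):
        """Count the experts one token was routed to."""
        self.hits.update(expert_ids)

    def replan(self):
        """Pin the most frequently hit experts; everything else stays in RAM."""
        self.hot = {e for e, _ in self.hits.most_common(self.vram_slots)}
        return self.hot

    def location(self, expert_id):
        return "vram" if expert_id in self.hot else "ram"

# Synthetic skewed traffic: each token hits 7 experts from a popular pool
# of 32 and 1 expert from the long tail of the remaining 224.
traffic = []
for step in range(1000):
    popular = [(step * 7 + i) % 32 for i in range(7)]
    tail = 32 + (step % 224)
    traffic.append(popular + [tail])

planner = HotExpertPlanner(vram_slots=32)
for token_experts in traffic:
    planner.record(token_experts)
planner.replan()

lookups = [planner.location(e) for token in traffic for e in token]
hit_rate = lookups.count("vram") / len(lookups)
print(f"pinned experts: {len(planner.hot)}, VRAM hit rate: {hit_rate:.1%}")
```

With traffic this skewed, pinning 32 of 256 experts serves most lookups from VRAM; how skewed real traffic is would decide how well this works in practice.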
this is the kind of work that quietly drops the whole local inference tier down a card. don’t let it scroll past.
Similar Articles
Qwen 35B-A3B is very usable with 12GB of VRAM
A user benchmarks Qwen 35B-A3B (a 35B MoE model) on a 12GB RTX 3060, finding that 12GB VRAM is a practical sweet spot for running the model with 32k context, achieving ~47 t/s generation.
Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax
Luce Spark is an open-source tool that enables running 35B MoE models on 16GB GPUs by intelligently caching hot experts on the GPU while keeping the rest in system RAM, using a calibrated placement and bounded async cache to maintain high throughput without the usual offload speed cliff.
@DeepTechTR: Qwen 3.6 27B is incredibly fast with 16 GB VRAM! The impact of Pure Quant The era of the 27B model that runs seamlessly…
Qwen 3.6 27B runs fast on 16 GB VRAM thanks to 'Pure Quant' technology, achieving 40 tokens/s with MTP and supporting 64k contexts, enabling local AI on consumer GPUs like RTX 4060 Ti.
2 old RTX 2080 Ti with 22GB vram each Qwen3.6 27B at 38 token/s with f16 kv cache
A user shares their setup using two modded RTX 2080 Ti GPUs with 22GB VRAM each to run Qwen 3.6 27B at 38 tokens/s with llama.cpp, including tips on power limiting, tensor split mode, and KV cache settings.
Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result
A detailed account of running the Qwen3.6-35B-A3B MoE model on an 8GB laptop GPU, covering effective optimizations like --no-mmap and VRAM headroom, unexpected findings where speculative decoding improved speed by 26% contrary to benchmarks, and pitfalls with Windows and CPU bottlenecks.