@sudoingX: anyone running a 16gb card, stop scrolling. @pupposandro and @davideciffa got qwen 35b-a3b down to 13.3gb, measured on …


Summary

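A technique called Luce Spark lets the Qwen 35B-A3B MoE model run in 13.3 GB of VRAM (measured on an RTX 3090), small enough to fit a 16 GB GPU. It learns which experts are used most often, keeps those resident on the GPU, and streams the rest from system RAM, reaching around 100 tok/s, close to the speed of keeping every expert resident on a 24 GB card.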


Cached at: 06/11/26, 05:43 PM

anyone running a 16gb card, stop scrolling. @pupposandro and @davideciffa got qwen 35b-a3b down to 13.3gb, measured on a 3090 gpu.

which means a model you literally could not load before now fits, running around 100 tok/s, near what you’d get with every expert resident on a 24gb card.

the clever part is the thing everyone gets wrong about moe. it only touches ~3b of its 35b params per token, routes to about 8 of 256 experts, but you still pay full vram to keep all of them around in case they’re next.
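To make the routing numbers in the post concrete, here is a minimal sketch of top-k expert selection for a single token. The 256 experts and 8 active experts come from the post; the random scores stand in for real router outputs, and the helper names `route` and `touched_fraction` are hypothetical, not taken from Qwen or Luce Spark.

```python
import random

NUM_EXPERTS = 256  # experts per MoE layer, as stated in the post
TOP_K = 8          # experts routed to per token, as stated in the post


def route(scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts for one token."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def touched_fraction(num_experts=NUM_EXPERTS, k=TOP_K):
    """Fraction of a layer's experts that one token actually uses."""
    return k / num_experts


# Assumption: random numbers stand in for real router scores.
rng = random.Random(0)
scores = [rng.random() for _ in range(NUM_EXPERTS)]
chosen = route(scores)

print(len(chosen), touched_fraction())  # prints: 8 0.03125
```

Each token uses about 3% of the experts in a layer, yet a conventional setup keeps all 256 in VRAM. The post's "~3b of 35b" is a larger share than 3% because parameters outside the routed experts, such as attention layers and embeddings, are used for every token.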

luce spark learns which experts your traffic actually hits, pins those hot, and streams the rest from ram hidden under the matmuls so there’s no speed cliff. one flag, and it tunes itself warmer every restart.
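The "pin the hot experts" idea can be sketched as a frequency count over observed routing decisions. This is an illustration under stated assumptions, not Luce Spark's actual code: the function names, the toy trace, and the `budget` parameter are all hypothetical, and the asynchronous streaming from RAM that the post describes is not modeled here.

```python
from collections import Counter


def pick_hot_experts(routing_trace, budget):
    """Pick the `budget` most frequently hit expert ids from a routing trace.

    routing_trace: list of per-token lists of expert ids (hypothetical format).
    """
    counts = Counter(e for token in routing_trace for e in token)
    return {expert for expert, _ in counts.most_common(budget)}


def hit_rate(routing_trace, hot):
    """Share of expert lookups that would be served from the pinned set."""
    total = hits = 0
    for token in routing_trace:
        for expert in token:
            total += 1
            hits += expert in hot
    return hits / total if total else 0.0


# Toy trace: 4 tokens, 3 experts each (real traffic would use 8 of 256).
trace = [[0, 1, 2], [0, 1, 3], [0, 2, 4], [0, 1, 5]]
hot = pick_hot_experts(trace, budget=2)

print(sorted(hot), round(hit_rate(trace, hot), 3))  # prints: [0, 1] 0.583
```

In this toy trace, pinning 2 of 6 experts already serves 7 of 12 lookups. The more skewed the real traffic is toward a few experts, the smaller the pinned set needs to be, which is the effect the post relies on to fit in 13.3 GB.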

this is the kind of work that quietly drops the whole local inference tier down a card. don’t let it scroll past.

Similar Articles

Qwen 35B-A3B is very usable with 12GB of VRAM

Reddit r/LocalLLaMA

A user benchmarks Qwen 35B-A3B (a 35B MoE model) on a 12GB RTX 3060, finding that 12GB VRAM is a practical sweet spot for running the model with 32k context, achieving ~47 t/s generation.

Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax

Reddit r/LocalLLaMA

Luce Spark is an open-source tool that enables running 35B MoE models on 16GB GPUs by intelligently caching hot experts on the GPU while keeping the rest in system RAM, using a calibrated placement and bounded async cache to maintain high throughput without the usual offload speed cliff.