@sudoingX: anyone running a 16gb card, stop scrolling. @pupposandro and @davideciffa got qwen 35b-a3b down to 13.3gb, measured on …

X AI KOLs Timeline 06/10/26, 08:13 PM News

inference-optimization moe-model vram-efficiency local-inference qwen-35b-a3b luce-spark technique

Summary

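A technique called Luce Spark lets the Qwen 35B-A3B MoE model fit in 13.3 GB of VRAM, within reach of a 16 GB GPU (the figure was measured on an RTX 3090). It learns which experts are used most often, keeps those resident in VRAM, and streams the rest from system RAM, reaching roughly 100 tok/s, close to the speed with every expert resident on a 24 GB card.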

Original Article

Cached at: 06/11/26, 05:43 PM

anyone running a 16gb card, stop scrolling. @pupposandro and @davideciffa got qwen 35b-a3b down to 13.3gb, measured on a 3090 gpu.

which means a model you literally could not load before now fits, running around 100 tok/s, near what you’d get with every expert resident on a 24gb card.

the clever part is the thing everyone gets wrong about moe. it only touches ~3b of its 35b params per token, routes to about 8 of 256 experts, but you still pay full vram to keep all of them around in case they’re next.
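
To put numbers on that, here is a quick back-of-the-envelope sketch using only the figures quoted in the post (~3B active of 35B total, about 8 of 256 experts per token). The constant names are made up for illustration, and the inputs are the post's rounded figures, not measurements.

```python
# Rounded figures quoted in the post; treat them as approximate.
TOTAL_PARAMS_B = 35.0     # total parameters, in billions
ACTIVE_PARAMS_B = 3.0     # parameters touched per token, in billions
NUM_EXPERTS = 256         # experts available to the router
EXPERTS_PER_TOKEN = 8     # experts the router picks per token

# Share of experts that do any work for a given token.
expert_share = EXPERTS_PER_TOKEN / NUM_EXPERTS

# Share of all parameters that do any work for a given token.
param_share = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

if __name__ == "__main__":
    print(f"experts used per token:    {expert_share:.1%}")
    print(f"parameters used per token: {param_share:.1%}")
    print(f"experts idle per token:    {1 - expert_share:.1%}")
```

The parameter share comes out higher than the expert share, presumably because non-expert weights such as attention run on every token. Either way, the large majority of experts sit idle on any single token while still occupying VRAM.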

luce spark learns which experts your traffic actually hits, pins those hot, and streams the rest from ram hidden under the matmuls so there’s no speed cliff. one flag, and it tunes itself warmer every restart.
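
The idea of pinning hot experts can be sketched as a toy frequency counter. This is not Luce Spark's code: `HotExpertCache`, `vram_slots`, and every method name here are hypothetical, and the sketch only counts which lookups would be served from VRAM versus streamed from RAM. It does not model the overlap of transfers with matmuls that the post describes.

```python
from collections import Counter


class HotExpertCache:
    """Toy model of frequency-based expert pinning (hypothetical names)."""

    def __init__(self, num_experts: int, vram_slots: int):
        if not 0 < vram_slots <= num_experts:
            raise ValueError("vram_slots must be between 1 and num_experts")
        self.num_experts = num_experts
        self.vram_slots = vram_slots
        self.hits = Counter()                 # how often each expert was routed to
        self.pinned = set(range(vram_slots))  # cold start: arbitrary choice
        self.resident = 0                     # lookups served from "VRAM"
        self.streamed = 0                     # lookups that had to come from "RAM"

    def route(self, expert_ids):
        """Record one token's routing decision."""
        for e in expert_ids:
            self.hits[e] += 1
            if e in self.pinned:
                self.resident += 1
            else:
                self.streamed += 1

    def repin(self):
        """Pin the most frequently hit experts, e.g. on restart."""
        hot = [e for e, _ in self.hits.most_common(self.vram_slots)]
        # Fill any leftover slots with experts that have not been seen yet.
        for e in range(self.num_experts):
            if len(hot) >= self.vram_slots:
                break
            if e not in hot:
                hot.append(e)
        self.pinned = set(hot)
        self.resident = 0
        self.streamed = 0
```

With a cold cache, traffic that keeps hitting the same few experts is all streamed. After one `repin()` those experts are resident and the same traffic no longer streams anything, which is the "tunes itself warmer every restart" behaviour in miniature.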

this is the kind of work that quietly drops the whole local inference tier down a card. don’t let it scroll past.

Similar Articles

Qwen 35B-A3B is very usable with 12GB of VRAM

Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax

@DeepTechTR: Qwen 3.6 27B is incredibly fast with 16 GB VRAM! The impact of Pure Quant The era of the 27B model that runs seamlessly…

2 old RTX 2080 Ti with 22GB vram each Qwen3.6 27B at 38 token/s with f16 kv cache

Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result
