Luce Spark is an open-source tool for running 35B MoE models on 16GB GPUs. It caches hot experts on the GPU and keeps the rest in system RAM, using calibrated placement and a bounded async cache to sustain high throughput without the usual offload speed cliff.
A developer demonstrates running MoE models such as Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s with 128k context on an old GTX 1080 (8GB VRAM), using llama.cpp with MoE offloading and TurboQuant KV cache quantization. The demonstration also reveals optimization tricks for Gemma's MTP speculative decoding.
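
The hot-expert caching idea in the Luce Spark summary can be illustrated with a small sketch. This is not Luce Spark's code: the names `calibrate_hot_experts` and `ExpertCache` are hypothetical, "calibrated placement" is assumed here to mean pinning the experts chosen most often in a calibration trace, the bounded cache is assumed to use LRU eviction, and the async transfer is replaced by a synchronous simulation that only counts hits and misses.

```python
from collections import Counter, OrderedDict


def calibrate_hot_experts(routing_trace, num_pinned):
    """Pick the experts routed to most often in a calibration trace (assumed policy)."""
    counts = Counter(routing_trace)
    return {expert for expert, _ in counts.most_common(num_pinned)}


class ExpertCache:
    """Pinned hot experts plus a bounded LRU pool for the remaining experts."""

    def __init__(self, pinned, lru_slots):
        self.pinned = set(pinned)      # experts resident on the GPU for the whole run
        self.lru_slots = lru_slots     # upper bound on extra GPU-resident experts
        self.lru = OrderedDict()       # expert_id -> True, oldest entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, expert_id):
        """Return where the expert was found: "gpu" on a hit, "ram" on a miss."""
        if expert_id in self.pinned:
            self.hits += 1
            return "gpu"
        if expert_id in self.lru:
            self.lru.move_to_end(expert_id)   # mark as most recently used
            self.hits += 1
            return "gpu"
        # Miss: a real system would start an async copy from system RAM here.
        self.misses += 1
        if self.lru_slots > 0:
            if len(self.lru) >= self.lru_slots:
                self.lru.popitem(last=False)  # evict the least recently used expert
            self.lru[expert_id] = True
        return "ram"


# Toy routing trace: experts 0 and 1 are "hot", the others are rare.
trace = [0, 1, 0, 2, 0, 1, 3, 0, 1, 4]
pinned = calibrate_hot_experts(trace, num_pinned=2)
cache = ExpertCache(pinned, lru_slots=1)
for expert_id in trace:
    cache.lookup(expert_id)
print(f"hits={cache.hits} misses={cache.misses}")  # hits=7 misses=3
```

In this toy trace, pinning the two most frequent experts serves 7 of 10 lookups from the GPU; only the rare experts fall through to system RAM.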