moe-inference

Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax

Reddit r/LocalLLaMA ↗ · 3d ago

Luce Spark is an open-source tool for running 35B MoE models on 16 GB GPUs. It keeps the frequently used ("hot") experts on the GPU and the rest in system RAM, relying on calibrated expert placement and a bounded asynchronous cache to sustain throughput and avoid the usual offload speed cliff.
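
The caching idea in this summary can be illustrated with a toy model. The sketch below is not Luce Spark's code. It assumes that "calibrated placement" means pinning the most frequently routed experts from a sample trace, and that the bounded cache evicts the least recently used expert. The names `calibrate` and `ExpertCache` are hypothetical, and the asynchronous transfer side is not modeled at all.

```python
from collections import Counter, OrderedDict


def calibrate(trace, n_pinned):
    """Pick the n_pinned most frequently routed expert ids from a sample trace."""
    counts = Counter(trace)
    return [expert for expert, _ in counts.most_common(n_pinned)]


class ExpertCache:
    """Toy bounded cache: pinned hot experts plus a small LRU for the rest."""

    def __init__(self, pinned, capacity):
        self.pinned = set(pinned)   # experts assumed resident on the GPU up front
        self.capacity = capacity    # max number of non-pinned experts held at once
        self.lru = OrderedDict()    # non-pinned resident experts, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, expert_id):
        """Return True if the expert is already resident, False if it must be fetched."""
        if expert_id in self.pinned:
            self.hits += 1
            return True
        if expert_id in self.lru:
            self.lru.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if self.capacity > 0:
            if len(self.lru) >= self.capacity:
                self.lru.popitem(last=False)  # evict the least recently used expert
            self.lru[expert_id] = True
        return False


# Hypothetical routing trace of expert ids, used only for illustration.
trace = [0, 0, 1, 0, 2, 0, 1, 3, 0, 1]
cache = ExpertCache(pinned=calibrate(trace, 2), capacity=1)
for expert in [2, 2, 3, 2, 0]:
    cache.access(expert)
```

In this model a miss stands in for a fetch from system RAM. The cap on the LRU portion is what keeps GPU memory use bounded regardless of how many distinct experts the router selects.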

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Reddit r/LocalLLaMA ↗ · 2026-05-13

A developer demonstrates running MoE models such as Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8 GB VRAM) with 128k context. The setup uses llama.cpp with MoE offloading and TurboQuant KV cache quantization, and the post also describes optimization tricks for Gemma's MTP speculative decoding.
