A technique called luce spark allows the Qwen 35B-A3B MoE model to run on a 16GB GPU (like the RTX 3090) by learning which experts are used most often, keeping those in VRAM, and streaming the rest from system RAM, reaching ~100 tok/s without a VRAM bottleneck.
llama.cpp release b9406 fixes a crash (a GGML_ASSERT failure) when using MTP (multi-token prediction) with MoE vision models such as Qwen3.6-35B-A3B.
Cohere launches Command A+, its first Mixture-of-Experts model, released under Apache 2.0 with efficient quantization for 1-2 GPU deployment, prioritizing practicality and open access for developers.
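The expert-streaming idea in the luce spark item above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the project's actual code: the class name `ExpertCache`, the `capacity` parameter, and the count-based eviction rule are all hypothetical, and real weights are replaced by integer expert IDs. It shows how tracking routing frequency lets the most-used experts stay resident in VRAM while the rest are served from RAM.

```python
from collections import Counter


class ExpertCache:
    """Toy frequency-based expert cache (hypothetical; not luce spark's real code)."""

    def __init__(self, capacity):
        self.capacity = capacity   # how many experts fit in VRAM (assumed value)
        self.resident = set()      # expert IDs currently treated as "in VRAM"
        self.counts = Counter()    # how often the router picked each expert
        self.hits = 0
        self.misses = 0

    def access(self, expert_id):
        """Record one routing decision and report where the expert was served from."""
        self.counts[expert_id] += 1
        if expert_id in self.resident:
            self.hits += 1
            return "vram"
        # Miss: this call is served from system RAM.
        self.misses += 1
        if len(self.resident) < self.capacity:
            self.resident.add(expert_id)
        else:
            # Promote the expert only if it is now used more than the coldest resident.
            coldest = min(self.resident, key=lambda e: (self.counts[e], e))
            if self.counts[expert_id] > self.counts[coldest]:
                self.resident.remove(coldest)
                self.resident.add(expert_id)
        return "ram"


if __name__ == "__main__":
    cache = ExpertCache(capacity=2)
    for expert in [0, 0, 1, 2, 0, 2, 2, 3, 0]:
        cache.access(expert)
    print(sorted(cache.resident), cache.hits, cache.misses)
```

In this sketch a rarely used expert never displaces a frequently used one, which is the property that keeps most lookups in fast memory when routing is skewed toward a few experts.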