Explains how the --n-cpu-moe (-ncmoe) flag in llama.cpp improves performance for MoE models like Qwen3.6 35B A3B on limited VRAM (8-12 GB) by keeping the expert tensors of some layers in system RAM for CPU compute while the rest of the model stays on the GPU, with benchmarks showing up to a 5x speedup on an RTX 3070 Ti.
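A minimal sketch of how the flag is typically combined with full GPU offload (the model path, context size, and layer count below are illustrative placeholders, not values from the thread):

```bash
# -ngl 99 tries to place every layer on the GPU; --n-cpu-moe then keeps
# the expert (MoE) tensors of the first 10 layers in system RAM for CPU
# compute. Raise the --n-cpu-moe count until the model fits in VRAM.
llama-server \
  -m ./models/moe-model-Q4_K_M.gguf \
  -ngl 99 \
  --n-cpu-moe 10 \
  -c 16384
```

Because only the sparse expert FFNs move to RAM while the dense attention path stays on the GPU, this tends to beat simply lowering -ngl across the board.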
Community discussion on whether Google's TurboQuant compression can already be applied to the KV cache in llama-server, or whether support still needs to be implemented.
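For context, llama-server already exposes KV-cache quantization through existing GGML types, which is presumably the knob a TurboQuant backend would plug into. A hedged sketch of what works today (model path illustrative, and this is the current mechanism, not TurboQuant):

```bash
# Quantize both halves of the KV cache to q8_0 to cut cache VRAM roughly
# in half. Quantizing the V cache may require flash attention
# (--flash-attn) depending on the build.
llama-server \
  -m ./models/model.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```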
A user reports that switching from a highly compressed IQ4_XS quant to the larger IQ4_NL_XL quant of Qwen 3.6 dramatically improves agentic-coding accuracy despite a lower tok/s, urging others to favor bigger quants when VRAM allows.
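A quick rule-of-thumb check for the "largest quant that fits" advice (file paths are illustrative): compare free VRAM against the GGUF file size, leaving headroom for the KV cache and compute buffers.

```bash
# Free VRAM on the first GPU, in MiB.
nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
# Sizes of the candidate quants; pick the largest that fits with headroom.
ls -lh ./models/*IQ4_XS*.gguf ./models/*IQ4_NL*.gguf
```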
Gemma 4’s vision performance is bottlenecked by a low default image-token budget; raising --image-max-tokens to 2240 in llama.cpp unlocks state-of-the-art OCR and fine-detail recognition at the cost of ~14 GB of extra VRAM.
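Assuming a llama.cpp build recent enough to expose the flag, the thread's fix amounts to loading the vision projector and raising the per-image token budget (file names are placeholders):

```bash
# Raise the budget to 2240 tokens per image, as suggested in the thread;
# expect substantially higher VRAM use at this setting.
llama-server \
  -m ./models/gemma-vision.gguf \
  --mmproj ./models/mmproj-gemma.gguf \
  --image-max-tokens 2240
```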
A user seeks advice on whether to buy an M5 Mac or a custom-built RTX 5090 PC for machine-learning projects involving fine-tuning, custom pipelines, and image/video-heavy workflows, and asks whether Apple's MLX framework is a viable alternative to NVIDIA's CUDA ecosystem.