Tag
This thread recommends AI models optimized for different VRAM levels, highlighting VibeThinker-3B for its strong reasoning performance at 3B parameters, along with other models for coding and general use.
This thread shares practical tips for freeing up GPU memory in llama.cpp, such as offloading the mmproj to the CPU and changing the KV cache types, and discusses parameters such as --cache-type-k, --cache-type-v, and --spec-draft-n-max.
Testing shows that default pipeline parallelism in llama.cpp wastes VRAM with no speed benefit; compiling with GGML_SCHED_MAX_COPIES=1 saves significant VRAM while maintaining identical inference speed.
Alok demonstrates running Gemma 4 26B MoE on 8GB VRAM using Unsloth's QAT quant and the -cmoe flag in llama.cpp, achieving 20 tokens/sec with 250k context, marking a major milestone for budget local AI.
A developer has implemented a proof-of-concept PR for llama.cpp that adds dynamic KV cache quantization via an HTTP endpoint, allowing users to requantize their KV cache on-demand without fully reloading the model. The post also outlines a wishlist including load-on-demand mmproj/MTP swapping and an automatic --fit flag for context optimization.
This pull request for the llama.cpp inference engine uses an f16 mask for Flash Attention to reduce VRAM usage.
A developer created an experimental fork of llama.cpp that offloads only used experts instead of entire layers to VRAM, achieving speed improvements for MoE models on GPUs with limited VRAM like the RTX 2060 12GB. The author is asking for testers to validate performance on other Nvidia GPUs.
llama.cpp's new --fit flag makes it possible to run models larger than the available VRAM at surprisingly high tokens/s, removing the old requirement that the whole model fit in VRAM.
A user reports running Qwen 3.6 with an ik_llama quantization at 50+ tokens/second on consumer hardware (16GB VRAM, 32GB RAM) with a 200k context window.
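
Several of the threads above turn on the KV cache type (--cache-type-k and --cache-type-v), so a rough size estimate may help when reading them. The sketch below is only back-of-the-envelope arithmetic, not llama.cpp code: the function name and the model dimensions (32 layers, 8 KV heads, head size 128, 32k context) are hypothetical, and it ignores sliding-window attention, padding, and any backend-specific overhead. The bytes-per-block figures follow the ggml block layouts for f16, q8_0, and q4_0.

```python
# Rough KV cache size estimate for different cache types.
# Assumption: every layer keeps full-context K and V tensors.

# (elements per block, bytes per block) for each cache type.
# q8_0: 32 int8 values + one f16 scale = 34 bytes.
# q4_0: 32 4-bit values (16 bytes) + one f16 scale = 18 bytes.
BLOCK_LAYOUT = {
    "f16": (1, 2),
    "q8_0": (32, 34),
    "q4_0": (32, 18),
}


def tensor_bytes(n_elements: int, cache_type: str) -> int:
    """Bytes needed to store n_elements in the given cache type."""
    block_size, block_bytes = BLOCK_LAYOUT[cache_type]
    if n_elements % block_size != 0:
        raise ValueError("element count must be a multiple of the block size")
    return n_elements // block_size * block_bytes


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, type_k: str = "f16", type_v: str = "f16") -> int:
    """Estimate the total K + V cache size in bytes (hypothetical helper)."""
    n_elements = n_layers * n_kv_heads * head_dim * n_ctx
    return tensor_bytes(n_elements, type_k) + tensor_bytes(n_elements, type_v)


if __name__ == "__main__":
    # Hypothetical model dimensions, chosen only for illustration.
    dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(**dims, type_k=cache_type, type_v=cache_type)
        print(f"{cache_type}: {size / 1024**3:.3f} GiB")
```

Under these assumed dimensions, moving from f16 to q8_0 roughly halves the cache and q4_0 brings it to a bit over a quarter, which is the kind of saving the KV cache threads are trading against output quality.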