vram-optimization

#vram-optimization

@0xSero: Best models for your hardware - 4gb to 12gb vram - VibeThinker-3B - smokes everything remotely close to its weight clas…

X AI KOLs Timeline ↗ · 16h ago

This thread recommends AI models optimized for different VRAM levels, highlighting VibeThinker-3B for its strong reasoning performance at 3B parameters, along with other models for coding and general use.

llama.cpp - how to free up even more space on your GPU

Reddit r/LocalLLaMA ↗ · yesterday

A thread sharing practical tips for freeing up GPU memory in llama.cpp, such as offloading mmproj to CPU and adjusting KV cache types, while discussing parameters like --cache-type-k/v and --spec-draft-n-max.
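To see why the KV cache type matters, here is a rough back-of-the-envelope estimate of cache size per type. It is a sketch, not llama.cpp's own accounting: the model dimensions are hypothetical, and the formula assumes a plain transformer where every layer caches keys and values for the full context.

```python
# Bytes per element for common KV cache types.
# q8_0 and q4_0 store 32 values per block plus one fp16 scale
# (34 and 18 bytes per block respectively).
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size in bytes for a plain transformer."""
    elements_per_tensor = n_layers * n_kv_heads * head_dim * n_ctx
    bytes_per_pair = BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v]
    return elements_per_tensor * bytes_per_pair


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 32768 context.
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(32, 8, 128, 32768, cache_type, cache_type)
        print(f"{cache_type}: {size / 2**30:.3f} GiB")
```

The type names match the values accepted by --cache-type-k and --cache-type-v; real usage will differ for models with sliding-window or hybrid attention.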

Pipeline parallelism in llama.cpp may be wasting your VRAM

Reddit r/LocalLLaMA ↗ · 2026-06-08

Testing shows that default pipeline parallelism in llama.cpp wastes VRAM with no speed benefit; compiling with GGML_SCHED_MAX_COPIES=1 saves significant VRAM while maintaining identical inference speed.
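As an illustration, the build setting can be passed as a CMake cache variable at configure time. The helper below only assembles the command line; the build directory name and the choice of the CUDA backend are assumptions, not details from the thread.

```python
import shlex


def cmake_configure_cmd(build_dir="build", sched_max_copies=1, cuda=True):
    """Assemble a CMake configure command for llama.cpp.

    build_dir and cuda are illustrative defaults, not values from the post.
    """
    cmd = ["cmake", "-B", build_dir,
           f"-DGGML_SCHED_MAX_COPIES={sched_max_copies}"]
    if cuda:
        # Assumes an Nvidia GPU; drop this for other backends.
        cmd.append("-DGGML_CUDA=ON")
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable command instead of running it.
    print(shlex.join(cmake_configure_cmd()))
```

Whether the saving applies to a given setup depends on the backend and GPU count, so it is worth comparing VRAM usage before and after rebuilding.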

@analogalok: Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what yo…

X AI KOLs Timeline ↗ · 2026-06-07

Alok demonstrates running Gemma 4 26B MoE on 8GB of VRAM using Unsloth's QAT quant and llama.cpp's -cmoe flag, reporting 20+ tokens/sec at 250k context, which the author presents as a milestone for budget local AI.
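A minimal sketch of what such a launch might look like, assuming llama-server as the entry point. The model filename, the exact context value, and the -ngl setting are placeholders rather than the author's configuration; -cmoe keeps the MoE expert weights in system RAM while the remaining layers go to the GPU.

```python
import shlex


def llama_server_cmd(model_path, n_ctx, cpu_moe=True, n_gpu_layers=99):
    """Assemble a llama-server command line (all values are illustrative)."""
    cmd = ["llama-server",
           "-m", model_path,
           "-c", str(n_ctx),
           "-ngl", str(n_gpu_layers)]
    if cpu_moe:
        # Keep MoE expert weights on the CPU to save VRAM.
        cmd.append("-cmoe")
    return cmd


if __name__ == "__main__":
    # Hypothetical GGUF filename; 250k context written out as 250000.
    print(shlex.join(llama_server_cmd("model-qat.gguf", 250_000)))
```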

Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist

Reddit r/LocalLLaMA ↗ · 2026-06-04

A developer has implemented a proof-of-concept PR for llama.cpp that adds dynamic KV cache quantization via an HTTP endpoint, allowing users to requantize their KV cache on-demand without fully reloading the model. The post also outlines a wishlist including load-on-demand mmproj/MTP swapping and an automatic --fit flag for context optimization.

llama: use f16 mask for FA to save VRAM by am17an · Pull Request #23764 · ggml-org/llama.cpp

Reddit r/LocalLLaMA ↗ · 2026-05-29

This pull request to the llama.cpp inference engine switches Flash Attention to an f16 mask to reduce VRAM usage.

Experts first llama.cpp

Reddit r/LocalLLaMA ↗ · 2026-05-22

A developer created an experimental fork of llama.cpp that offloads only used experts instead of entire layers to VRAM, achieving speed improvements for MoE models on GPUs with limited VRAM like the RTX 2060 12GB. The author is asking for testers to validate performance on other Nvidia GPUs.

Llama.cpp's auto fit works much better than I expected

Reddit r/LocalLLaMA ↗ · 2026-04-21

Llama.cpp's new --fit flag makes it practical to run models larger than available VRAM at surprisingly high tokens/s, so a model no longer has to fit entirely in VRAM to be usable.

QWEN3.6 + ik_llama is fast af

Reddit r/LocalLLaMA ↗ · 2026-04-19

A user reports running Qwen 3.6 with ik_llama at 50+ tokens/second on consumer hardware (16GB VRAM, 32GB RAM) with a 200k context window.
