Cards tagged #vram

@leftcurvedev_: Anyone with 8GB or 12GB VRAM setups needs to understand that "-ncmoe" is the key flag to boost performance on llama.cpp…

X AI KOLs Timeline · 2d ago

Explains how the -ncmoe flag in llama.cpp improves performance for MoE models like Qwen 3.6 35B A3B on limited VRAM (8-12 GB) by keeping some expert layers in system RAM and running them on the CPU, with benchmarks showing up to a 5x speedup on an RTX 3070 Ti.
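A minimal sketch of the pattern the post describes (the model filename and layer count here are hypothetical): -ncmoe is the short form of llama.cpp's --n-cpu-moe option, which keeps the MoE expert weights of the first N layers in system RAM while the rest of the model is offloaded to the GPU.

```python
import subprocess

# Sketch of the pattern from the post; model path and N are hypothetical.
# -ngl 99 offloads all layers to the GPU, then --n-cpu-moe (-ncmoe) pins the
# MoE expert tensors of the first N layers back in system RAM, so the GPU
# keeps only attention/dense weights plus whatever experts still fit.
subprocess.run([
    "./llama-server",
    "-m", "models/qwen3.6-35b-a3b-iq4_xs.gguf",  # hypothetical path
    "-ngl", "99",         # offload everything to the GPU first
    "--n-cpu-moe", "24",  # then keep experts of the first 24 layers on CPU
    "-c", "8192",         # context size
])
```

Tuning N down until the remaining weights just fit in VRAM is the usual approach: the dense attention layers dominate per-token latency, so keeping them on the GPU is what yields the speedup.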

Can we already use Google's TurboQuant (TQ) for KV Cache in llama-server? Or are we waiting for a PR?

Reddit r/LocalLLaMA · 2026-04-22

Community discussion on whether Google's TurboQuant compression can already be applied to the KV cache in llama-server, or whether support is still pending a PR.
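For context on where TurboQuant would slot in: a sketch of the simpler KV-cache quantization llama-server already ships via its --cache-type-k/--cache-type-v options (the model path is hypothetical, and the flash-attention flag syntax varies between builds).

```python
import subprocess

# Not TurboQuant: this is the KV-cache quantization llama-server supports
# today. Model path is hypothetical.
subprocess.run([
    "./llama-server",
    "-m", "models/model.gguf",
    "--flash-attn", "on",      # quantizing the V cache requires flash attention
    "--cache-type-k", "q8_0",  # 8-bit K cache
    "--cache-type-v", "q8_0",  # 8-bit V cache
])
```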

Consider running a bigger quant if possible

Reddit r/LocalLLaMA · 2026-04-22

A user reports that switching from the highly compressed IQ4_XS quant to the larger IQ4_NL_XL quant of Qwen 3.6 dramatically improves agentic-coding accuracy despite lower tok/s, urging others to favor bigger quants when VRAM allows.
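The tradeoff is easy to sanity-check before downloading. A rough sketch, assuming approximate bits-per-weight figures (exact numbers vary by quant recipe, since GGUF files mix tensor types):

```python
# Back-of-the-envelope: weight size ≈ params × bits-per-weight / 8.
# The bpw figures below are approximate, not exact file sizes.
APPROX_BPW = {"IQ4_XS": 4.25, "IQ4_NL": 4.5, "Q8_0": 8.5}

def approx_weight_gb(params_billions: float, quant: str) -> float:
    """Approximate weight size in GB for the given parameter count."""
    return params_billions * APPROX_BPW[quant] / 8

for quant in ("IQ4_XS", "IQ4_NL"):
    print(f"35B @ {quant}: ~{approx_weight_gb(35.0, quant):.1f} GB (+ KV cache)")
```

If the larger quant still fits alongside the KV cache, the accuracy gain usually costs only throughput, not capability.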

Gemma 4 Vision

Reddit r/LocalLLaMA · 2026-04-21

Gemma 4’s vision performance is bottlenecked by a low default image-token budget; raising --image-max-tokens to 2240 in llama.cpp unlocks state-of-the-art OCR and detail recognition at the cost of ~14 GB of extra VRAM.
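A sketch of the tweak as the post describes it; the model and projector filenames are hypothetical, and --image-max-tokens is the flag the post cites. A larger image-token budget means more image-embedding activations, which is where the extra VRAM goes.

```python
import subprocess

# Sketch of the post's tweak; filenames are hypothetical.
subprocess.run([
    "./llama-server",
    "-m", "models/gemma-4.gguf",
    "--mmproj", "models/gemma-4-mmproj.gguf",  # multimodal projector
    "--image-max-tokens", "2240",              # value reported in the post
])
```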

Which computer should I buy: Mac or custom-built 5090? [D]

Reddit r/MachineLearning · 2026-04-17

A user seeks advice on whether to buy a Mac (M5) or a custom-built RTX 5090 machine for ML projects involving fine-tuning, custom pipelines, and image/video-heavy workflows, and is curious about Apple's MLX framework as an alternative to NVIDIA's CUDA.
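For readers weighing the MLX side of that comparison, a minimal sketch of its NumPy-like programming model (an illustration, not a performance claim):

```python
import mlx.core as mx

# MLX arrays live in Apple-silicon unified memory; operations build a lazy
# graph and run on the GPU when evaluated.
a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))
c = a @ b     # lazy matmul
mx.eval(c)    # force evaluation
print(c.shape, c.dtype)
```

Unified memory removes the host-to-device copies a discrete GPU requires, which is the main architectural difference behind the Mac-vs-5090 question.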
