Are older Titan cards still viable?
Summary
A user explores the viability of older Nvidia Titan cards for running Gemma/Qwen MOE coding models, comparing memory bandwidth and cost against newer consumer cards.
Similar Articles
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)
A developer demonstrates running MoE models like Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8GB VRAM) with 128k context using llama.cpp with MoE offloading and TurboQuant KV cache quantization, revealing optimization tricks for Gemma's MTP speculative decoding.
Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss
A user benchmarks Google's Gemma 4 QAT models on an AMD 7900 XTX, reporting up to 45% faster generation, 83% higher throughput, and significant VRAM savings (e.g., 5.7GB for the 12B QAT model) with no quality loss compared to standard weights.
Qwen3.6-35B vs Gemma4-26B on 7900 XTX
A detailed benchmark comparing Qwen3.6-35B and Gemma4-26B on Radeon 7900 XTX shows Gemma is ~20% faster end-to-end despite slower token generation, because Qwen generates ~2x more tokens due to internal reasoning. The article recommends using Qwen for throughput-bound batch work and Gemma for latency-sensitive single requests.
Ran gemma 4 12b on my 3090 yesterday and I think the local model game just changed
A user reports running Google's Gemma 4 12B model locally on a single RTX 3090 via GGUF quantization, finding strong performance including real 256k context, multimodal capabilities, and function calling that outperforms larger 70B models for coding tasks.
@sudoingX: anyone running a 16gb card, stop scrolling. @pupposandro and @davideciffa got qwen 35b-a3b down to 13.3gb, measured on …
A technique called luce spark allows Qwen 35B-a3B MoE model to run on a 16GB GPU (like RTX 3090) by learning which experts are frequently used and streaming the rest from RAM, achieving ~100 tok/s without VRAM bottleneck.