Are older Titan cards still viable?

Reddit r/LocalLLaMA 06/11/26, 11:41 AM News

nvidia titan gpu llm coding hardware budget

Summary

A user asks whether older Nvidia Titan cards are viable for running Gemma/Qwen MoE coding models, comparing their memory bandwidth and cost against newer consumer cards.

I'm looking at older Nvidia cards under £200 for running Gemma/Qwen MoE coding models. Is there any reason to avoid the older 12GB Titan cards, other than their power draw? They have more memory bandwidth than most of the newer consumer cards:

Titan X 12GB: 480GB/s
Titan Xp 12GB: 547GB/s
Titan V 12GB: 652GB/s
RTX 2060 12GB: 336GB/s
RTX 2080 Ti 11GB: 616GB/s
RTX 3060 12GB: 360GB/s
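One way to put these bandwidth figures in perspective is a back-of-the-envelope ceiling on decode speed. The sketch below assumes token generation is purely memory-bound, so tokens per second is at most bandwidth divided by the bytes of weights read per token. The 3B active parameters and 4.5 bits per weight are illustrative assumptions (roughly an A3B-style MoE at a 4-bit quant), not measurements, and the function names are hypothetical. The estimate ignores compute limits, KV-cache traffic, and any experts offloaded to system RAM, so real numbers will be lower.

```python
# Memory bandwidth in GB/s, as quoted in the post.
BANDWIDTH_GBPS = {
    "Titan X 12GB": 480,
    "Titan Xp 12GB": 547,
    "Titan V 12GB": 652,
    "RTX 2060 12GB": 336,
    "RTX 2080 Ti 11GB": 616,
    "RTX 3060 12GB": 360,
}


def decode_ceiling_tok_s(bandwidth_gbps, active_params_billions=3.0,
                         bits_per_weight=4.5):
    """Upper bound on tokens/s if decoding is purely bandwidth-bound.

    ASSUMPTION: every generated token reads all active weights once from
    VRAM. The default parameter count and quantization width are
    illustrative guesses, not measured values.
    """
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / bytes_per_token


def rank_cards(bandwidths=BANDWIDTH_GBPS):
    """Return (card, ceiling tok/s) pairs, fastest first."""
    ceilings = {name: decode_ceiling_tok_s(bw) for name, bw in bandwidths.items()}
    return sorted(ceilings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, tok_s in rank_cards():
        print(f"{name:18s} ceiling ~{tok_s:6.0f} tok/s")
```

Because the same per-token byte count is used for every card, the ranking simply mirrors the bandwidth ranking. The value of the sketch is in showing how quickly the ceiling drops if more weights are active or a wider quantization is used.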

Similar Articles

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Reddit r/LocalLLaMA

A developer demonstrates running MoE models like Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8GB VRAM) with 128k context using llama.cpp with MoE offloading and TurboQuant KV cache quantization, revealing optimization tricks for Gemma's MTP speculative decoding.

Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss

Reddit r/LocalLLaMA

A user benchmarks Google's Gemma 4 QAT models on an AMD 7900 XTX, reporting up to 45% faster generation, 83% higher throughput, and significant VRAM savings (e.g., 5.7GB for the 12B QAT model) with no quality loss compared to standard weights.

Qwen3.6-35B vs Gemma4-26B on 7900 XTX

Reddit r/LocalLLaMA

A detailed benchmark comparing Qwen3.6-35B and Gemma4-26B on Radeon 7900 XTX shows Gemma is ~20% faster end-to-end despite slower token generation, because Qwen generates ~2x more tokens due to internal reasoning. The article recommends using Qwen for throughput-bound batch work and Gemma for latency-sensitive single requests.

Ran gemma 4 12b on my 3090 yesterday and I think the local model game just changed

Reddit r/artificial

A user reports running Google's Gemma 4 12B model locally on a single RTX 3090 via GGUF quantization, finding strong performance including real 256k context, multimodal capabilities, and function calling that outperforms larger 70B models for coding tasks.

@sudoingX: anyone running a 16gb card, stop scrolling. @pupposandro and @davideciffa got qwen 35b-a3b down to 13.3gb, measured on …

X AI KOLs Timeline

A technique called luce spark allows the Qwen 35B-A3B MoE model to run on a 16GB GPU (like RTX 3090) by learning which experts are used most often and streaming the rest from RAM, achieving ~100 tok/s without a VRAM bottleneck.
