Are the rich RAM/poor GPU people wrong here?
Summary
Discusses the trade-off between dense and Mixture-of-Experts (MoE) models for local AI, noting that high-RAM users have few MoE options beyond Qwen 3.5 122B and asking whether a large GPU is the only viable path.
Similar Articles
What is the point of MoE models, beyond being faster?
A discussion about the advantages of Mixture of Experts (MoE) models over dense models beyond speed, considering RAM constraints and scaling limits.
Rotary GPU: Exploring Local Execution for Large MoE Models Under Limited VRAM
This paper introduces Rotary GPU, an exploratory execution approach that enables running large Mixture-of-Experts models on consumer hardware with limited VRAM, achieving 21 tokens/s on an RTX 4060 with 8GB. It focuses on deployment accessibility rather than architectural improvements.
High VRAM local coding model — still Qwen 3.6 27B?
The user discusses their experience with Qwen 3.6 27B for local coding tasks and asks for recommendations for larger models (100B+) suitable for systems with 224GB of VRAM.
Performance When Offloading Large Models to System RAM?
Discusses the performance trade-offs of offloading large AI model weights from GPU VRAM to system RAM, comparing GPU configurations such as RTX 5090 vs. RTX 6000 for models like DeepSeek V4 Pro.
@andrewchen: finding the main downside with experimenting with local AI models is that you end up buying one GPU, then another, then…
Andrew Chen shares his experience of buying multiple GPUs for local AI experimentation, running Qwen3.6 27B dense at 100 tok/s on a 5090 eGPU, and compares it to Sonnet 4.6.