Are the rich-RAM / poor-GPU people wrong here?

Reddit r/LocalLLaMA News

Summary

Discusses the trade-off between dense and Mixture-of-Experts (MoE) models for local AI, noting that high-RAM users have few MoE options in the 100B class beyond Qwen 3.5 122B, and asking whether a large GPU is the only viable path.

Hello guys, I know everyone has their own definition of local models, but for me there are two "reasonable" types of frontier local model: a dense one that barely fits in 24GB or 32GB of VRAM, for the more "reasonable" GPU-wealthy guys, and an MoE around 100B params. A 100B-ish model can run with hybrid offload at a decent speed on 128GB of RAM, since 128GB is the max a standard motherboard can support. Again, it's not cheap, but ordinary people can still afford it; it's still cheaper than a car 😄.

We see a lot of dense models at that limit, like Qwen 27B, but for the 100B MoE type there was only Qwen 3.5 122B; they didn't even release a 3.6. The best MoE models are in the 30-35B range. Does that mean rich-RAM, poor-GPU people don't have much choice, and the big GPU was the only good road? Of course you can cram in something MiniMax-like at Q3 or DeepSeek V3 at Q1, but for tool calling, speed and real usage it's barely usable.

I bought a Strix Halo before the RAM-pocalypse, but I see very few use cases for the 128GB except being able to load multiple models, which can be done with llama-swap.
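
For anyone who wants to sanity-check the "does it fit" part of this, here is a rough back-of-the-envelope sketch. It only estimates the size of the weights from parameter count and bits per weight. The 4.5 bits per weight figure (a Q4-ish quant) and the 8GB allowance for KV cache and runtime overhead are assumptions picked for illustration, not measured values, and the function names are made up for this example.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, overhead_gb: float = 8.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in memory_gb."""
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= memory_gb


if __name__ == "__main__":
    # (label, params in billions, assumed bits per weight, memory budget in GB)
    cases = [
        ("27B dense on a 24GB GPU", 27, 4.5, 24),
        ("122B MoE on 128GB RAM", 122, 4.5, 128),
        ("122B MoE on a 24GB GPU", 122, 4.5, 24),
    ]
    for label, params, bits, budget in cases:
        size = weight_gb(params, bits)
        print(f"{label}: ~{size:.1f} GB of weights, fits={fits(params, bits, budget)}")
```

This says nothing about speed, which is the real complaint here: a model that fits in RAM can still be too slow for tool calling, and the estimate ignores how much of an MoE stays on the GPU during hybrid offload.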