Idea for how to run GLM2 at a decent quant, need critique/feedback

Reddit r/LocalLLaMA News

Summary

A user proposes a hardware setup using four RTX 5060 Ti GPUs and 512 GB of DDR3 server RAM to run GLM2 at a decent quantization and seeks feedback on the idea's viability.

I am currently running a 4x 5060 Ti P2P rig (64 GB VRAM total), where each card runs at PCIe Gen 3 with 4 lanes per card. My use case is inference only. In my benchmarking, the bottleneck for low-concurrency inference, such as a single-user setup, was compute, not PCIe bandwidth.

This gave me an idea. Since my cards are already running at PCIe Gen 3, I could pick up 512 GB of DDR3 in 16 GB modules and a Gen 3 server board that has 16 dedicated PCIe lanes to the x16 slot and supports 4x4 bifurcation. That might be the most economically viable setup for running GLM2 at a decent quant without the 5 tokens per second that you get with unified memory clusters. For example, the Supermicro X9DRi-F / X9DR3-F has 16 DIMM slots and would support 512 GB of RAM.

512 GB of DDR3 server RAM is roughly 500 USD. You can get a 5060 Ti 16 GB for 425 USD if you hunt for a deal. So that is 1,700 USD in GPUs plus 500 USD in RAM, plus whatever the motherboard and CPU cost. With those GPUs you would also be able to run Qwen/Qwen3.6-27B-FP8 with a BF16 KV cache at the maximum context of 262k at 72 tokens per second entirely in VRAM, as I mentioned in my previous post.

Am I missing something, or would this be viable for running GLM2?
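
As a quick sanity check on the figures in the post, here is a minimal Python sketch that only redoes the arithmetic. The prices, slot count, and module size are the ones quoted above. The PCIe number is the theoretical Gen 3 rate (8 GT/s per lane with 128b/130b encoding) before protocol overhead. The function name `build_summary` and the constant names are made up for illustration, and the motherboard and CPU prices are left out because the post does not give them.

```python
# Figures quoted in the post; motherboard and CPU prices are not given there.
GPU_COUNT = 4
GPU_PRICE_USD = 425      # per 5060 Ti 16 GB, "if you hunt for a deal"
GPU_VRAM_GB = 16
RAM_PRICE_USD = 500      # rough price quoted for 512 GB of DDR3
RAM_TARGET_GB = 512
DIMM_SLOTS = 16          # slot count quoted for the Supermicro boards
DIMM_SIZE_GB = 16        # module size mentioned in the post
LANES_PER_CARD = 4

# PCIe Gen 3: 8 GT/s per lane, 128b/130b encoding, 8 bits per byte.
# Theoretical GB/s per lane, per direction, before protocol overhead.
PCIE3_GBPS_PER_LANE = 8 * (128 / 130) / 8


def build_summary() -> dict:
    """Recompute the totals implied by the numbers quoted in the post."""
    gpu_cost = GPU_COUNT * GPU_PRICE_USD
    return {
        "gpu_cost_usd": gpu_cost,
        "known_cost_usd": gpu_cost + RAM_PRICE_USD,  # excludes board and CPU
        "total_vram_gb": GPU_COUNT * GPU_VRAM_GB,
        "ram_with_quoted_dimms_gb": DIMM_SLOTS * DIMM_SIZE_GB,
        "dimm_size_needed_gb": RAM_TARGET_GB / DIMM_SLOTS,
        "pcie_gbps_per_card": LANES_PER_CARD * PCIE3_GBPS_PER_LANE,
    }


if __name__ == "__main__":
    for key, value in build_summary().items():
        print(f"{key}: {value:.2f}" if isinstance(value, float) else f"{key}: {value}")
```

One thing the arithmetic surfaces: if the board really has 16 DIMM slots, filling them with 16 GB modules gives 256 GB, so reaching 512 GB would take 32 GB modules. It is worth checking which module sizes and types the board accepts before buying RAM.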