Idea for how to run GLM2 at a decent quant, need critique/feedback

Reddit r/LocalLLaMA 06/22/26, 07:57 PM News

hardware llm-inference gpu-setup quantization memory feedback

Summary

A user proposes a hardware setup using four RTX 5060 Ti GPUs and 512 GB of DDR3 server RAM to run GLM2 at a decent quantization and seeks feedback on the idea's viability.

I am currently running a 4x RTX 5060 Ti P2P rig (64 GB VRAM total) where each card runs at PCIe Gen 3 with 4 lanes per card. My use case is inference only. In my benchmarking, the bottleneck was compute, not PCIe bandwidth, for low-concurrency inference such as a single-user workload.

This gave me an idea. Since my cards are already running at PCIe Gen 3, I could pick up 512 GB of DDR3 in 16 GB modules and a Gen 3 server board that has 16 dedicated PCIe lanes to the x16 slot and supports 4x4 bifurcation. That might be the most economically viable setup for GLM2 at a decent quant, without the 5 tokens per second that you get with unified memory clusters. For example, the Supermicro X9DRi-F / X9DR3-F has up to 16 DIMM slots and would support 512 GB of RAM.

512 GB of DDR3 server RAM is roughly 500 dollars. You can get a 16 GB 5060 Ti for 425 USD if you hunt for a deal. So that is 1,700 in GPU costs plus 500 in RAM, plus whatever the motherboard and CPU cost. With those GPUs you would also be able to run Qwen/Qwen3.6-27B-FP8 with a BF16 KV cache at the max context of 262k at 72 tokens per second entirely in VRAM, as I mentioned in my previous post.

Am I missing something, or would this be viable for running GLM2?
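The figures quoted in the post can be tallied in a few lines. This is a rough sketch rather than a sizing tool: the prices, card count, slot count, and module size are taken from the post as stated, the motherboard and CPU are left unpriced because the post gives no figure, and the PCIe number is the theoretical one-direction rate for Gen 3 (8 GT/s per lane with 128b/130b encoding), not a measured value. The function name `build_summary` and the dictionary keys are made up for illustration.

```python
# Figures quoted in the post (USD and GB). Motherboard/CPU cost is not given.
GPU_COUNT = 4
GPU_PRICE_USD = 425
GPU_VRAM_GB = 16
RAM_PRICE_USD = 500
RAM_TARGET_GB = 512
DIMM_SLOTS = 16
DIMM_SIZE_GB = 16

# PCIe 3.0 theoretical throughput per lane, one direction:
# 8 GT/s with 128b/130b encoding, divided by 8 bits per byte.
PCIE3_GB_S_PER_LANE = 8 * (128 / 130) / 8
LANES_PER_CARD = 4


def build_summary():
    """Tally cost, capacity, and per-card link bandwidth for the proposed rig."""
    gpu_cost = GPU_COUNT * GPU_PRICE_USD
    return {
        "gpu_cost_usd": gpu_cost,
        # Known costs only: GPUs plus RAM, excluding motherboard and CPU.
        "known_cost_usd": gpu_cost + RAM_PRICE_USD,
        "vram_gb": GPU_COUNT * GPU_VRAM_GB,
        # Capacity if every slot holds a module of the size named in the post.
        "ram_with_16gb_dimms_gb": DIMM_SLOTS * DIMM_SIZE_GB,
        # Module size needed to reach the RAM target with all slots filled.
        "dimm_size_needed_gb": RAM_TARGET_GB // DIMM_SLOTS,
        "pcie_gb_s_per_card": round(PCIE3_GB_S_PER_LANE * LANES_PER_CARD, 2),
    }


if __name__ == "__main__":
    for key, value in build_summary().items():
        print(f"{key}: {value}")
```

One thing the tally surfaces: 16 slots filled with 16 GB modules come to 256 GB, so reaching 512 GB on a 16-slot board would take 32 GB modules. That may change the RAM price estimate.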

Similar Articles

Cheapest way to run GLM 5.x locally that's not a unified memory system?

GLM5.2 @7tg on 4x3090 + 192GB on budget motherboard + cpu

GLM 5.2 on 4x Sparks reasonable?

Best models in 3x3090 (72GB VRAM) in Q2 2026?

Giving GLM-5.2 a spin locally on CPU only! (poor man's rig for big models)
