GLM 5.2 on consumer hardware
Summary
A user tested the Unsloth-quantized GLM-5.2 model on a high-end consumer-class system with dual RTX 5090 GPUs, achieving 12 tokens per second.
Similar Articles
@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…
A quantized 478B-parameter GLM-5.1 model runs on 4×RTX Pro 6000 GPUs via SGLang, delivering 370k-token context at up to 45 tok/s decode and 1340 tok/s prefill, and is demoed driving Figma.
GLM5.2 @7tg on 4x3090 + 192GB on budget motherboard + cpu
Running GLM-5.2 at about 7 tok/s token generation (the "7tg" in the title) on a budget setup with 4x RTX 3090 GPUs and 192GB of RAM.
GLM 5.2 on Mac Studio Speedup PR
GLM 5.2 delivers major performance gains on Mac Studio with 512GB RAM, achieving prefill speeds above 100 t/s at high context lengths and enabling 4-bit quantization for contexts over 100k tokens, as detailed in a pull request by the oMLX creator.
I did some model hacks, and got GLM5.2 from about 2.5 tok/s to >50 tok/s on my GH200 system.
A detailed blog post describing how to speed up GLM-5.2 inference on a dual Grace Hopper system from 2.5 tok/s to over 50 tok/s by stopping the model's cross-module traffic and grafting an FP8 MTP head onto the INT4 base.
Giving GLM-5.2 a spin locally on CPU only! (poor man's rig for big models)
A user runs GLM-5.2 locally on CPU only, demonstrating how to run a large model on a modest setup.