GLM-5.2 @7tg on 4x3090 + 192GB on budget motherboard + CPU

@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…

X AI KOLs Timeline

A 478B-parameter GLM-5.1 model quantized to NVFP4 runs on 4× RTX Pro 6000 GPUs via SGLang with a 370,000-token maximum context (1.75x the full context). It decodes at roughly 27.7 (p10) to 45 (p90) tok/s, prefills at 1340 tok/s, and is demoed driving Figma.

GLM-5.2 UD-IQ1_M on llama.cpp — 5090 + 3090 Ti speed test (~ 579 t/s prefill @ 8k ctx, ~324 t/s prefill @ 57k ctx, ~10.6 t/s decode)

Reddit r/LocalLLaMA

Speed test results for GLM-5.2 (UD-IQ1_M quant) running on llama.cpp with an RTX 5090 and an RTX 3090 Ti: about 579 t/s prefill at 8k context, about 324 t/s prefill at 57k context, and about 10.6 t/s decode.

Idea for how to run GLM2 at a decent quant, need critique/feedback

Reddit r/LocalLLaMA

A user proposes a hardware setup using four RTX 5060 Ti GPUs and 512 GB of DDR3 server RAM to run GLM2 at a decent quantization and seeks feedback on the idea's viability.

@0xSero: Rejoice fellow 6000 enjoyers. We have GLM at home

X AI KOLs Following

A turnkey Docker setup to serve the GLM-5.2-NVFP4-REAP-469B model on 4× RTX PRO 6000 Blackwell GPUs using vLLM, with detailed instructions and configuration options.

@totheagi: We're the first to make the full GLM-5.2 (FP8) run on RTX 4090s. GLM-5.2 is the new 753B SOTA open-weights model, and i…