GLM 5.2 on consumer hardware

Reddit r/LocalLLaMA 06/25/26, 03:22 PM Models

model-quantization consumer-hardware llama.cpp gguf performance glm-5.2

Summary

A user tested the unsloth quantized GLM-5.2 model on a high-end consumer-like system with dual RTX 5090, achieving 12 tokens per second.

I tried out the unsloth quants of GLM 5.2 on still "consumer-ish" hardware: 32C Zen5 Threadripper Pro 9975 WX, Asus WRX90E-SAGE-SE PCIe Gen5, 512GB DDR5 ECC RAM @ 4800MHz, dual RTX 5090. This machine was put together pre-RAMpocalypse, and by then not exceedingly expensive compared to today's grotesque prices. The quant I used was unsloth/GLM-5.2-GGUF, UD-Q5_K_S (492GB of weights). I used a freshly compiled (cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120f" -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_MMQ=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=OFF -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=0; cmake --build build --config Release -j 64) llama.cpp with the following invocation: CUDA_VISIBLE_DEVICES=0,1 numactl --physcpubind=0-31 --localalloc llama.cpp/build/bin/llama-server \ --model ./GLM-5.2-UD-Q5_K_S-00001-of-00012.gguf \ --temp 1.0 \ --top-p 0.95 \ --min-p 0.01 \ --fit on --no-mmap --flash-attn on --ctx-size 32768 --no-warmup --prio 3 \ --threads 32 --threads-batch 32 --numa isolate --log-verbosity 4 --split-mode layer --direct-io --jinja With this I get consistently 12t/s. I just tried chatting, no agentic stuff. There is very little to none variation of speed by omitting or using last line's llama.cpp options; same applies to the numa stuff.

Original Article

GLM 5.2 on consumer hardware

Similar Articles

@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…

GLM5.2 @7tg on 4x3090 + 192GB on budget motherboard + cpu

GLM 5.2 on Mac Studio Speedup PR

I did some model hacks, and got GLM5.2 from about 2.5 tok/s to >50 tok/s on my GH200 system.

Giving GLM-5.2 a spin locally on CPU only! (poor man's rig for big models)

Submit Feedback

Similar Articles

@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…

GLM5.2 @7tg on 4x3090 + 192GB on budget motherboard + cpu
Running GLM5.2 with 7 trillion tokens on a budget setup using 4x RTX 3090 GPUs and 192GB RAM.

GLM 5.2 on Mac Studio Speedup PR

I did some model hacks, and got GLM5.2 from about 2.5 tok/s to >50 tok/s on my GH200 system.

Giving GLM-5.2 a spin locally on CPU only! (poor man's rig for big models)