GLM 5.2 on consumer hardware

Reddit r/LocalLLaMA Models

Summary

A user ran the unsloth UD-Q5_K_S quant of GLM-5.2 (492GB of weights) on a high-end "consumer-ish" workstation with dual RTX 5090s and 512GB of DDR5 RAM, getting a consistent 12 tokens per second in chat use.

I tried out the unsloth quants of GLM 5.2 on still "consumer-ish" hardware: 32C Zen5 Threadripper Pro 9975WX, Asus WRX90E-SAGE-SE PCIe Gen5, 512GB DDR5 ECC RAM @ 4800MHz, dual RTX 5090. This machine was put together pre-RAMpocalypse, and back then it was not exceedingly expensive compared to today's grotesque prices.

The quant I used was unsloth/GLM-5.2-GGUF, UD-Q5_K_S (492GB of weights). I used a freshly compiled llama.cpp, built with:

    cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120f" -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_MMQ=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=OFF -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=0
    cmake --build build --config Release -j 64

and ran it with the following invocation:

    CUDA_VISIBLE_DEVICES=0,1 numactl --physcpubind=0-31 --localalloc llama.cpp/build/bin/llama-server \
      --model ./GLM-5.2-UD-Q5_K_S-00001-of-00012.gguf \
      --temp 1.0 \
      --top-p 0.95 \
      --min-p 0.01 \
      --fit on --no-mmap --flash-attn on --ctx-size 32768 --no-warmup --prio 3 \
      --threads 32 --threads-batch 32 --numa isolate --log-verbosity 4 --split-mode layer --direct-io --jinja

With this I consistently get 12t/s. I just tried chatting, no agentic stuff. There is little to no variation in speed whether I use or omit the llama.cpp options on the last line; the same applies to the numa stuff.
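For anyone sizing a similar box, here is a rough back-of-the-envelope sketch of where the 492GB of weights can live on this machine. The weight size, RAM size, and GPU count come from the post. The 32GB per RTX 5090 is the card's spec and is not stated in the post. The per-GPU reserve for KV cache and compute buffers is purely an assumed placeholder, and the function name is hypothetical. It does not model what llama.cpp's --fit actually decides, only the upper bound on what could sit in VRAM.

```python
# Rough memory budget for the setup described in the post (all sizes in GB).
WEIGHTS_GB = 492        # UD-Q5_K_S weights, from the post
SYSTEM_RAM_GB = 512     # from the post
GPU_COUNT = 2           # dual RTX 5090, from the post
VRAM_PER_GPU_GB = 32    # RTX 5090 spec; not stated in the post
VRAM_RESERVE_GB = 4     # ASSUMPTION: per-GPU allowance for KV cache and buffers


def split_weights(weights_gb, gpu_count, vram_per_gpu_gb, vram_reserve_gb):
    """Return (GB that could fit in VRAM, GB that must stay in system RAM)."""
    usable_vram = gpu_count * max(vram_per_gpu_gb - vram_reserve_gb, 0)
    on_gpu = min(weights_gb, usable_vram)
    on_cpu = weights_gb - on_gpu
    return on_gpu, on_cpu


on_gpu, on_cpu = split_weights(
    WEIGHTS_GB, GPU_COUNT, VRAM_PER_GPU_GB, VRAM_RESERVE_GB
)
ram_headroom = SYSTEM_RAM_GB - on_cpu
cpu_share = on_cpu / WEIGHTS_GB

print(f"weights that could sit in VRAM: {on_gpu} GB")
print(f"weights left in system RAM:     {on_cpu} GB ({cpu_share:.0%} of total)")
print(f"system RAM headroom:            {ram_headroom} GB")
```

Under these assumptions the large majority of the weights stay in system RAM no matter how the layers are split, which may be part of why the numa and threading flags made so little difference, though the post does not test that directly.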
