@0xSero: Rejoice fellow 6000 enjoyers. We have GLM at home
Summary
A turnkey Docker setup to serve the GLM-5.2-NVFP4-REAP-469B model on 4× RTX PRO 6000 Blackwell GPUs using vLLM, with detailed instructions and configuration options.
View Cached Full Text
Cached at: 06/19/26, 12:13 AM
Rejoice fellow 6000 enjoyers.
We have GLM at home https://t.co/pAUQf4WPpe
0xSero/glm-5.2-sm120
Source: https://github.com/0xSero/glm-5.2-sm120
GLM-5.2-NVFP4-REAP-469B — vLLM serving (4× RTX PRO 6000 Blackwell)
A turnkey Docker setup to serve 0xSero/GLM-5.2-NVFP4-REAP-469B
(REAP-pruned, NVFP4, DeepSeek-Sparse-Attention) on 4× NVIDIA RTX PRO 6000 Blackwell
(SM120, 96 GB each) with the voipmonitor b12x vLLM image.
Model: huggingface.co/0xSero/GLM-5.2-NVFP4-REAP-469B · ~313 GB on disk (NVFP4) · REAP-pruned 469B MoE · DeepSeek Sparse Attention + MTP.
Validated config: DCP=4 · 250k context · ~3× concurrency · ~60 tok/s decode · fp8 KV cache · MTP speculative decode · tool-calling · thinking off by default (toggle on).
Hardware target
| GPUs | 4× RTX PRO 6000 Blackwell (SM120), 96 GB each, no NVLink (PCIe) |
| Model on disk | ~313 GB (NVFP4), ~78.6 GB/GPU resident |
| Interconnect | PCIe — requires NCCL_P2P_DISABLE=1 (see below) |
Prerequisites
- Docker + the NVIDIA Container Toolkit.
- Access to the b12x image (
voipmonitor/vllm:black-benediction-…). It is the only image that bundlesGlmMoeDsaForCausalLM+Glm4MoeMTPModel+ the SM120 sparse-MLA kernel (B12X_MLA_SPARSE) + the ModelOpt NVFP4 MoE loader. - The model weights on a local path (default
/mnt/llm_models/GLM-5.2-NVFP4-REAP-469B).
Quick start — one command
# 1. Download the weights (~313 GB NVFP4) — needs the hf CLI: pip install -U huggingface_hub
hf download 0xSero/GLM-5.2-NVFP4-REAP-469B --local-dir /mnt/llm_models/GLM-5.2-NVFP4-REAP-469B
# 2. (optional) point at a different weights path / port
cp .env.example .env # defaults already target /mnt/llm_models/... on port 8000
# 3. Launch. This starts the server, waits for it, and smoke-tests it.
./launch.sh
launch.sh blocks until the server is actually answering, then prints:
✅ READY — GLM-5.2 is serving and answered: READY
Endpoint : http://localhost:8000/v1 (model: GLM-5.2-NVFP4-REAP-469B)
Try it : ./chat.sh "write a haiku about GPUs"
(First boot compiles kernels + captures CUDA graphs, ~6 min — launch.sh waits it
out and tails the log if anything fails.) Then talk to it:
./chat.sh "explain quicksort in 3 bullet points"
# or hit the OpenAI-compatible API directly:
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"model":"GLM-5.2-NVFP4-REAP-469B","messages":[{"role":"user","content":"hello"}]}'
docker compose up -d works too (same config; thinking off).
Reasoning (thinking) — off by default, so it just works
GLM-5.2 is a reasoning model. Thinking is OFF by default so a normal request —
even with a small max_tokens — returns a direct answer in message.content.
Why this matters: a reasoning model with thinking on spends its token budget “thinking” first. A short
max_tokensthen gets consumed before the answer, socontentcomes back empty (finish_reason: "length"). Defaulting thinking off avoids that entirely — the endpoint behaves like any normal chat model.
Want the chain-of-thought (better on hard math/code/logic)? Turn it on:
ENABLE_THINKING=1 ./launch.sh # restart with reasoning enabled
./chat.sh --think "prove sqrt(2) is irrational"
With ENABLE_THINKING=1 the server adds --reasoning-parser glm45, so the
chain-of-thought streams into message.reasoning and the answer into
message.content. Give thinking requests a generous max_tokens (≥ 2000).
| Mode | Default request behaviour | Use when |
|---|---|---|
| thinking off (default) | answer directly in content, any max_tokens | general use, agents, tools |
thinking on (ENABLE_THINKING=1) | reasoning + content split; use max_tokens ≥ 2000 | hard reasoning, benchmarks |
Tool calling (--tool-call-parser glm47) works in both modes — parse
message.tool_calls as usual.
Validated configuration
| Setting | Value | Why |
|---|---|---|
--tensor-parallel-size | 4 | one shard per GPU |
--decode-context-parallel-size | 4 | shards MLA KV across the 4 GPUs → 250k fits |
--max-model-len | 250000 | 710,593-token pool → 2.84× concurrency at 250k |
--max-num-seqs | 2 | target concurrency |
--kv-cache-dtype | fp8 | fp8_ds_mla; required on SM120 (bf16 = garbage) |
--quantization | modelopt_fp4 | NVFP4 weights |
--attention-backend | B12X_MLA_SPARSE | SM120-native sparse MLA decode |
--moe-backend | b12x | NVFP4 MoE |
--speculative-config | mtp, 3 tokens | MTP speculative decode |
--hf-overrides | index_topk_pattern | coherence-critical (see below) |
--tool-call-parser | glm47 | tool calls (always on) |
--reasoning-parser | glm45 | only with ENABLE_THINKING=1 (thinking off by default) |
Why index_topk_pattern (coherence-critical)
GLM-5.2 uses DeepSeek Sparse Attention. vLLM reads index_topk_pattern, not the
checkpoint’s indexer_types array. Without the pattern, all 78 layers build full
indexers and the 57 “share/skip” (S) layers corrupt long-context attention →
garbage output. The 78-char F/S string (21 F, 57 S) is derived from the
model’s indexer_types and injected via --hf-overrides. On boot you should see
57 log lines: Using index_topk_pattern/index_topk_freq to skip sparse MLA indexer ….
Why DCP_SIZE=4
With DCP=1 the MLA KV cache is replicated per TP rank, so a single 250k request needs
~14.5 GB but only ~10.3 GB/GPU is free → OOM (max ~177k). decode-context-parallel-size=4
shards the KV across the 4 GPUs along the sequence dim, yielding a 710,593-token pool.
DCP=4 is the recipe — 250k context at ~3× concurrency
DCP=4 is the default and the validated flagship: 250k context · ~3× concurrency
(710,593-token KV pool) · ~60 tok/s decode (warm, with MTP). Sharding the MLA KV
across the 4 GPUs along the sequence dim is what makes 250k fit at all.
DCP_SIZE | Max context | KV pool @ max | Concurrency | Decode (warm) | |
|---|---|---|---|---|---|
| 4 (default) | 250000 | 710,593 tok | ~3× (2.84×) | ~60 tok/s | the recipe |
| 2 | 250000 | ~355k tok | 1.42× | ~49 tok/s | middle ground |
| 1 | ~131k (OOM > ~177k) | ~178k tok | 1.36× | ~81 tok/s | optional: smaller ctx, faster |
Measured at temp 0, b12x, MTP=3, fp8 KV. Decode is steady-state and MTP-acceptance-dependent (~60 tok/s warm short ctx). Cold TTFT is compile-dominated — the first request at a new prompt length JIT-compiles that size bucket (tens of seconds), then warm / prefix-cache hits are fast. A slow first request is the kernel cache warming up, not a hang.
Only if you specifically want a smaller, faster ≤128k box instead, set DCP_SIZE=1
MAX_MODEL_LEN=131072in.env(~81 tok/s, but caps at ~131k and loses the 250k / ~3× headroom).
Why NCCL_P2P_DISABLE=1
These RTX PRO 6000 are PCIe (no NVLink); the b12x PCIe allreduce path hangs at NCCL init without P2P disabled.
Performance (measured, warm)
| Metric | Value |
|---|---|
| Decode | ~60 tok/s (short ctx, warm, MTP), ~45 tok/s @ 64k–100k |
| Prefill | ~5,100 tok/s @ 64k (warm); ~45k–65k tok/s on prefix-cache hits |
| TTFT | sub-second (short ctx); ~12 s for a fresh uncached 64k prefill |
| Concurrency | ~3× (2.84×) at 250k (710,593-token KV pool) |
First touch of a brand-new long prefix incurs a one-time compile of that size bucket (e.g. ~195 s for a fresh 99.5k prompt). Subsequent same-size prefills and prefix-cache hits are fast.
Testing
python3 test/coherence_test.py core # logic, math, code, philosophy, ascii, multi-turn
python3 test/coherence_test.py long # 64k needle-in-haystack recall
python3 test/longctx_multiturn_test.py # ~100k-token, 6-turn reasoning battery
All harnesses stream with no max_tokens and report TTFT, prefill tok/s, and
decode tok/s. The reasoning model emits its chain-of-thought in the reasoning field
and the final answer in content.
fp8 / fp4 KV cache on SM120
- fp8 KV (
fp8_ds_mla) works and is the practical floor. The checkpoint ships nok/v_scale, so fp8 runs at scale 1.0 (a one-line startup warning). - fp4 KV is hardware-blocked on SM120: the DSA fp4 indexer cache asserts SM100 (datacenter Blackwell, B200/GB200). Not available on the RTX PRO 6000.
Troubleshooting
| Symptom | Fix |
|---|---|
| OOM at 250k, “estimated maximum model length ~177728” | set DCP_SIZE=4 |
| Garbage / incoherent long-context output | ensure INDEX_TOPK_PATTERN is set (57 skip lines on boot) |
| Hang at NCCL init | keep NCCL_P2P_DISABLE=1 |
| Garbage at all lengths | --kv-cache-dtype fp8 is mandatory on SM120 |
Empty / null content | running with ENABLE_THINKING=1 + small max_tokens (thinking ate the budget) |
Similar Articles
@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…
A quantized 478B-parameter GLM-5.1 model runs on 4×RTX Pro 6000 GPUs via SGLang, delivering 370k-token context at up to 45 tok/s decode and 1340 tok/s prefill, and is demoed driving Figma.
@totheagi: We're the first to make the full GLM-5.2 (FP8) run on RTX 4090s. GLM-5.2 is the new 753B SOTA open-weights model, and i…
We're the first to run the full GLM-5.2 (753B FP8) on RTX 4090s by porting sparse-attention kernels to Ada GPUs, enabling frontier open-weights model on commodity hardware.
@leopardracer: https://x.com/leopardracer/status/2055341758523883631
A user shares their experience setting up a dual-GPU local AI lab with RTX 4080 Super and 5060 Ti, running Qwen 3.6 models via llama.cpp and llama-swap to reduce API costs and enable unrestricted experimentation.
My GLM-5.2-FP8 HGX-H200 SGLang docker deploy config
A user shares their Docker deployment configuration for running the GLM-5.2-FP8 model on HGX-H200 hardware using SGLang, achieving 262k context and 70 tokens/s.
@ollama: GLM 5.2 on Ollama's cloud just doubled GPU capacity to handle the volume of usage! This is all US based, and running on…
Ollama doubled GPU capacity for GLM 5.2 on its US cloud, using NVIDIA B300 Blackwell GPUs, emphasizing privacy and open models.