@0xSero: Rejoice fellow 6000 enjoyers. We have GLM at home
Summary
A turnkey Docker setup to serve the GLM-5.2-NVFP4-REAP-469B model on 4× RTX PRO 6000 Blackwell GPUs using vLLM, with detailed instructions and configuration options.
Cached at: 06/19/26, 12:13 AM
0xSero/glm-5.2-sm120
Source: https://github.com/0xSero/glm-5.2-sm120
GLM-5.2-NVFP4-REAP-469B — vLLM serving (4× RTX PRO 6000 Blackwell)
A turnkey Docker setup to serve 0xSero/GLM-5.2-NVFP4-REAP-469B
(REAP-pruned, NVFP4, DeepSeek-Sparse-Attention) on 4× NVIDIA RTX PRO 6000 Blackwell
(SM120, 96 GB each) with the voipmonitor b12x vLLM image.
Model: huggingface.co/0xSero/GLM-5.2-NVFP4-REAP-469B · ~313 GB on disk (NVFP4) · REAP-pruned 469B MoE · DeepSeek Sparse Attention + MTP.
Validated config: DCP=4 · 250k context · ~3× concurrency · ~60 tok/s decode · fp8 KV cache · MTP speculative decode · tool-calling · thinking off by default (toggle on).
Hardware target
| Component | Details |
|---|---|
| GPUs | 4× RTX PRO 6000 Blackwell (SM120), 96 GB each, no NVLink (PCIe) |
| Model on disk | ~313 GB (NVFP4), ~78.6 GB/GPU resident |
| Interconnect | PCIe; requires NCCL_P2P_DISABLE=1 (see below) |
Prerequisites
- Docker + the NVIDIA Container Toolkit.
- Access to the b12x image (voipmonitor/vllm:black-benediction-…). It is the only image that bundles GlmMoeDsaForCausalLM + Glm4MoeMTPModel + the SM120 sparse-MLA kernel (B12X_MLA_SPARSE) + the ModelOpt NVFP4 MoE loader.
- The model weights on a local path (default /mnt/llm_models/GLM-5.2-NVFP4-REAP-469B).
Quick start — one command
# 1. Download the weights (~313 GB NVFP4) — needs the hf CLI: pip install -U huggingface_hub
hf download 0xSero/GLM-5.2-NVFP4-REAP-469B --local-dir /mnt/llm_models/GLM-5.2-NVFP4-REAP-469B
# 2. (optional) point at a different weights path / port
cp .env.example .env # defaults already target /mnt/llm_models/... on port 8000
# 3. Launch. This starts the server, waits for it, and smoke-tests it.
./launch.sh
launch.sh blocks until the server is actually answering, then prints:
✅ READY — GLM-5.2 is serving and answered: READY
Endpoint : http://localhost:8000/v1 (model: GLM-5.2-NVFP4-REAP-469B)
Try it : ./chat.sh "write a haiku about GPUs"
(First boot compiles kernels + captures CUDA graphs, ~6 min — launch.sh waits it
out and tails the log if anything fails.) Then talk to it:
./chat.sh "explain quicksort in 3 bullet points"
# or hit the OpenAI-compatible API directly:
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"model":"GLM-5.2-NVFP4-REAP-469B","messages":[{"role":"user","content":"hello"}]}'
docker compose up -d works too (same config; thinking off).
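For scripted access, the same request can be issued from Python's standard library. This is a minimal sketch, assuming the endpoint and model name that launch.sh prints; build_chat_request and chat are illustrative helper names, not part of the repo.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # endpoint printed by launch.sh
MODEL = "GLM-5.2-NVFP4-REAP-469B"       # served model name


def build_chat_request(prompt, max_tokens=None, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=600, **kwargs):
    """Send the request and return the assistant's message.content."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, chat("hello") returns the answer text. The generous default timeout is there because the first request at a new prompt length can take a while to compile.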
Reasoning (thinking) — off by default, so it just works
GLM-5.2 is a reasoning model. Thinking is OFF by default so a normal request —
even with a small max_tokens — returns a direct answer in message.content.
Why this matters: a reasoning model with thinking on spends its token budget “thinking” first. A short max_tokens then gets consumed before the answer, so content comes back empty (finish_reason: "length"). Defaulting thinking off avoids that entirely, so the endpoint behaves like any normal chat model.
Want the chain-of-thought (better on hard math/code/logic)? Turn it on:
ENABLE_THINKING=1 ./launch.sh # restart with reasoning enabled
./chat.sh --think "prove sqrt(2) is irrational"
With ENABLE_THINKING=1 the server adds --reasoning-parser glm45, so the
chain-of-thought streams into message.reasoning and the answer into
message.content. Give thinking requests a generous max_tokens (≥ 2000).
| Mode | Default request behaviour | Use when |
|---|---|---|
| thinking off (default) | answer directly in content, any max_tokens | general use, agents, tools |
| thinking on (ENABLE_THINKING=1) | reasoning + content split; use max_tokens ≥ 2000 | hard reasoning, benchmarks |
Tool calling (--tool-call-parser glm47) works in both modes — parse
message.tool_calls as usual.
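A client can handle both modes with one small parser. The sketch below assumes the response follows the OpenAI chat-completions shape, with the chain-of-thought in message.reasoning as described above; summarize_choice and the budget_exhausted flag are hypothetical names used for illustration.

```python
import json


def summarize_choice(response):
    """Split one chat-completion response into answer, reasoning, and tool calls."""
    choice = response["choices"][0]
    message = choice["message"]
    content = message.get("content") or ""
    reasoning = message.get("reasoning") or ""   # populated only with thinking on
    tool_calls = [
        {
            "name": call["function"]["name"],
            # OpenAI-style tool calls carry arguments as a JSON-encoded string.
            "arguments": json.loads(call["function"]["arguments"] or "{}"),
        }
        for call in (message.get("tool_calls") or [])
    ]
    # No answer, no tool call, and cut off by length: thinking ate the budget.
    starved = (not content and not tool_calls
               and choice.get("finish_reason") == "length")
    return {"content": content, "reasoning": reasoning,
            "tool_calls": tool_calls, "budget_exhausted": starved}
```

When budget_exhausted is true, retrying with a larger max_tokens (or with thinking off) is the likely remedy.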
Validated configuration
| Setting | Value | Why |
|---|---|---|
| --tensor-parallel-size | 4 | one shard per GPU |
| --decode-context-parallel-size | 4 | shards MLA KV across the 4 GPUs → 250k fits |
| --max-model-len | 250000 | 710,593-token pool → 2.84× concurrency at 250k |
| --max-num-seqs | 2 | target concurrency |
| --kv-cache-dtype | fp8 | fp8_ds_mla; required on SM120 (bf16 = garbage) |
| --quantization | modelopt_fp4 | NVFP4 weights |
| --attention-backend | B12X_MLA_SPARSE | SM120-native sparse MLA decode |
| --moe-backend | b12x | NVFP4 MoE |
| --speculative-config | mtp, 3 tokens | MTP speculative decode |
| --hf-overrides | index_topk_pattern | coherence-critical (see below) |
| --tool-call-parser | glm47 | tool calls (always on) |
| --reasoning-parser | glm45 | only with ENABLE_THINKING=1 (thinking off by default) |
Why index_topk_pattern (coherence-critical)
GLM-5.2 uses DeepSeek Sparse Attention. vLLM reads index_topk_pattern, not the
checkpoint’s indexer_types array. Without the pattern, all 78 layers build full
indexers and the 57 “share/skip” (S) layers corrupt long-context attention →
garbage output. The 78-char F/S string (21 F, 57 S) is derived from the
model’s indexer_types and injected via --hf-overrides. On boot you should see
57 log lines: Using index_topk_pattern/index_topk_freq to skip sparse MLA indexer ….
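Because a wrong pattern silently degrades output, it can be worth sanity-checking the string before launch. The sketch below validates the counts stated above and wraps the pattern as JSON. It assumes index_topk_pattern is a top-level key in the --hf-overrides JSON, and hf_overrides_for is a hypothetical helper; the placeholder pattern is not the real layer ordering.

```python
import json

NUM_LAYERS = 78   # total layers, one character each
NUM_FULL = 21     # "F" layers: build a full indexer
NUM_SKIP = 57     # "S" layers: share/skip


def hf_overrides_for(pattern):
    """Validate an F/S pattern and return a JSON string for --hf-overrides."""
    if set(pattern) - {"F", "S"}:
        raise ValueError("pattern may only contain 'F' and 'S'")
    if len(pattern) != NUM_LAYERS:
        raise ValueError(f"expected {NUM_LAYERS} characters, got {len(pattern)}")
    if pattern.count("F") != NUM_FULL or pattern.count("S") != NUM_SKIP:
        raise ValueError(f"expected {NUM_FULL} F and {NUM_SKIP} S")
    return json.dumps({"index_topk_pattern": pattern})


# Placeholder layout for demonstration only; the real ordering must be
# derived from the checkpoint's indexer_types.
placeholder = "F" * NUM_FULL + "S" * NUM_SKIP
print(hf_overrides_for(placeholder))
```

The check catches truncated or miscounted strings, but it cannot tell whether the F and S positions match the checkpoint; the 57 skip lines in the boot log remain the real confirmation.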
Why DCP_SIZE=4
With DCP=1 the MLA KV cache is replicated per TP rank, so a single 250k request needs
~14.5 GB but only ~10.3 GB/GPU is free → OOM (max ~177k). decode-context-parallel-size=4
shards the KV across the 4 GPUs along the sequence dim, yielding a 710,593-token pool.
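The numbers above can be reproduced with back-of-the-envelope arithmetic. This sketch assumes KV memory grows linearly with token count and that DCP splits a sequence evenly across ranks; it ignores any per-rank overhead, so results are approximate. kv_budget is an illustrative name.

```python
KV_NEEDED_GB_AT_250K = 14.5   # per-GPU KV for one 250k request with DCP=1
KV_FREE_GB_PER_GPU = 10.3     # per-GPU memory left for KV after weights
TARGET_CONTEXT = 250_000


def kv_budget(dcp_size):
    """Return (per-GPU GB for one 250k request, total token pool)."""
    gb_per_token = KV_NEEDED_GB_AT_250K / TARGET_CONTEXT
    # Each DCP rank holds 1/dcp_size of the sequence's KV.
    per_gpu_gb = KV_NEEDED_GB_AT_250K / dcp_size
    # Tokens that fit in one GPU's free memory, times the number of shards.
    pool_tokens = KV_FREE_GB_PER_GPU / gb_per_token * dcp_size
    return per_gpu_gb, pool_tokens


for dcp in (1, 2, 4):
    per_gpu_gb, pool = kv_budget(dcp)
    fits = "fits" if per_gpu_gb <= KV_FREE_GB_PER_GPU else "OOM"
    print(f"DCP={dcp}: {per_gpu_gb:.1f} GB/GPU for 250k ({fits}), "
          f"pool ~{pool:,.0f} tokens")
```

Under these assumptions the pool comes out at about 177.6k tokens for DCP=1 and about 710k for DCP=4, close to the reported ~177k limit and 710,593-token pool.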
DCP=4 is the recipe — 250k context at ~3× concurrency
DCP=4 is the default and the validated flagship: 250k context · ~3× concurrency
(710,593-token KV pool) · ~60 tok/s decode (warm, with MTP). Sharding the MLA KV
across the 4 GPUs along the sequence dim is what makes 250k fit at all.
| DCP_SIZE | Max context | KV pool @ max | Concurrency | Decode (warm) | Notes |
|---|---|---|---|---|---|
| 4 (default) | 250000 | 710,593 tok | ~3× (2.84×) | ~60 tok/s | the recipe |
| 2 | 250000 | ~355k tok | 1.42× | ~49 tok/s | middle ground |
| 1 | ~131k (OOM > ~177k) | ~178k tok | 1.36× | ~81 tok/s | optional: smaller ctx, faster |
Measured at temp 0, b12x, MTP=3, fp8 KV. Decode is steady-state and MTP-acceptance-dependent (~60 tok/s warm short ctx). Cold TTFT is compile-dominated — the first request at a new prompt length JIT-compiles that size bucket (tens of seconds), then warm / prefix-cache hits are fast. A slow first request is the kernel cache warming up, not a hang.
Only if you specifically want a smaller, faster ≤128k box instead, set DCP_SIZE=1 and MAX_MODEL_LEN=131072 in .env (~81 tok/s, but caps at ~131k and loses the 250k / ~3× headroom).
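The Concurrency column appears to be the KV pool divided by the maximum context length, i.e. how many full-length requests fit at once. A quick check, using the rounded pool sizes from the table (the DCP=1 row is assumed to use MAX_MODEL_LEN=131072):

```python
def concurrency(kv_pool_tokens, max_model_len):
    """How many full-length requests the KV pool can hold at once."""
    return kv_pool_tokens / max_model_len


rows = {
    "DCP=4": (710_593, 250_000),
    "DCP=2": (355_000, 250_000),   # pool is approximate (~355k)
    "DCP=1": (178_000, 131_072),   # pool is approximate (~178k)
}
for name, (pool, ctx) in rows.items():
    print(f"{name}: {concurrency(pool, ctx):.2f}x")
# DCP=4: 2.84x
# DCP=2: 1.42x
# DCP=1: 1.36x
```

Shorter requests use less of the pool, so real concurrency at typical prompt lengths is bounded by --max-num-seqs rather than by this ratio.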
Why NCCL_P2P_DISABLE=1
These RTX PRO 6000 cards are PCIe-only (no NVLink); the b12x PCIe allreduce path hangs at NCCL init unless P2P is disabled.
Performance (measured, warm)
| Metric | Value |
|---|---|
| Decode | ~60 tok/s (short ctx, warm, MTP), ~45 tok/s @ 64k–100k |
| Prefill | ~5,100 tok/s @ 64k (warm); ~45k–65k tok/s on prefix-cache hits |
| TTFT | sub-second (short ctx); ~12 s for a fresh uncached 64k prefill |
| Concurrency | ~3× (2.84×) at 250k (710,593-token KV pool) |
First touch of a brand-new long prefix incurs a one-time compile of that size bucket (e.g. ~195 s for a fresh 99.5k prompt). Subsequent same-size prefills and prefix-cache hits are fast.
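These rates allow a rough latency estimate for a warm, uncached request. The sketch below assumes prefill and decode run at the constant rates from the table and excludes the one-time compile cost; estimate_seconds is an illustrative name.

```python
PREFILL_TOK_S = 5_100   # warm, uncached prefill rate measured at 64k
DECODE_TOK_S = 45       # decode rate measured at 64k-100k context


def estimate_seconds(prompt_tokens, output_tokens,
                     prefill_tok_s=PREFILL_TOK_S, decode_tok_s=DECODE_TOK_S):
    """Rough warm-path latency: (time to first token, total time) in seconds."""
    ttft = prompt_tokens / prefill_tok_s
    return ttft, ttft + output_tokens / decode_tok_s


ttft, total = estimate_seconds(prompt_tokens=64_000, output_tokens=1_000)
print(f"TTFT ~{ttft:.1f} s, total ~{total:.1f} s")
# TTFT ~12.5 s, total ~34.8 s
```

The 12.5 s figure lines up with the ~12 s TTFT reported for a fresh 64k prefill. Prefix-cache hits and short contexts will be considerably faster than this estimate.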
Testing
python3 test/coherence_test.py core # logic, math, code, philosophy, ascii, multi-turn
python3 test/coherence_test.py long # 64k needle-in-haystack recall
python3 test/longctx_multiturn_test.py # ~100k-token, 6-turn reasoning battery
All harnesses stream with no max_tokens and report TTFT, prefill tok/s, and
decode tok/s. The reasoning model emits its chain-of-thought in the reasoning field
and the final answer in content.
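For readers writing their own harness, these three metrics can be derived from token arrival times. This is a sketch of one common set of definitions, not necessarily what the repo's tests compute; stream_metrics and its arguments are hypothetical.

```python
def stream_metrics(request_start, token_times, prompt_tokens):
    """Compute TTFT, prefill tok/s, and decode tok/s from arrival timestamps.

    request_start: time the request was sent (seconds)
    token_times:   arrival time of each streamed token, in order
    prompt_tokens: number of tokens in the prompt
    """
    if not token_times:
        raise ValueError("no tokens were streamed")
    ttft = token_times[0] - request_start
    decode_window = token_times[-1] - token_times[0]
    decoded = len(token_times) - 1   # tokens produced after the first one
    return {
        "ttft_s": ttft,
        "prefill_tok_s": prompt_tokens / ttft if ttft > 0 else float("inf"),
        "decode_tok_s": decoded / decode_window if decode_window > 0 else 0.0,
    }


# Synthetic example: first token after 2 s, then one token every 0.5 s.
print(stream_metrics(0.0, [2.0, 2.5, 3.0, 3.5, 4.0], prompt_tokens=1_000))
```

With MTP speculative decode a single streamed chunk may carry several tokens, so a real harness should count tokens (for example from the usage field) rather than chunks.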
fp8 / fp4 KV cache on SM120
- fp8 KV (fp8_ds_mla) works and is the practical floor. The checkpoint ships no k/v_scale, so fp8 runs at scale 1.0 (a one-line startup warning).
- fp4 KV is hardware-blocked on SM120: the DSA fp4 indexer cache asserts SM100 (datacenter Blackwell, B200/GB200). Not available on the RTX PRO 6000.
Troubleshooting
| Symptom | Fix |
|---|---|
| OOM at 250k, “estimated maximum model length ~177728” | set DCP_SIZE=4 |
| Garbage / incoherent long-context output | ensure INDEX_TOPK_PATTERN is set (57 skip lines on boot) |
| Hang at NCCL init | keep NCCL_P2P_DISABLE=1 |
| Garbage at all lengths | --kv-cache-dtype fp8 is mandatory on SM120 |
| Empty / null content | ENABLE_THINKING=1 with a small max_tokens (thinking ate the budget): raise max_tokens to ≥ 2000 or run with thinking off |