@0xSero: Rejoice fellow 6000 enjoyers. We have GLM at home

X AI KOLs Following Tools

Summary

A turnkey Docker setup to serve the GLM-5.2-NVFP4-REAP-469B model on 4× RTX PRO 6000 Blackwell GPUs using vLLM, with detailed instructions and configuration options.

Rejoice fellow 6000 enjoyers. We have GLM at home https://t.co/pAUQf4WPpe
Original Article
View Cached Full Text

Cached at: 06/19/26, 12:13 AM

Rejoice fellow 6000 enjoyers.

We have GLM at home https://t.co/pAUQf4WPpe


0xSero/glm-5.2-sm120

Source: https://github.com/0xSero/glm-5.2-sm120

GLM-5.2-NVFP4-REAP-469B — vLLM serving (4× RTX PRO 6000 Blackwell)

A turnkey Docker setup to serve 0xSero/GLM-5.2-NVFP4-REAP-469B (REAP-pruned, NVFP4, DeepSeek-Sparse-Attention) on 4× NVIDIA RTX PRO 6000 Blackwell (SM120, 96 GB each) with the voipmonitor b12x vLLM image.

Model: huggingface.co/0xSero/GLM-5.2-NVFP4-REAP-469B · ~313 GB on disk (NVFP4) · REAP-pruned 469B MoE · DeepSeek Sparse Attention + MTP.

Validated config: DCP=4 · 250k context · ~3× concurrency · ~60 tok/s decode · fp8 KV cache · MTP speculative decode · tool-calling · thinking off by default (toggle on).

Hardware target

GPUs4× RTX PRO 6000 Blackwell (SM120), 96 GB each, no NVLink (PCIe)
Model on disk~313 GB (NVFP4), ~78.6 GB/GPU resident
InterconnectPCIe — requires NCCL_P2P_DISABLE=1 (see below)

Prerequisites

  • Docker + the NVIDIA Container Toolkit.
  • Access to the b12x image (voipmonitor/vllm:black-benediction-…). It is the only image that bundles GlmMoeDsaForCausalLM + Glm4MoeMTPModel + the SM120 sparse-MLA kernel (B12X_MLA_SPARSE) + the ModelOpt NVFP4 MoE loader.
  • The model weights on a local path (default /mnt/llm_models/GLM-5.2-NVFP4-REAP-469B).

Quick start — one command

# 1. Download the weights (~313 GB NVFP4) — needs the hf CLI: pip install -U huggingface_hub
hf download 0xSero/GLM-5.2-NVFP4-REAP-469B --local-dir /mnt/llm_models/GLM-5.2-NVFP4-REAP-469B

# 2. (optional) point at a different weights path / port
cp .env.example .env        # defaults already target /mnt/llm_models/... on port 8000

# 3. Launch. This starts the server, waits for it, and smoke-tests it.
./launch.sh

launch.sh blocks until the server is actually answering, then prints:

  ✅ READY — GLM-5.2 is serving and answered: READY
  Endpoint : http://localhost:8000/v1   (model: GLM-5.2-NVFP4-REAP-469B)
  Try it   : ./chat.sh "write a haiku about GPUs"

(First boot compiles kernels + captures CUDA graphs, ~6 min — launch.sh waits it out and tails the log if anything fails.) Then talk to it:

./chat.sh "explain quicksort in 3 bullet points"
# or hit the OpenAI-compatible API directly:
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"GLM-5.2-NVFP4-REAP-469B","messages":[{"role":"user","content":"hello"}]}'

docker compose up -d works too (same config; thinking off).

Reasoning (thinking) — off by default, so it just works

GLM-5.2 is a reasoning model. Thinking is OFF by default so a normal request — even with a small max_tokens — returns a direct answer in message.content.

Why this matters: a reasoning model with thinking on spends its token budget “thinking” first. A short max_tokens then gets consumed before the answer, so content comes back empty (finish_reason: "length"). Defaulting thinking off avoids that entirely — the endpoint behaves like any normal chat model.

Want the chain-of-thought (better on hard math/code/logic)? Turn it on:

ENABLE_THINKING=1 ./launch.sh          # restart with reasoning enabled
./chat.sh --think "prove sqrt(2) is irrational"

With ENABLE_THINKING=1 the server adds --reasoning-parser glm45, so the chain-of-thought streams into message.reasoning and the answer into message.content. Give thinking requests a generous max_tokens (≥ 2000).

ModeDefault request behaviourUse when
thinking off (default)answer directly in content, any max_tokensgeneral use, agents, tools
thinking on (ENABLE_THINKING=1)reasoning + content split; use max_tokens ≥ 2000hard reasoning, benchmarks

Tool calling (--tool-call-parser glm47) works in both modes — parse message.tool_calls as usual.

Validated configuration

SettingValueWhy
--tensor-parallel-size4one shard per GPU
--decode-context-parallel-size4shards MLA KV across the 4 GPUs → 250k fits
--max-model-len250000710,593-token pool → 2.84× concurrency at 250k
--max-num-seqs2target concurrency
--kv-cache-dtypefp8fp8_ds_mla; required on SM120 (bf16 = garbage)
--quantizationmodelopt_fp4NVFP4 weights
--attention-backendB12X_MLA_SPARSESM120-native sparse MLA decode
--moe-backendb12xNVFP4 MoE
--speculative-configmtp, 3 tokensMTP speculative decode
--hf-overridesindex_topk_patterncoherence-critical (see below)
--tool-call-parserglm47tool calls (always on)
--reasoning-parserglm45only with ENABLE_THINKING=1 (thinking off by default)

Why index_topk_pattern (coherence-critical)

GLM-5.2 uses DeepSeek Sparse Attention. vLLM reads index_topk_pattern, not the checkpoint’s indexer_types array. Without the pattern, all 78 layers build full indexers and the 57 “share/skip” (S) layers corrupt long-context attention → garbage output. The 78-char F/S string (21 F, 57 S) is derived from the model’s indexer_types and injected via --hf-overrides. On boot you should see 57 log lines: Using index_topk_pattern/index_topk_freq to skip sparse MLA indexer ….

Why DCP_SIZE=4

With DCP=1 the MLA KV cache is replicated per TP rank, so a single 250k request needs ~14.5 GB but only ~10.3 GB/GPU is free → OOM (max ~177k). decode-context-parallel-size=4 shards the KV across the 4 GPUs along the sequence dim, yielding a 710,593-token pool.

DCP=4 is the recipe — 250k context at ~3× concurrency

DCP=4 is the default and the validated flagship: 250k context · ~3× concurrency (710,593-token KV pool) · ~60 tok/s decode (warm, with MTP). Sharding the MLA KV across the 4 GPUs along the sequence dim is what makes 250k fit at all.

DCP_SIZEMax contextKV pool @ maxConcurrencyDecode (warm)
4 (default)250000710,593 tok~3× (2.84×)~60 tok/sthe recipe
2250000~355k tok1.42×~49 tok/smiddle ground
1~131k (OOM > ~177k)~178k tok1.36×~81 tok/soptional: smaller ctx, faster

Measured at temp 0, b12x, MTP=3, fp8 KV. Decode is steady-state and MTP-acceptance-dependent (~60 tok/s warm short ctx). Cold TTFT is compile-dominated — the first request at a new prompt length JIT-compiles that size bucket (tens of seconds), then warm / prefix-cache hits are fast. A slow first request is the kernel cache warming up, not a hang.

Only if you specifically want a smaller, faster ≤128k box instead, set DCP_SIZE=1

  • MAX_MODEL_LEN=131072 in .env (~81 tok/s, but caps at ~131k and loses the 250k / ~3× headroom).

Why NCCL_P2P_DISABLE=1

These RTX PRO 6000 are PCIe (no NVLink); the b12x PCIe allreduce path hangs at NCCL init without P2P disabled.

Performance (measured, warm)

MetricValue
Decode~60 tok/s (short ctx, warm, MTP), ~45 tok/s @ 64k–100k
Prefill~5,100 tok/s @ 64k (warm); ~45k–65k tok/s on prefix-cache hits
TTFTsub-second (short ctx); ~12 s for a fresh uncached 64k prefill
Concurrency~3× (2.84×) at 250k (710,593-token KV pool)

First touch of a brand-new long prefix incurs a one-time compile of that size bucket (e.g. ~195 s for a fresh 99.5k prompt). Subsequent same-size prefills and prefix-cache hits are fast.

Testing

python3 test/coherence_test.py core         # logic, math, code, philosophy, ascii, multi-turn
python3 test/coherence_test.py long         # 64k needle-in-haystack recall
python3 test/longctx_multiturn_test.py      # ~100k-token, 6-turn reasoning battery

All harnesses stream with no max_tokens and report TTFT, prefill tok/s, and decode tok/s. The reasoning model emits its chain-of-thought in the reasoning field and the final answer in content.

fp8 / fp4 KV cache on SM120

  • fp8 KV (fp8_ds_mla) works and is the practical floor. The checkpoint ships no k/v_scale, so fp8 runs at scale 1.0 (a one-line startup warning).
  • fp4 KV is hardware-blocked on SM120: the DSA fp4 indexer cache asserts SM100 (datacenter Blackwell, B200/GB200). Not available on the RTX PRO 6000.

Troubleshooting

SymptomFix
OOM at 250k, “estimated maximum model length ~177728”set DCP_SIZE=4
Garbage / incoherent long-context outputensure INDEX_TOPK_PATTERN is set (57 skip lines on boot)
Hang at NCCL initkeep NCCL_P2P_DISABLE=1
Garbage at all lengths--kv-cache-dtype fp8 is mandatory on SM120
Empty / null contentrunning with ENABLE_THINKING=1 + small max_tokens (thinking ate the budget)

Similar Articles