@0xSero: Rejoice fellow 6000 enjoyers. We have GLM at home

X AI KOLs Following 06/18/26, 08:20 AM Tools

docker vllm model-serving glm-5-2 blackwell-gpus nvfp4 deepseek-sparse-attention

Summary

A turnkey Docker setup to serve the GLM-5.2-NVFP4-REAP-469B model on 4× RTX PRO 6000 Blackwell GPUs using vLLM, with detailed instructions and configuration options.

Rejoice fellow 6000 enjoyers. We have GLM at home https://t.co/pAUQf4WPpe

Original Article

View Cached Full Text

Cached at: 06/19/26, 12:13 AM

Rejoice fellow 6000 enjoyers.

We have GLM at home https://t.co/pAUQf4WPpe

0xSero/glm-5.2-sm120

Source: https://github.com/0xSero/glm-5.2-sm120

GLM-5.2-NVFP4-REAP-469B — vLLM serving (4× RTX PRO 6000 Blackwell)

A turnkey Docker setup to serve 0xSero/GLM-5.2-NVFP4-REAP-469B (REAP-pruned, NVFP4, DeepSeek-Sparse-Attention) on 4× NVIDIA RTX PRO 6000 Blackwell (SM120, 96 GB each) with the voipmonitor b12x vLLM image.

Model: huggingface.co/0xSero/GLM-5.2-NVFP4-REAP-469B · ~313 GB on disk (NVFP4) · REAP-pruned 469B MoE · DeepSeek Sparse Attention + MTP.

Validated config: DCP=4 · 250k context · ~3× concurrency · ~60 tok/s decode · fp8 KV cache · MTP speculative decode · tool-calling · thinking off by default (toggle on).

Hardware target


GPUs	4× RTX PRO 6000 Blackwell (SM120), 96 GB each, no NVLink (PCIe)
Model on disk	~313 GB (NVFP4), ~78.6 GB/GPU resident
Interconnect	PCIe — requires `NCCL_P2P_DISABLE=1` (see below)

Prerequisites

Docker + the NVIDIA Container Toolkit.
Access to the b12x image (voipmonitor/vllm:black-benediction-…). It is the only image that bundles GlmMoeDsaForCausalLM + Glm4MoeMTPModel + the SM120 sparse-MLA kernel (B12X_MLA_SPARSE) + the ModelOpt NVFP4 MoE loader.
The model weights on a local path (default /mnt/llm_models/GLM-5.2-NVFP4-REAP-469B).

Quick start — one command

# 1. Download the weights (~313 GB NVFP4) — needs the hf CLI: pip install -U huggingface_hub
hf download 0xSero/GLM-5.2-NVFP4-REAP-469B --local-dir /mnt/llm_models/GLM-5.2-NVFP4-REAP-469B

# 2. (optional) point at a different weights path / port
cp .env.example .env        # defaults already target /mnt/llm_models/... on port 8000

# 3. Launch. This starts the server, waits for it, and smoke-tests it.
./launch.sh

launch.sh blocks until the server is actually answering, then prints:

  ✅ READY — GLM-5.2 is serving and answered: READY
  Endpoint : http://localhost:8000/v1   (model: GLM-5.2-NVFP4-REAP-469B)
  Try it   : ./chat.sh "write a haiku about GPUs"

(First boot compiles kernels + captures CUDA graphs, ~6 min — launch.sh waits it out and tails the log if anything fails.) Then talk to it:

./chat.sh "explain quicksort in 3 bullet points"
# or hit the OpenAI-compatible API directly:
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"GLM-5.2-NVFP4-REAP-469B","messages":[{"role":"user","content":"hello"}]}'

docker compose up -d works too (same config; thinking off).

Reasoning (thinking) — off by default, so it just works

GLM-5.2 is a reasoning model. Thinking is OFF by default so a normal request — even with a small max_tokens — returns a direct answer in message.content.

Why this matters: a reasoning model with thinking on spends its token budget “thinking” first. A short max_tokens then gets consumed before the answer, so content comes back empty (finish_reason: "length"). Defaulting thinking off avoids that entirely — the endpoint behaves like any normal chat model.

Want the chain-of-thought (better on hard math/code/logic)? Turn it on:

ENABLE_THINKING=1 ./launch.sh          # restart with reasoning enabled
./chat.sh --think "prove sqrt(2) is irrational"

With ENABLE_THINKING=1 the server adds --reasoning-parser glm45, so the chain-of-thought streams into message.reasoning and the answer into message.content. Give thinking requests a generous max_tokens (≥ 2000).

Mode	Default request behaviour	Use when
thinking off (default)	answer directly in `content`, any `max_tokens`	general use, agents, tools
thinking on (`ENABLE_THINKING=1`)	`reasoning` + `content` split; use `max_tokens` ≥ 2000	hard reasoning, benchmarks

Tool calling (--tool-call-parser glm47) works in both modes — parse message.tool_calls as usual.

Validated configuration

Setting	Value	Why
`--tensor-parallel-size`	4	one shard per GPU
`--decode-context-parallel-size`	4	shards MLA KV across the 4 GPUs → 250k fits
`--max-model-len`	250000	710,593-token pool → 2.84× concurrency at 250k
`--max-num-seqs`	2	target concurrency
`--kv-cache-dtype`	fp8	`fp8_ds_mla`; required on SM120 (bf16 = garbage)
`--quantization`	modelopt_fp4	NVFP4 weights
`--attention-backend`	B12X_MLA_SPARSE	SM120-native sparse MLA decode
`--moe-backend`	b12x	NVFP4 MoE
`--speculative-config`	mtp, 3 tokens	MTP speculative decode
`--hf-overrides`	`index_topk_pattern`	coherence-critical (see below)
`--tool-call-parser`	glm47	tool calls (always on)
`--reasoning-parser`	glm45	only with `ENABLE_THINKING=1` (thinking off by default)

Why `index_topk_pattern` (coherence-critical)

GLM-5.2 uses DeepSeek Sparse Attention. vLLM reads index_topk_pattern, not the checkpoint’s indexer_types array. Without the pattern, all 78 layers build full indexers and the 57 “share/skip” (S) layers corrupt long-context attention → garbage output. The 78-char F/S string (21 F, 57 S) is derived from the model’s indexer_types and injected via --hf-overrides. On boot you should see 57 log lines: Using index_topk_pattern/index_topk_freq to skip sparse MLA indexer ….

Why `DCP_SIZE=4`

With DCP=1 the MLA KV cache is replicated per TP rank, so a single 250k request needs ~14.5 GB but only ~10.3 GB/GPU is free → OOM (max ~177k). decode-context-parallel-size=4 shards the KV across the 4 GPUs along the sequence dim, yielding a 710,593-token pool.

DCP=4 is the recipe — 250k context at ~3× concurrency

DCP=4 is the default and the validated flagship: 250k context · ~3× concurrency (710,593-token KV pool) · ~60 tok/s decode (warm, with MTP). Sharding the MLA KV across the 4 GPUs along the sequence dim is what makes 250k fit at all.

`DCP_SIZE`	Max context	KV pool @ max	Concurrency	Decode (warm)
4 (default)	250000	710,593 tok	~3× (2.84×)	~60 tok/s	the recipe
2	250000	~355k tok	1.42×	~49 tok/s	middle ground
1	~131k (OOM > ~177k)	~178k tok	1.36×	~81 tok/s	optional: smaller ctx, faster

Measured at temp 0, b12x, MTP=3, fp8 KV. Decode is steady-state and MTP-acceptance-dependent (~60 tok/s warm short ctx). Cold TTFT is compile-dominated — the first request at a new prompt length JIT-compiles that size bucket (tens of seconds), then warm / prefix-cache hits are fast. A slow first request is the kernel cache warming up, not a hang.

Only if you specifically want a smaller, faster ≤128k box instead, set DCP_SIZE=1

MAX_MODEL_LEN=131072 in .env (~81 tok/s, but caps at ~131k and loses the 250k / ~3× headroom).

Why `NCCL_P2P_DISABLE=1`

These RTX PRO 6000 are PCIe (no NVLink); the b12x PCIe allreduce path hangs at NCCL init without P2P disabled.

Performance (measured, warm)

Metric	Value
Decode	~60 tok/s (short ctx, warm, MTP), ~45 tok/s @ 64k–100k
Prefill	~5,100 tok/s @ 64k (warm); ~45k–65k tok/s on prefix-cache hits
TTFT	sub-second (short ctx); ~12 s for a fresh uncached 64k prefill
Concurrency	~3× (2.84×) at 250k (710,593-token KV pool)

First touch of a brand-new long prefix incurs a one-time compile of that size bucket (e.g. ~195 s for a fresh 99.5k prompt). Subsequent same-size prefills and prefix-cache hits are fast.

Testing

python3 test/coherence_test.py core         # logic, math, code, philosophy, ascii, multi-turn
python3 test/coherence_test.py long         # 64k needle-in-haystack recall
python3 test/longctx_multiturn_test.py      # ~100k-token, 6-turn reasoning battery

All harnesses stream with no max_tokens and report TTFT, prefill tok/s, and decode tok/s. The reasoning model emits its chain-of-thought in the reasoning field and the final answer in content.

fp8 / fp4 KV cache on SM120

fp8 KV (fp8_ds_mla) works and is the practical floor. The checkpoint ships no k/v_scale, so fp8 runs at scale 1.0 (a one-line startup warning).
fp4 KV is hardware-blocked on SM120: the DSA fp4 indexer cache asserts SM100 (datacenter Blackwell, B200/GB200). Not available on the RTX PRO 6000.

Troubleshooting

Symptom	Fix
OOM at 250k, “estimated maximum model length ~177728”	set `DCP_SIZE=4`
Garbage / incoherent long-context output	ensure `INDEX_TOPK_PATTERN` is set (57 skip lines on boot)
Hang at NCCL init	keep `NCCL_P2P_DISABLE=1`
Garbage at all lengths	`--kv-cache-dtype fp8` is mandatory on SM120
Empty / `null` `content`	running with `ENABLE_THINKING=1` + small `max_tokens` (thinking ate the budget)

@0xSero: Rejoice fellow 6000 enjoyers. We have GLM at home

0xSero/glm-5.2-sm120

GLM-5.2-NVFP4-REAP-469B — vLLM serving (4× RTX PRO 6000 Blackwell)

Hardware target

Prerequisites

Quick start — one command

Reasoning (thinking) — off by default, so it just works

Validated configuration

Why `index_topk_pattern` (coherence-critical)

Why `DCP_SIZE=4`

DCP=4 is the recipe — 250k context at ~3× concurrency

Why `NCCL_P2P_DISABLE=1`

Performance (measured, warm)

Testing

fp8 / fp4 KV cache on SM120

Troubleshooting

Similar Articles

@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…

@totheagi: We're the first to make the full GLM-5.2 (FP8) run on RTX 4090s. GLM-5.2 is the new 753B SOTA open-weights model, and i…

@leopardracer: https://x.com/leopardracer/status/2055341758523883631

My GLM-5.2-FP8 HGX-H200 SGLang docker deploy config

@ollama: GLM 5.2 on Ollama's cloud just doubled GPU capacity to handle the volume of usage! This is all US based, and running on…

Submit Feedback

Similar Articles

@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…

@totheagi: We're the first to make the full GLM-5.2 (FP8) run on RTX 4090s. GLM-5.2 is the new 753B SOTA open-weights model, and i…

@leopardracer: https://x.com/leopardracer/status/2055341758523883631

My GLM-5.2-FP8 HGX-H200 SGLang docker deploy config

@ollama: GLM 5.2 on Ollama's cloud just doubled GPU capacity to handle the volume of usage! This is all US based, and running on…
Ollama doubled GPU capacity for GLM 5.2 on its US cloud, using NVIDIA B300 Blackwell GPUs, emphasizing privacy and open models.

0xSero/glm-5.2-sm120

GLM-5.2-NVFP4-REAP-469B — vLLM serving (4× RTX PRO 6000 Blackwell)

Hardware target

Prerequisites

Quick start — one command

Reasoning (thinking) — off by default, so it just works

Validated configuration

Why index_topk_pattern (coherence-critical)

Why DCP_SIZE=4

DCP=4 is the recipe — 250k context at ~3× concurrency

Why NCCL_P2P_DISABLE=1

Performance (measured, warm)

Testing

fp8 / fp4 KV cache on SM120

Troubleshooting

Similar Articles

@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…

@totheagi: We're the first to make the full GLM-5.2 (FP8) run on RTX 4090s. GLM-5.2 is the new 753B SOTA open-weights model, and i…

@leopardracer: https://x.com/leopardracer/status/2055341758523883631

My GLM-5.2-FP8 HGX-H200 SGLang docker deploy config

@ollama: GLM 5.2 on Ollama's cloud just doubled GPU capacity to handle the volume of usage! This is all US based, and running on…

Submit Feedback

Why `index_topk_pattern` (coherence-critical)

Why `DCP_SIZE=4`

Why `NCCL_P2P_DISABLE=1`