My GLM-5.2-FP8 HGX-H200 SGLang docker deploy config
Summary
A user shares a Docker deployment configuration for serving the GLM-5.2-FP8 model with SGLang on HGX-H200 hardware, reaching a 262k-token context and 70 tokens/s.
Similar Articles
@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…
A quantized 478B-parameter GLM-5.1 model runs on 4×RTX Pro 6000 GPUs via SGLang, delivering 370k-token context at up to 45 tok/s decode and 1340 tok/s prefill, and is demoed driving Figma.
@0xSero: Rejoice fellow 6000 enjoyers. We have GLM at home
A turnkey Docker setup to serve the GLM-5.2-NVFP4-REAP-469B model on 4× RTX PRO 6000 Blackwell GPUs using vLLM, with detailed instructions and configuration options.
Giving GLM-5.2 a spin locally on CPU only! (poor man's rig for big models)
A user runs GLM-5.2 locally on a CPU-only machine, showing that a large model can run on a modest setup.
zai-org/GLM-5.2-FP8
Z.AI releases GLM-5.2, a flagship open-source model with a solid 1M-token context, improved coding capabilities, and a new IndexShare sparse attention architecture that reduces FLOPs by 2.9x at 1M context.
GLM-5.2: Built for Long-Horizon Tasks
Z.AI introduces GLM-5.2, a flagship model designed for long-horizon tasks with a solid 1M-token context, improved coding capabilities, and an MIT open-source license, showing competitive performance against leading models like Opus 4.8 and GPT-5.5.