My GLM-5.2-FP8 HGX-H200 SGLang docker deploy config

Reddit r/LocalLLaMA Tools

Summary

A user shares their Docker deployment configuration for running the GLM-5.2-FP8 model on HGX-H200 hardware using SGLang, achieving 262k context and 70 tokens/s.

Halo lads. Name says it all. Right now, after 1-2 hours of experimenting, this is maximum i could squeeze out current hardware No, im not rich. Its my companies GPUs, just sharing my experience docker run -d \ --name glm-5.2-sglang \ --restart unless-stopped \ --gpus all \ --shm-size 32g \ --ipc=host \ -v /data/models/glm-5.2:/model \ -p 30000:30000 \ lmsysorg/sglang:latest \ sglang serve \ --model-path /model \ --served-model-name glm-5.2 \ --host 0.0.0.0 \ --port 30000 \ --tp 8 \ --mem-fraction-static 0.83 \ --enable-metrics \ --reasoning-parser glm45 \ --tool-call-parser glm47 \ --cuda-graph-max-bs 256 Cookbook`s flags, i did not use: DP - limits context to 120k~ on each shard. I turned off everything related to it, just pure TP moe-a2a-backend deepep - idk how, but it actually slows down token/s. 50t/s~ on vs 70t/s~ off mem-fraction-static 0.83 - if you try to use more, OOM guaranteed result is 262k context and 70t/s So ye, that`s it. If you have any questions feel free to ask, i`ll try to answer btw vLLM official recipes wont work for H200. i guess, its because of kv cache fp8 quant on dsv3 architecture
Original Article

Similar Articles

zai-org/GLM-5.2-FP8

Hugging Face Models Trending

Z.AI releases GLM-5.2, a flagship open-source model with a solid 1M-token context, improved coding capabilities, and a new IndexShare sparse attention architecture that reduces FLOPs by 2.9x at 1M context.

GLM-5.2: Built for Long-Horizon Tasks

Hugging Face Blog

Z.AI introduces GLM-5.2, a flagship model designed for long-horizon tasks with a solid 1M-token context, improved coding capabilities, and an MIT open-source license, showing competitive performance against leading models like Opus 4.8 and GPT-5.5.