My GLM-5.2-FP8 HGX-H200 SGLang docker deploy config

Reddit r/LocalLLaMA 06/17/26, 06:03 PM Tools

glm-5.2 sglang docker deployment configuration h200 fp8

Summary

A user shares their Docker deployment configuration for running the GLM-5.2-FP8 model on HGX-H200 hardware using SGLang, achieving 262k context and 70 tokens/s.

Halo lads. Name says it all. Right now, after 1-2 hours of experimenting, this is maximum i could squeeze out current hardware No, im not rich. Its my companies GPUs, just sharing my experience docker run -d \ --name glm-5.2-sglang \ --restart unless-stopped \ --gpus all \ --shm-size 32g \ --ipc=host \ -v /data/models/glm-5.2:/model \ -p 30000:30000 \ lmsysorg/sglang:latest \ sglang serve \ --model-path /model \ --served-model-name glm-5.2 \ --host 0.0.0.0 \ --port 30000 \ --tp 8 \ --mem-fraction-static 0.83 \ --enable-metrics \ --reasoning-parser glm45 \ --tool-call-parser glm47 \ --cuda-graph-max-bs 256 Cookbook`s flags, i did not use: DP - limits context to 120k~ on each shard. I turned off everything related to it, just pure TP moe-a2a-backend deepep - idk how, but it actually slows down token/s. 50t/s~ on vs 70t/s~ off mem-fraction-static 0.83 - if you try to use more, OOM guaranteed result is 262k context and 70t/s So ye, that`s it. If you have any questions feel free to ask, i`ll try to answer btw vLLM official recipes wont work for H200. i guess, its because of kv cache fp8 quant on dsv3 architecture

Original Article

My GLM-5.2-FP8 HGX-H200 SGLang docker deploy config

Similar Articles

@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…

@0xSero: Rejoice fellow 6000 enjoyers. We have GLM at home

Giving GLM-5.2 a spin locally on CPU only! (poor man's rig for big models)

zai-org/GLM-5.2-FP8

GLM-5.2: Built for Long-Horizon Tasks

Submit Feedback

Similar Articles

@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…

@0xSero: Rejoice fellow 6000 enjoyers. We have GLM at home

Giving GLM-5.2 a spin locally on CPU only! (poor man's rig for big models)

GLM-5.2: Built for Long-Horizon Tasks