@lmsysorg: NVIDIA just released an NVFP4 checkpoint of GLM-5.2 from @Zai_org, a 744B MoE (40B active) for reasoning & coding. Day-…
Summary
NVIDIA released an NVFP4 quantized checkpoint of GLM-5.2, a 744B MoE model (40B active) optimized for reasoning and coding, with day-0 support in SGLang.
Cached at: 06/28/26, 12:01 PM
NVIDIA just released an NVFP4 checkpoint of GLM-5.2 from @Zai_org, a 744B MoE (40B active) for reasoning & coding. Day-0 support is live in SGLang! @nvidia
- NVFP4 quantization via NVIDIA Model Optimizer: frontier-class reasoning at a fraction of the memory
- Sparse attention with IndexShare indexer for efficient long-context
- Ready to serve on Blackwell / Grace Blackwell, run it now with SGLang!
Cookbook:
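To put "a fraction of the memory" in rough numbers, here is a back-of-the-envelope sketch of weight storage at different precisions. It assumes every one of the 744B parameters is stored in the named format, which is a simplification: real checkpoints usually keep some layers in higher precision, and the estimate ignores KV cache and activations. The 4.5 bits per weight for NVFP4 reflects 4-bit values plus one 8-bit scale per 16-value block. The function and constant names are illustrative only.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for n_params at a given bit width."""
    return n_params * bits_per_weight / 8 / 2**30


# Total parameter count quoted in the post (assumed fully quantized here).
TOTAL_PARAMS = 744e9

# NVFP4: 4-bit values plus one 8-bit scale shared by each block of 16 values,
# so the effective cost is 4 + 8/16 = 4.5 bits per weight.
NVFP4_BITS = 4 + 8 / 16

FORMATS = {"BF16": 16.0, "FP8": 8.0, "NVFP4": NVFP4_BITS}


def footprint_table(n_params: float = TOTAL_PARAMS) -> dict:
    """Map each format name to its estimated weight footprint in GiB."""
    return {name: weight_gib(n_params, bits) for name, bits in FORMATS.items()}


if __name__ == "__main__":
    for name, gib in footprint_table().items():
        print(f"{name:>6}: {gib:8.1f} GiB")
```

Under these assumptions NVFP4 weights take a little over half the space of FP8 and a bit more than a quarter of BF16; actual checkpoint sizes will differ.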
Similar Articles
nvidia/GLM-5.2-NVFP4
NVIDIA released GLM-5.2-NVFP4, a quantized version of ZAI's GLM-5.2 MoE language model optimized for inference on NVIDIA Blackwell GPUs using Model Optimizer.
@mr_r0b0t: Official @NVIDIAAI GLM5.1-NVFP4 spotted on @huggingface
NVIDIA releases GLM-5.1-NVFP4, a quantized version of ZAI's GLM-5.1 model with 754B total parameters (40B activated), available on Hugging Face under MIT license.
@HuggingPapers: NVIDIA just released an optimized GLM-5.2 on Hugging Face A 753B parameter MoE with 1M context, quantized to NVFP4 for …
NVIDIA released an optimized GLM-5.2 MoE model on Hugging Face with 753B parameters and 1M context, quantized to NVFP4 for Blackwell GPUs while nearly matching FP8 accuracy.
zai-org/GLM-5.2-FP8
Z.AI releases GLM-5.2, a flagship open-source model with a solid 1M-token context, improved coding capabilities, and a new IndexShare sparse attention architecture that reduces FLOPs by 2.9x at 1M context.
@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…
A quantized 478B-parameter GLM-5.1 model runs on 4×RTX Pro 6000 GPUs via SGLang, delivering 370k-token context at up to 45 tok/s decode and 1340 tok/s prefill, and is demoed driving Figma.