@karminski3: Local deployment of GLM-5.2 with vLLM finally gets fast! Good news for local GLM-5.2 deployment! As we know, GLM-5.2 now comes with a built-in MTP head for speculative decoding. However, this only works with the bf16 original precision GLM-5.2, which...


Summary

Community efforts, including a hybrid quantization approach by dnhkng, have enabled vLLM and SGLang to support GLM-5.2 with MTP heads, boosting local inference speed from 2 token/s to over 43 token/s on dual GH200 hardware. The challenge involved managing DSA-based MTP and quantization compatibility.

The speed of deploying GLM-5.2 locally with vLLM has finally improved! Good news finally arrives for local GLM-5.2 deployment! As everyone knows, GLM-5.2 now comes with an MTP head, enabling speculative decoding. However, this only applies to the bf16 original precision GLM-5.2, which requires 1.5TB — few local setups have that luxury. So everyone uses various quantized versions, since 4-bit quantization brings it down to just 430GB. Here's the problem: because GLM-5.2's MTP uses a very special DSA (Dynamic Sparse Attention), the current inference engines (llama.cpp, vLLM, mlx) all fail to support it. Specifically, llama.cpp and mlx cannot enable MTP at all, while vLLM only supports FP8 precision. But SGLang is fine — its architecture is quite powerful and natively supports mixed precision within the same computation stream. So just use GLM-5.2-W4AFP8 directly. So back to the unsupported engines: for most quantized versions of GLM-5.2, enabling MTP actually degrades performance. Some quantized versions even completely remove the MTP part (mlx). Community author dnhkng came up with a hybrid method, eventually creating GLM-5.2-AWQ-INT4-FP8-MTP-delta: base uses INT4 (via Marlin kernel) + MTP uses FP8 (maintains precision), and it works with vLLM. Speed jumped from the original 2 token/s all the way to 43.39 token/s (with NUMA binding and MTP-3). So currently, SGLang and vLLM (modded version) can run GLM-5.2 with MTP at full throttle. Meanwhile, llama.cpp and mlx users still need to wait — the community is working on it. The author's blog post (extremely detailed, with many optimization tips): http://dnhkng.github.io/posts/gh200-benchmarking-part-3-glm52/… #glm52 #mtp #dsa

Cached at: 06/25/26, 05:19 AM

Local deployment of GLM-5.2 with vLLM has finally caught up on speed!

Great news has finally arrived for deploying GLM-5.2 locally! As everyone knows, GLM-5.2 comes with its own MTP head, enabling speculative decoding.

However, this only applies to the original bf16-precision GLM-5.2, which needs 1.5TB. Few local setups are that rich, so most people use quantized versions; after all, 4-bit quantization only requires 430GB.

Here’s the problem: because GLM-5.2’s MTP uses a very special DSA (Dynamic Sparse Attention), current inference engines (llama.cpp, vLLM, mlx) cannot support it.

Among them, llama.cpp and mlx have absolutely no way to enable MTP, and vLLM only supports FP8 precision.

As for SGLang, it’s fine – SGLang’s architecture is quite impressive, supporting mixed precision in the same compute stream from the get-go. So you can just use GLM-5.2-W4AFP8 directly.

So back to these unsupported inference engines: most quantized versions of GLM-5.2 actually lose speed when MTP is enabled. Some quantized versions even cut the MTP part entirely (mlx).

Community author dnhkng came up with a stitching method, eventually creating GLM-5.2-AWQ-INT4-FP8-MTP-delta – that is, using INT4 for the base (with Marlin kernel) + FP8 for MTP (preserving precision) while making it compatible with vLLM. Speed jumped from the original 2 token/s to a stunning 43.39 token/s (with NUMA binding + MTP-3).

So far, both SGLang and vLLM (modded version) can run GLM-5.2 with MTP at full throttle. llama.cpp and mlx users still need to wait a bit longer; the community is working on it.

The author’s blog post (the process is incredibly exciting, with many optimization tips): http://dnhkng.github.io/posts/gh200-benchmarking-part-3-glm52/…

#glm52 #mtp #dsa


2x GH200 for LLM inference, Part 3: GLM-5.2, expert offload, and the CPU question

Source: https://dnhkng.github.io/posts/gh200-benchmarking-part-3-glm52/

Introduction

Part 1 (https://dnhkng.github.io/posts/gh200-benchmarking/) measured the dual GH200 workstation as a memory system. Part 2 (https://dnhkng.github.io/posts/gh200-benchmarking-part-2/) used those measurements to explain why DeepSeek V4 Flash can be fast in vLLM when the model layout fits the hardware: keep hot weights in HBM, avoid unnecessary Hopper-to-Hopper traffic, and use MTP only where the acceptance rate pays for the draft work.

GLM-5.2 starts at 2.39 output tok/s on this machine and, after a lot of grinding, finishes near 50 output tok/s. That is the whole post in one line. Two moves close the gap: stop the model crossing between the two GH200 modules, then graft an FP8 MTP head onto the INT4 base. Together they take a model that doesn’t fit in VRAM and serve it at a usable interactive speed.

That gap exists because GLM-5.2 is too damn big. It doesn’t fit in HBM, so the Grace memory (luckily, I have 960 GB of LPDDR5X) has to become part of the serving system. The question jumps in difficulty from "how do I split the model over two Hoppers across a slow interconnect?" to the harder "how do I split it over two Grace-Hopper modules and juggle the transfer of weights into two separate sets of VRAM?"

The short version from my current measurements is below. TG means token generation/decode throughput. PP means prompt processing/prefill throughput.

| Model artifact | Engine | Headline batch-1 TG | Stable batch-4 TG | Best PP-heavy result |
| --- | --- | --- | --- | --- |
| GLM-5.2-FP8 | vLLM, TP2, expert UVA offload | 25.66 output tok/s (best) | 23.63 aggregate output tok/s | 543.66 total tok/s |
| GLM-5.2-AWQ-INT4 | vLLM, TP2, expert UVA offload | 43.39 output tok/s median at `2048->512`, MTP-3 graft | 54.92 aggregate output tok/s, MTP-3 graft | 781.00 total tok/s |
| GLM-5.2 GGUF `UD-IQ2_XXS` | llama.cpp / ik_llama.cpp CPU | 3.13-3.65 output tok/s short, 1.72-3.62 long | not tested | 62.88 pp tok/s with ik_llama.cpp |

The FP8 and AWQ batch-1 MTP headline numbers are from `2048->512` runs. The FP8 MTP-3 point had a 25.64 output tok/s warm mean and 25.66 best sample. The AWQ batch-1 number is now the median of a longer cold-plus-10-warm repeat run, not the best single warm sample. The AWQ batch-4 number is the controlled MTP-3 concurrency result; MTP-4 reached a higher median, but was not repeatable enough to make the headline.

Wait, ***why did I test a slow-ass CPU version too?*** A plausible local-agent architecture is GLM-5.2 on CPU for slower planning, review, or difficult decisions, paired with a much faster DeepSeek V4 Flash instance on GPU for the high-volume path. In commercial-model terms, that is the local version of an Opus/Sonnet style split: a slower, stronger model for the hard calls, and a fast model for the bulk of the work. Unfortunately, although it works in practice, it’s too damn slow.

The System Reminder

The machine is still the same dual Grace Hopper workstation:

| Component | Spec |
| --- | --- |
| GPUs | 2x Hopper H100, 96 GB HBM3 each |
| CPUs | 2x Grace, 72 cores each |
| Host memory | 480 GB LPDDR5X per Grace, 960 GB total |
| GPU local memory | 192 GB total HBM |
| CUDA | 13.0 |
| Driver | 580.105.08 |
| OS | Ubuntu 24.04, aarch64 |

The topology numbers from Part 1 (https://dnhkng.github.io/posts/gh200-benchmarking/) remain the useful mental model:

| Path | Measured bandwidth |
| --- | --- |
| Local HBM | about 3,700 GB/s |
| Local Grace LPDDR to local Hopper | about 377-380 GB/s |
| Remote Grace LPDDR to Hopper | about 133 GB/s |
| Hopper to Hopper staged copy | about 57-58 GB/s |

The model does not fit cleanly in HBM, so decode performance depends on how much expert traffic goes over Grace-to-Hopper C2C, and whether each Hopper is reading from its own local Grace memory rather than the remote module.

A Bandwidth Guestimate

Before measuring vLLM, I wanted a simple guestimate: if the model is split cleanly across both GH200 modules, and each Hopper streams only the active experts from its own local Grace memory, how fast should decode be without MTP?

From the FP8 checkpoint headers, the routed expert weights are about 684 GiB across 76 MoE layers. GLM-5.2 has 256 routed experts per MoE layer and activates 8 experts per token per MoE layer, so each token touches 8 / 256 = 1 / 32 of the routed expert pool. That makes the active expert stream about 684 GiB / 32 = 21.38 GiB per generated token if those experts are fetched from CPU memory every time. This is only the active expert stream, not the whole checkpoint and not the dense attention path.

The optimistic bandwidth math is:

| Assumption | Effective expert stream | Bandwidth path | Estimated non-MTP decode |
| --- | --- | --- | --- |
| One module effectively serializes the stream | 21.38 GiB/token | 377-380 GB/s local Grace to Hopper | 15-18 tok/s |
| Two modules split the layers, no pipeline overlap | 10.69 GiB/token per module, two sequential stages | 377-380 GB/s local Grace to Hopper | 15-18 tok/s |
| Two modules split the layers, ideal steady pipeline | 10.69 GiB/token per module | 377-380 GB/s local Grace to Hopper | 30-36 tok/s aggregate |
| Offloaded experts are interleaved or remote | 21.38 GiB/token equivalent | about 133 GB/s remote Grace to Hopper | about 6 tok/s |
| Traffic falls onto the staged Hopper-to-Hopper path | 21.38 GiB/token equivalent | about 57-58 GB/s | about 2-3 tok/s |

The expert sizes are in GiB while the measured bandwidths are in decimal GB/s. Converting GiB to GB adds a factor of about 1.074 to the byte stream, so dividing GB/s directly by GiB overstates each estimate by about 7 percent. The ranges are wide enough that it does not change the conclusion.

This is deliberately a bandwidth ceiling, ignoring routing overhead, attention, dense layers, synchronization, kernel efficiency, page placement mistakes, and the fact that a single request does not automatically fill a two-stage pipeline. If a strict local-NUMA run lands near 15-18 tok/s batch-1, the system is behaving like the active experts are being streamed over C2C. If it lands near 2-6 tok/s, the layout is probably paying remote-memory or cross-module traffic, and we have messed up our settings.
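The ceiling arithmetic above can be sketched in a few lines of Python. This is a minimal sketch under the same assumptions as the table: the expert stream is the only cost, and the bandwidths are the Part 1 measurements. The function and constant names are illustrative, not taken from the post or from vLLM. Unlike the table, it converts GiB to decimal GB before dividing.

```python
GIB = 2**30  # bytes per GiB (checkpoint sizes)
GB = 10**9   # bytes per decimal GB (measured bandwidths)

ROUTED_EXPERT_GIB = 684   # routed expert weights in the FP8 checkpoint
EXPERTS_PER_LAYER = 256   # routed experts per MoE layer
ACTIVE_EXPERTS = 8        # experts activated per token per MoE layer


def active_stream_gib() -> float:
    """GiB of routed expert weights touched per generated token."""
    return ROUTED_EXPERT_GIB * ACTIVE_EXPERTS / EXPERTS_PER_LAYER


def ceiling_tok_s(bandwidth_gb_s: float, stream_gib: float) -> float:
    """Upper bound on decode tok/s if streaming experts is the only cost."""
    return bandwidth_gb_s * GB / (stream_gib * GIB)


if __name__ == "__main__":
    stream = active_stream_gib()
    cases = [
        ("local, serialized", 377, stream),
        # Ideal pipeline: each module streams half the experts, fully overlapped.
        ("local, ideal pipeline", 377, stream / 2),
        ("remote Grace", 133, stream),
        ("staged Hopper-to-Hopper", 57, stream),
    ]
    for label, bandwidth, gib in cases:
        print(f"{label:<26} {ceiling_tok_s(bandwidth, gib):5.1f} tok/s")
```

With the unit conversion applied, the serialized local case comes out at roughly 16.5 tok/s, which still sits inside the 15-18 tok/s band used in the table.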

What I Tested

I tested three local vLLM artifacts: two from Hugging Face, and one Frankenstein I built during this project:

| Model | Location | Notes |
| --- | --- | --- |
| zai-org/GLM-5.2-FP8 | GLM-5.2-FP8 (https://huggingface.co/zai-org/GLM-5.2-FP8) | Official FP8-style artifact, 754B-class MoE, MTP tensors present |
| cyankiwi/GLM-5.2-AWQ-INT4 | cyankiwi/GLM-5.2-AWQ-INT4 (https://huggingface.co/cyankiwi/GLM-5.2-AWQ-INT4) | AWQ INT4 artifact, loads through compressed-tensors / Marlin WNA16 |
| AWQ + FP8 MTP graft | cyankiwi/GLM-5.2-AWQ-INT4-MTP-FP8 | Local experimental graft: AWQ base model plus FP8 layer-78 MTP tensors from the official FP8 artifact |

The INT4 checkpoint changes the byte count a lot, but probably not the token generation speed quite so much. A crude half-byte-per-weight expert-stream estimate would put the same ideal local-memory ceiling roughly around twice the FP8 ceiling. In practice, INT4 is not just a smaller byte stream: Marlin/AWQ kernel costs, dequantization, graph capture, and vLLM placement all add up.

The first FP8 baseline was awful: 2.39 output tok/s. It was mostly a placement problem, with transfers of weights crossing between GH200 modules.

After switching to strict local NUMA placement and reducing the amount of expert offload until the HBM/KV tradeoff stopped improving, the practical non-MTP batch-1 result was:

| Config | Shape | Result |
| --- | --- | --- |
| TP2, offload 270 GiB/rank, non-MTP | 1 x 256->512 | 20.31 output tok/s |
| TP2, offload 260 GiB/rank, non-MTP, maxlen 3072 | 1 x 256->512 | 20.53 output tok/s |

The 260 GiB point is technically fastest, but it only works by reducing max context to 3,072. For a general launcher, I would not use it. The safer FP8 non-MTP point is 270 GiB expert offload with a 4,096-token max context.

That 20 tok/s result is a useful clue: it is above the simple serialized 15-18 tok/s estimate. The likely interpretation is that we are getting partial overlap across the two GH200 modules: not the ideal 30-36 tok/s steady pipeline, but clearly better than a fully serialized expert stream.

For short prompts, MTP was much less exciting than it was for DeepSeek V4 Flash, where we saw big bumps in performance:

| Config | Shape | Result |
| --- | --- | --- |
| non-MTP, offload 300 GiB/rank, batched 2048 | 1 x 256->512 | 19.33 output tok/s |
| MTP-1, offload 300 GiB/rank, batched 1024 | 1 x 256->512 | 18.43 output tok/s |
| MTP-1, offload 300 GiB/rank, batched 2048 | 1 x 256->512 | 21.22 output tok/s |
| MTP-1, offload 300 GiB/rank, batched 4096 | 1 x 256->512 | 19.09 output tok/s |
| MTP-2, offload 300 GiB/rank, batched 2048 | 1 x 256->512 | 8.87 output tok/s |

Even MTP-1 is only a small win. It reached 21.22 output tok/s, which is 9.8 percent faster than the matched 300 GiB non-MTP placement, but only 4.5 percent faster than the best practical 270 GiB non-MTP placement. The draft layer is not free, and enabling it forces a different HBM/offload tradeoff.
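The two percentages depend on which non-MTP baseline is used, so it may help to see them computed side by side. A small sketch using only the throughput figures quoted above; the helper name is illustrative.

```python
def pct_faster(new_tok_s: float, baseline_tok_s: float) -> float:
    """Relative throughput gain of `new_tok_s` over `baseline_tok_s`, in percent."""
    return (new_tok_s / baseline_tok_s - 1.0) * 100.0


MTP1_BEST = 21.22        # MTP-1, offload 300 GiB/rank, batched 2048
NON_MTP_MATCHED = 19.33  # non-MTP at the same 300 GiB/rank placement
NON_MTP_BEST = 20.31     # non-MTP at the practical 270 GiB/rank placement

if __name__ == "__main__":
    print(f"vs matched placement: {pct_faster(MTP1_BEST, NON_MTP_MATCHED):.1f}%")
    print(f"vs best placement:    {pct_faster(MTP1_BEST, NON_MTP_BEST):.1f}%")
```

The gap between the two results is the cost of the extra offload that MTP forces, not a property of the draft head itself.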

However, that short-prompt result was not the whole story. With a more realistic `2048->512` batch-1 workload and a 4096 scheduled-token cap, the optimum moved upward:

| Spec tokens | Shape | Cold output tok/s | Warm output tok/s | Warm acceptance | Decision |
| --- | --- | --- | --- | --- | --- |
| MTP-1 | 1 x 2048->512 | 22.60 | 21.94, 22.72 | 86.50-97.30% | Baseline |
| MTP-2 | 1 x 2048->512 | 18.68 | 23.78, 23.00 | 82.22-87.17% | Better than MTP-1 |
| MTP-3 | 1 x 2048->512 | 24.23 | 25.61, 25.66 | 93.58% | Best measured |
| MTP-4 | 1 x 2048->512 | 21.62 | 25.48, 16.48 | 47.59-89.06% | Unstable, stop |

I stopped there rather than running MTP-5. The rule was to walk upward and stop when the curve got worse. MTP-4 produced one good warm run and then collapsed on the second warm run, with acceptance falling to 47.59 percent and output throughput falling to 16.48 tok/s.
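The walk-upward rule can be written down as a short search. This is a sketch, assuming the score for each depth is the mean of its warm samples; the post does not say which statistic drove the decision, so that choice and the function name are assumptions.

```python
from statistics import mean

# Warm output tok/s samples from the table above, keyed by speculative depth.
WARM_RUNS = {
    1: [21.94, 22.72],
    2: [23.78, 23.00],
    3: [25.61, 25.66],
    4: [25.48, 16.48],
}


def pick_depth(runs: dict[int, list[float]]) -> int:
    """Walk depths in increasing order and stop once the mean gets worse."""
    best_depth, best_score = 0, float("-inf")
    for depth in sorted(runs):
        score = mean(runs[depth])  # assumed scoring statistic
        if score < best_score:
            break  # the curve turned down: do not test deeper settings
        best_depth, best_score = depth, score
    return best_depth


if __name__ == "__main__":
    print("chosen speculative depth:", pick_depth(WARM_RUNS))
```

On these samples the search settles on MTP-3, because the second MTP-4 warm run drags its mean below the MTP-3 mean. A rule based on the worst warm sample would reach the same depth here.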

For concurrent token generation, MTP is still a disaster in the measured setup:

| Config | Shape | Result |
| --- | --- | --- |
| MTP-1, offload 300 GiB/rank | 4 x 256->512 | 15.15 aggregate output tok/s |
| non-MTP, offload 270 GiB/rank | 4 x 256->512 | 23.63 aggregate output tok/s |

So I would not make MTP the default concurrent-serving profile for FP8. It is a batch-1 latency/throughput knob, and the best speculative depth depends on prompt length and output shape. The FP8 headline PP-heavy result came from a separate non-MTP run:

| Config | Shape | Output tok/s | Total tok/s | Prompt-processing snapshot |
| --- | --- | --- | --- | --- |
| non-MTP, offload 270 GiB/rank, PP-heavy | 4 x 2048->64 | 16.47 | 543.66 | 624.5 prompt tok/s |

INT4: Faster, But With A Different Tradeoff

The AWQ INT4 model was the better vLLM serving target on this machine.

It loads as `compressed-tensors`, and vLLM selected Marlin WNA16 kernels for both linear and MoE paths. In the first serving sweep, the best measured dual-GH200 batch-1 decode was:

| Workload | Output tok/s | Total tok/s | TPOT |
| --- | --- | --- | --- |
| 256->512, concurrency 1 | 24.70 | 37.06 | 37.39 ms |
| 256->1024, concurrency 1 | 26.16 | 32.70 | 37.67 ms |
| 2048->64, concurrency 1 | 17.61 | 581.22 | 37.94 ms |

The best measured throughput profile was:

| Workload | Output tok/s | Total tok/s | Mean TPOT |
| --- | --- | --- | --- |
| 4 x 256->512 | 36.98 | 55.47 | 103.79 ms |
| 4 x 2048->64 | 23.67 | 781.00 | 114.32 ms |

That made the INT4 artifact the practical vLLM choice even before MTP. It was faster than FP8 in every measured comparable serving shape.
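The "Total tok/s" column in these tables looks consistent with counting prompt plus output tokens over the same wall time as the output figure. That reading is inferred from the numbers rather than stated anywhere in the post, so the sketch below only checks whether the reported rows fit it; the function name is illustrative.

```python
def total_tok_s(output_tok_s: float, prompt_len: int, output_len: int) -> float:
    """Total throughput if prompt and output tokens share one wall-clock window."""
    return output_tok_s * (prompt_len + output_len) / output_len


# (output tok/s, prompt length, output length, reported total tok/s)
ROWS = [
    (24.70, 256, 512, 37.06),
    (26.16, 256, 1024, 32.70),
    (17.61, 2048, 64, 581.22),
    (36.98, 256, 512, 55.47),
    (23.67, 2048, 64, 781.00),
]

if __name__ == "__main__":
    for out_tps, prompt_len, output_len, reported in ROWS:
        estimate = total_tok_s(out_tps, prompt_len, output_len)
        print(f"{prompt_len}->{output_len}: estimated {estimate:.2f}, reported {reported:.2f}")
```

Under that reading, the large total tok/s values on the `2048->64` rows mostly reflect prompt tokens, which is why those rows are the prompt-processing-heavy results rather than decode results.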

Originally, the tradeoff was MTP. The INT4 checkpoint itself does not include the MTP layer-78 weights, so MTP startup fails before we get to any acceptance-rate question.

AWQ + FP8 MTP Graft

To test whether GLM-5.2’s MTP head was actually useful, I made a local experimental graft: keep the AWQ INT4 base model, add the FP8 layer-78 MTP tensors from the official FP8 artifact, merge the safetensors index, and patch vLLM so the draft layer can use the FP8 quantization path while the base model stays on AWQ/Marlin. This is not a clean official checkpoint, but it answers the systems question.

To make that reproducible without redistributing a full merged model, I published a small delta repo: dnhkng/GLM-5.2-AWQ-INT4-FP8-MTP-delta (https://huggingface.co/dnhkng/GLM-5.2-AWQ-INT4-FP8-MTP-delta). It contains only the `model.layers.78.*` MTP tensors extracted from `zai-org/GLM-5.2-FP8`, plus `graft_glm52_awq_mtp.sh`. The delta is 1,569 tensors from the FP8 MTP layer, not a replacement for the AWQ checkpoint. The intended workflow is:

```bash
./graft_glm52_awq_mtp.sh \
  --awq-dir /path/to/GLM-5.2-AWQ-INT4 \
  --mtp-delta-dir /path/to/GLM-5.2-AWQ-INT4-FP8-MTP-delta \
  --out-dir /path/to/GLM-5.2-AWQ-INT4-MTP-FP8
```

The script leaves the AWQ weights unchanged, adds the FP8 MTP layer tensors, updates `model.safetensors.index.json`, and adds `mtp_quantization_config` to `config.json` so vLLM can route the draft layer through the FP8 quantization path while keeping the base model on AWQ/Marlin.
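The index and config edits described above can be sketched in Python. This is not the published `graft_glm52_awq_mtp.sh`; it is an illustration of the two metadata steps, assuming the usual sharded-safetensors index layout where `weight_map` maps tensor names to shard files. The function names are hypothetical, and the value passed as `mtp_quantization_config` is a placeholder, since the real schema is not shown in the post.

```python
import copy

MTP_PREFIX = "model.layers.78."  # the MTP layer named in the post


def graft_index(awq_index: dict, delta_index: dict) -> dict:
    """Return a copy of the AWQ index with the delta's MTP tensors added."""
    merged = copy.deepcopy(awq_index)  # leave the AWQ index untouched
    weight_map = merged.setdefault("weight_map", {})
    for name, shard in delta_index.get("weight_map", {}).items():
        if not name.startswith(MTP_PREFIX):
            continue  # only MTP-layer tensors belong in the graft
        if name in weight_map:
            raise ValueError(f"{name} already present in the AWQ index")
        weight_map[name] = shard
    return merged


def graft_config(config: dict, mtp_quant: dict) -> dict:
    """Return a copy of config.json contents with the MTP override attached."""
    patched = copy.deepcopy(config)
    patched["mtp_quantization_config"] = mtp_quant  # schema is a placeholder
    return patched


if __name__ == "__main__":
    # Hypothetical tensor and shard names, for illustration only.
    awq = {"metadata": {}, "weight_map": {"model.layers.0.example": "awq-00001.safetensors"}}
    delta = {"weight_map": {"model.layers.78.example": "mtp-fp8.safetensors"}}
    print(graft_index(awq, delta)["weight_map"])
    print(graft_config({"model_type": "example"}, {"quant_method": "fp8"}))
```

Refusing to overwrite an existing tensor name keeps the "AWQ weights unchanged" property explicit: the graft only ever adds entries.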

The required vLLM changes were small but specific: allow an MTP-only quantization override in the DeepSeek/GLM decoder layer, read that override from a local `mtp_quantization_config`, and skip missing mixed-quantization parameter names while loading the grafted AWQ/FP8 checkpoint. Without the MTP-only FP8 quantization override, the graft loaded but acceptance was effectively zero.

The answer is: yes, MTP helps the AWQ path a lot when it is wired up correctly. For the short-shape comparison below, I re-ran the non-MTP AWQ baseline in the same benchmark setup as the grafted model, which is why these baseline values are a little higher than the earlier general serving sweep. Use these re-measured non-MTP rows for the MTP improvement percentages; the earlier 24.70 and 26.16 tok/s rows are from the first broader INT4 serving sweep, not the controlled graft comparison.

| Profile | Shape | Cold output tok/s | Warm output tok/s | Warm TPOT | Acceptance |
| --- | --- | --- | --- | --- | --- |
| AWQ non-MTP | 256->512 | 25.77 | 26.61-26.63, mean 26.62 | 36.51 ms | n/a |
| AWQ + MTP-1 | 256->512 | 26.96 | 37.29-41.79, mean 38.82 | 24.72 ms | 98.58% |
| AWQ non-MTP | 256->1024 | not run | 26.94-26.95, mean 26.95 | 36.58 ms | n/a |
| AWQ + MTP-1 | 256->1024 | not run | 37.81-38.08, mean 37.95 | 25.81 ms | 98.84% |

The first MTP request still pays first-shape JIT overhead. In the cold 256->512 MTP run, TTFT was 4.17 seconds and the log showed Triton JIT compilation for slot mapping, prefill metadata, EAGLE/MTP input preparation, and rejection sampling kernels. After that, TTFT returned to roughly 0.59 seconds and the steady decode path sat around 38-39 output tok/s.
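One way to read the table is to ask how much slower a single MTP decode step is than a non-MTP step. The sketch below is a back-of-envelope estimate, assuming the reported acceptance is a per-draft-token probability, that draft tokens are accepted independently, and that drafting stops at the first rejection. None of those definitions are confirmed by the post, and the function names are illustrative.

```python
def tokens_per_step(acceptance: float, depth: int = 1) -> float:
    """Expected tokens emitted per verify step under the independence assumption."""
    # One token always comes from the verify pass; draft token k survives
    # only if all k drafts up to and including it were accepted.
    return 1.0 + sum(acceptance**k for k in range(1, depth + 1))


def implied_step_ms(tpot_ms: float, acceptance: float, depth: int = 1) -> float:
    """Wall time of one draft-plus-verify step implied by a measured TPOT."""
    return tpot_ms * tokens_per_step(acceptance, depth)


if __name__ == "__main__":
    non_mtp_step_ms = 36.51                       # AWQ non-MTP warm TPOT, 256->512
    mtp_step_ms = implied_step_ms(24.72, 0.9858)  # AWQ + MTP-1 warm TPOT and acceptance
    print(f"implied MTP-1 step: {mtp_step_ms:.1f} ms")
    print(f"relative step cost: {mtp_step_ms / non_mtp_step_ms:.2f}x")
```

Under those assumptions each MTP-1 step costs roughly a third more than a plain decode step but emits almost two tokens, which is where the net gain comes from.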

The very high acceptance rates here are from these synthetic benchmark prompts. Real agent prompts and structured continuations ma
