I did some model hacks, and got GLM5.2 from about 2.5 tok/s to >50 tok/s on my GH200 system.

Reddit r/LocalLLaMA 06/24/26, 01:30 PM Tools

inference-optimization model-hacks gh200 glm-5.2 expert-offload mtp vllm

Summary

A detailed blog post describing how to speed up GLM-5.2 inference on a dual Grace Hopper system, from about 2.5 tok/s to over 50 tok/s, by stopping the model's traffic from crossing between the two GH200 modules and grafting an FP8 MTP head onto the INT4 base.

Original Article

Cached at: 06/24/26, 02:28 PM

# 2x GH200 for LLM inference, Part 3: GLM-5.2, expert offload, and the CPU question

Source: [https://dnhkng.github.io/posts/gh200-benchmarking-part-3-glm52/](https://dnhkng.github.io/posts/gh200-benchmarking-part-3-glm52/)

## Introduction

[Part 1](https://dnhkng.github.io/posts/gh200-benchmarking/) measured the dual GH200 workstation as a memory system. [Part 2](https://dnhkng.github.io/posts/gh200-benchmarking-part-2/) used those measurements to explain why DeepSeek V4 Flash can be fast in vLLM when the model layout fits the hardware: keep hot weights in HBM, avoid unnecessary Hopper-to-Hopper traffic, and use MTP only where the acceptance rate pays for the draft work.

GLM-5.2 starts at **2.39 output tok/s** on this machine and, after a lot of grinding, finishes near **50 output tok/s**. That is the whole post in one line. Two moves close the gap: stop the model crossing between the two GH200 modules, then graft an FP8 MTP head onto the INT4 base. Together they take a model that *doesn’t fit in VRAM* and serve it at a usable interactive speed.

That gap exists because GLM-5.2 is ***too damn big***. It doesn’t fit in HBM, so the Grace memory (*luckily, I have 960 GB LPDDR5X*) has to become part of the serving system. The question jumps in difficulty from *how do I split the model over two Hoppers across a slow interconnect* to the harder *how do I split it over two Grace-Hopper modules and juggle the transfer of weights into two separate sets of VRAM?*

The short version from my current measurements is below. **TG** means token generation/decode throughput. **PP** means prompt processing/prefill throughput.

| Model artifact | Engine | Headline batch-1 TG | Stable batch-4 TG | Best PP-heavy result |
| --- | --- | --- | --- | --- |
| GLM-5.2-FP8 | vLLM, TP2, expert UVA offload | 25.66 output tok/s (best) | 23.63 aggregate output tok/s | 543.66 total tok/s |
| GLM-5.2-AWQ-INT4 | vLLM, TP2, expert UVA offload | 43.39 output tok/s median at `2048->512`, MTP-3 graft | 54.92 aggregate output tok/s, MTP-3 graft | 781.00 total tok/s |
| GLM-5.2 GGUF `UD-IQ2_XXS` | llama.cpp / ik_llama.cpp CPU | 3.13-3.65 output tok/s short, 1.72-3.62 long | not tested | 62.88 pp tok/s with ik_llama.cpp |

The FP8 and AWQ batch-1 MTP headline numbers are from `2048->512` runs. The FP8 MTP-3 point had a 25.64 output tok/s warm mean and 25.66 best sample. The AWQ batch-1 number is now the median of a longer cold-plus-10-warm repeat run, not the best single warm sample. The AWQ batch-4 number is the controlled MTP-3 concurrency result; MTP-4 reached a higher median, but was not repeatable enough to make the headline.

Wait, ***why did I test a slow-ass CPU version too?*** A plausible local-agent architecture is GLM-5.2 on CPU for slower planning, review, or difficult decisions, paired with a much faster DeepSeek V4 Flash instance on GPU for the high-volume path. In commercial-model terms, that is the local version of an Opus/Sonnet style split: *a slower stronger model for the hard calls, and a fast model for the bulk of the work*. Unfortunately, although it works in practice, it’s too damn slow.

## The System Reminder

The machine is still the same dual Grace Hopper workstation:

| Component | Spec |
| --- | --- |
| GPUs | 2x Hopper H100, 96 GB HBM3 each |
| CPUs | 2x Grace, 72 cores each |
| Host memory | 480 GB LPDDR5X per Grace, 960 GB total |
| GPU local memory | 192 GB total HBM |
| CUDA | 13.0 |
| Driver | 580.105.08 |
| OS | Ubuntu 24.04, aarch64 |

The [topology numbers from Part 1](https://dnhkng.github.io/posts/gh200-benchmarking/) remain the useful mental model:

| Path | Measured bandwidth |
| --- | --- |
| Local HBM | about 3,700 GB/s |
| Local Grace LPDDR to local Hopper | about 377-380 GB/s |
| Remote Grace LPDDR to Hopper | about 133 GB/s |
| Hopper to Hopper staged copy | about 57-58 GB/s |

The model does not fit cleanly in HBM, so decode performance depends on how much expert traffic goes over Grace-to-Hopper C2C, and whether each Hopper is reading from its own local Grace memory rather than the remote module.

## A Bandwidth Guestimate

Before measuring vLLM, I wanted a simple guestimate: if the model is split cleanly across both GH200 modules, and each Hopper streams only the active experts from its own local Grace memory, how fast should decode be without MTP?

From the FP8 checkpoint headers, the routed expert weights are about 684 GiB across 76 MoE layers. GLM-5.2 has 256 routed experts per MoE layer and activates 8 experts per token per MoE layer, so each token touches 8 / 256 = 1 / 32 of the routed expert pool. That makes the active expert stream about 684 GiB / 32 = 21.38 GiB per generated token if those experts are fetched from CPU memory every time. This is only the active expert stream, not the whole checkpoint and not the dense attention path.

The optimistic bandwidth math is:

| Assumption | Effective expert stream | Bandwidth path | Estimated non-MTP decode |
| --- | --- | --- | --- |
| One module effectively serializes the stream | 21.38 GiB/token | 377-380 GB/s local Grace to Hopper | 15-18 tok/s |
| Two modules split the layers, no pipeline overlap | 10.69 GiB/token per module, two sequential stages | 377-380 GB/s local Grace to Hopper | 15-18 tok/s |
| Two modules split the layers, ideal steady pipeline | 10.69 GiB/token per module | 377-380 GB/s local Grace to Hopper | 30-36 tok/s aggregate |
| Offloaded experts are interleaved or remote | 21.38 GiB/token equivalent | about 133 GB/s remote Grace to Hopper | about 6 tok/s |
| Traffic falls onto the staged Hopper-to-Hopper path | 21.38 GiB/token equivalent | about 57-58 GB/s | about 2-3 tok/s |

The expert sizes are in GiB while the measured bandwidths are in decimal GB/s. Converting GiB to GB adds a factor of about 1.074 to the byte stream, so treating the two units as equal overstates throughput by about 7 percent. The ranges are wide enough that it does not change the conclusion.

This is deliberately a bandwidth ceiling, ignoring routing overhead, attention, dense layers, synchronization, kernel efficiency, page placement mistakes, and the fact that a single request does not automatically fill a two-stage pipeline. If a strict local-NUMA run lands near 15-18 tok/s batch-1, the system is behaving like the active experts are being streamed over C2C. If it lands near 2-6 tok/s, the layout is probably paying remote-memory or cross-module traffic, and we have messed up our settings.
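
The arithmetic behind that ceiling can be reproduced in a few lines. This is a minimal sketch, not the author's script: the function names are made up for illustration, the only inputs are the figures quoted above (684 GiB of routed experts, 8 of 256 experts active, and the measured path bandwidths), and it models nothing except the expert byte stream.

```python
GIB = 2**30  # bytes in one GiB (binary)
GB = 10**9   # bytes in one GB (decimal, as used for the bandwidth figures)

ROUTED_EXPERT_GIB = 684   # routed expert weights, from the FP8 checkpoint headers
EXPERTS_PER_LAYER = 256   # routed experts per MoE layer
ACTIVE_EXPERTS = 8        # experts activated per token per MoE layer


def active_stream_bytes() -> float:
    """Bytes of routed-expert weights touched per generated token."""
    return ROUTED_EXPERT_GIB * GIB * ACTIVE_EXPERTS / EXPERTS_PER_LAYER


def decode_ceiling(bandwidth_gb_s: float, parallel_modules: int = 1) -> float:
    """Upper bound on tok/s if streaming active experts were the only cost.

    parallel_modules=2 models the ideal steady pipeline, where each module
    streams half of the bytes and both are busy at the same time.
    """
    per_module_bytes = active_stream_bytes() / parallel_modules
    return bandwidth_gb_s * GB / per_module_bytes


if __name__ == "__main__":
    print(f"active stream: {active_stream_bytes() / GIB:.2f} GiB/token")
    for label, bw, modules in [
        ("local Grace to Hopper, serialized", 377, 1),
        ("local Grace to Hopper, ideal pipeline", 377, 2),
        ("remote Grace to Hopper", 133, 1),
        ("staged Hopper to Hopper", 57, 1),
    ]:
        print(f"{label}: {decode_ceiling(bw, modules):.1f} tok/s")
```

With the GiB-to-GB conversion applied, the serialized local path comes out at roughly 16.4 to 16.6 tok/s, inside the 15-18 tok/s range in the table.
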
## What I Tested

I tested three local vLLM artifacts, two from Hugging Face, and one Frankenstein I built during this project:

| Model | Location | Notes |
| --- | --- | --- |
| **zai-org/GLM-5.2-FP8** | [GLM-5.2-FP8](https://huggingface.co/zai-org/GLM-5.2-FP8) | Official FP8-style artifact, 754B-class MoE, MTP tensors present |
| **cyankiwi/GLM-5.2-AWQ-INT4** | [cyankiwi/GLM-5.2-AWQ-INT4](https://huggingface.co/cyankiwi/GLM-5.2-AWQ-INT4) | AWQ INT4 artifact, loads through compressed-tensors / Marlin WNA16 |
| AWQ + FP8 MTP graft | cyankiwi/GLM-5.2-AWQ-INT4-MTP-FP8 | Local experimental graft: AWQ base model plus FP8 layer-78 MTP tensors from the official FP8 artifact |

The INT4 checkpoint changes the byte count a lot, but probably not the token generation speed quite so much. A crude half-byte-per-weight expert-stream estimate would put the same ideal local-memory ceiling roughly around twice the FP8 ceiling. In practice, INT4 is not just a smaller byte stream: Marlin/AWQ kernel costs, dequantization, graph capture, and vLLM placement all add up.

*The first FP8 baseline was awful*: **2.39 output tok/s**. It was mostly a placement problem, with transfers of weights crossing between GH200 modules. After switching to strict local NUMA placement and reducing the amount of expert offload until the HBM/KV tradeoff stopped improving, the practical non-MTP batch-1 result was:

| Config | Shape | Result |
| --- | --- | --- |
| TP2, offload 270 GiB/rank, non-MTP | 1 x 256->512 | 20.31 output tok/s |
| TP2, offload 260 GiB/rank, non-MTP, maxlen 3072 | 1 x 256->512 | 20.53 output tok/s |

The `260 GiB` point is technically fastest, but it only works by reducing max context to 3,072. For a general launcher, I would not use it. The safer FP8 non-MTP point is `270 GiB` expert offload with a 4,096-token max context.
That 20 tok/s result is notable: it is above the simple serialized 15-18 tok/s estimate. The likely interpretation is that we are getting partial overlap across the two GH200 modules: not the ideal 30-36 tok/s steady pipeline, but clearly better than a fully serialized expert stream.

For short prompts, MTP was much less exciting than it was for DeepSeek V4 Flash, where we saw big bumps in performance:

| Config | Shape | Result |
| --- | --- | --- |
| non-MTP, offload 300 GiB/rank, batched 2048 | 1 x 256->512 | 19.33 output tok/s |
| MTP-1, offload 300 GiB/rank, batched 1024 | 1 x 256->512 | 18.43 output tok/s |
| MTP-1, offload 300 GiB/rank, batched 2048 | 1 x 256->512 | 21.22 output tok/s |
| MTP-1, offload 300 GiB/rank, batched 4096 | 1 x 256->512 | 19.09 output tok/s |
| MTP-2, offload 300 GiB/rank, batched 2048 | 1 x 256->512 | 8.87 output tok/s |

Even MTP-1 is only a small win. It reached 21.22 output tok/s, which is 9.8 percent faster than the matched 300 GiB non-MTP placement, but only 4.5 percent faster than the best practical 270 GiB non-MTP placement. The draft layer is not free, and enabling it forces a different HBM/offload tradeoff.

However, that short-prompt result was not the whole story. With a more realistic `2048->512` batch-1 workload and a 4096 scheduled-token cap, the optimum moved upward:

| Spec tokens | Shape | Cold output tok/s | Warm output tok/s | Warm acceptance | Decision |
| --- | --- | --- | --- | --- | --- |
| MTP-1 | 1 x 2048->512 | 22.60 | 21.94, 22.72 | 86.50-97.30% | Baseline |
| MTP-2 | 1 x 2048->512 | 18.68 | 23.78, 23.00 | 82.22-87.17% | Better than MTP-1 |
| MTP-3 | 1 x 2048->512 | 24.23 | 25.61, 25.66 | 93.58% | Best measured |
| MTP-4 | 1 x 2048->512 | 21.62 | 25.48, 16.48 | 47.59-89.06% | Unstable, stop |

I stopped there rather than running MTP-5. The rule was to walk upward and stop when the curve got worse. MTP-4 produced one good warm run and then collapsed on the second warm run, with acceptance falling to 47.59 percent and output throughput falling to 16.48 tok/s.
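
The walk-upward rule can be written down as a small search. This is a sketch under one stated assumption: each depth is scored by its worst warm run, which the post does not spell out (scoring by the warm mean picks the same depth on this data). `WARM_RUNS` holds the warm samples from the `2048->512` table above; `pick_depth` is a made-up name.

```python
# Warm-run output tok/s per speculative depth, from the 2048->512 FP8 table.
WARM_RUNS = {
    1: [21.94, 22.72],
    2: [23.78, 23.00],
    3: [25.61, 25.66],
    4: [25.48, 16.48],
}


def pick_depth(warm_runs):
    """Walk depths upward and stop at the first depth that scores worse.

    Assumption: a depth is judged by its worst warm run, so a single
    collapsed run is enough to end the walk.
    """
    best_depth, best_score = None, float("-inf")
    for depth in sorted(warm_runs):
        score = min(warm_runs[depth])
        if score < best_score:
            break  # the curve got worse; do not try deeper speculation
        best_depth, best_score = depth, score
    return best_depth, best_score


if __name__ == "__main__":
    depth, score = pick_depth(WARM_RUNS)
    print(f"chosen depth: MTP-{depth} (worst warm run {score} tok/s)")
```

On these samples the walk stops at MTP-4 and returns MTP-3, matching the "Best measured" row.
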
For concurrent token generation, MTP is still a disaster in the measured setup:

| Config | Shape | Result |
| --- | --- | --- |
| MTP-1, offload 300 GiB/rank | 4 x 256->512 | 15.15 aggregate output tok/s |
| non-MTP, offload 270 GiB/rank | 4 x 256->512 | 23.63 aggregate output tok/s |

So I would not make MTP the default concurrent-serving profile for FP8. It is a batch-1 latency/throughput knob, and the best speculative depth depends on prompt length and output shape.

The FP8 headline PP-heavy result came from a separate non-MTP run:

| Config | Shape | Output tok/s | Total tok/s | Prompt-processing snapshot |
| --- | --- | --- | --- | --- |
| non-MTP, offload 270 GiB/rank, PP-heavy | 4 x 2048->64 | 16.47 | 543.66 | 624.5 prompt tok/s |

## INT4: Faster, But With A Different Tradeoff

The AWQ INT4 model was the better vLLM serving target on this machine. It loads as `compressed-tensors`, and vLLM selected Marlin WNA16 kernels for both linear and MoE paths.

In the first serving sweep, the best measured dual-GH200 batch-1 decode was:

| Workload | Output tok/s | Total tok/s | TPOT |
| --- | --- | --- | --- |
| 256->512, concurrency 1 | 24.70 | 37.06 | 37.39 ms |
| 256->1024, concurrency 1 | 26.16 | 32.70 | 37.67 ms |
| 2048->64, concurrency 1 | 17.61 | 581.22 | 37.94 ms |

The best measured throughput profile was:

| Workload | Output tok/s | Total tok/s | Mean TPOT |
| --- | --- | --- | --- |
| 4 x 256->512 | 36.98 | 55.47 | 103.79 ms |
| 4 x 2048->64 | 23.67 | 781.00 | 114.32 ms |

That made the INT4 artifact the practical vLLM choice even before MTP. It was faster than FP8 in every measured comparable serving shape.

Originally, the tradeoff was MTP. The INT4 checkpoint itself does not include the MTP layer-78 weights, so MTP startup fails before we get to any acceptance-rate question.
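
A quick way to confirm that a checkpoint lacks the MTP layer before trying to launch with speculation is to inspect its sharded safetensors index. The sketch below assumes the checkpoint ships the usual `model.safetensors.index.json` with a `weight_map` dictionary, and that the MTP tensors live under the `model.layers.78.` prefix mentioned in this post. `mtp_tensor_names` is a hypothetical helper, not part of vLLM or any library.

```python
import json
from pathlib import Path

MTP_PREFIX = "model.layers.78."  # MTP layer prefix as described in the post


def mtp_tensor_names(checkpoint_dir, prefix=MTP_PREFIX):
    """Return the tensor names under `prefix` listed in the safetensors index.

    An empty list means the index lists no MTP-layer tensors.
    """
    index_path = Path(checkpoint_dir) / "model.safetensors.index.json"
    with index_path.open() as fh:
        weight_map = json.load(fh)["weight_map"]
    return sorted(name for name in weight_map if name.startswith(prefix))
```

Calling `mtp_tensor_names("/path/to/GLM-5.2-AWQ-INT4")` and getting an empty list would be consistent with the startup failure described above; the path is a placeholder.
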
## AWQ + FP8 MTP Graft

To test whether GLM-5.2’s MTP head was actually useful, I made a local experimental graft: keep the AWQ INT4 base model, add the FP8 layer-78 MTP tensors from the official FP8 artifact, merge the safetensors index, and patch vLLM so the draft layer can use the FP8 quantization path while the base model stays on AWQ/Marlin. This is not a clean official checkpoint, but it answers the systems question.

To make that reproducible without redistributing a full merged model, I published a small delta repo: [dnhkng/GLM-5.2-AWQ-INT4-FP8-MTP-delta](https://huggingface.co/dnhkng/GLM-5.2-AWQ-INT4-FP8-MTP-delta). It contains only the `model.layers.78.*` MTP tensors extracted from `zai-org/GLM-5.2-FP8`, plus `graft_glm52_awq_mtp.sh`. The delta is 1,569 tensors from the FP8 MTP layer, not a replacement for the AWQ checkpoint.

The intended workflow is:

    ./graft_glm52_awq_mtp.sh \
      --awq-dir /path/to/GLM-5.2-AWQ-INT4 \
      --mtp-delta-dir /path/to/GLM-5.2-AWQ-INT4-FP8-MTP-delta \
      --out-dir /path/to/GLM-5.2-AWQ-INT4-MTP-FP8

The script leaves the AWQ weights unchanged, adds the FP8 MTP layer tensors, updates `model.safetensors.index.json`, and adds `mtp_quantization_config` to `config.json` so vLLM can route the draft layer through the FP8 quantization path while keeping the base model on AWQ/Marlin.

The required vLLM changes were small but specific: allow an MTP-only quantization override in the DeepSeek/GLM decoder layer, read that override from a local `mtp_quantization_config`, and skip missing mixed-quantization parameter names while loading the grafted AWQ/FP8 checkpoint. Without the MTP-only FP8 quantization override, the graft loaded but acceptance was effectively zero.
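
The index-merge step of such a graft might look like the following. This is a sketch only: it is not taken from `graft_glm52_awq_mtp.sh`, it assumes both the AWQ checkpoint and the delta expose a standard `weight_map` dictionary, and `graft_weight_map` is a made-up name. It does not touch the tensor files, the index `metadata`, or `config.json`.

```python
def graft_weight_map(base_index, delta_index, prefix="model.layers.78."):
    """Return a new index dict with the delta's MTP entries added to the base.

    Only delta tensors whose names start with `prefix` are taken, and the
    merge refuses to overwrite any tensor that the base already lists, so
    the base (AWQ) entries stay exactly as they were.
    """
    base_map = base_index["weight_map"]
    delta_map = {
        name: shard
        for name, shard in delta_index["weight_map"].items()
        if name.startswith(prefix)
    }
    clashes = sorted(set(base_map) & set(delta_map))
    if clashes:
        raise ValueError(f"delta would overwrite base tensors: {clashes[:3]}")
    merged = dict(base_index)  # shallow copy; the input dicts are not mutated
    merged["weight_map"] = {**base_map, **delta_map}
    return merged
```

Refusing to overwrite existing names is a design choice made here to mirror the "leaves the AWQ weights unchanged" property; the real script may enforce that differently.
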
The answer is: yes, MTP helps the AWQ path a lot when it is wired up correctly.

For the short-shape comparison below, I re-ran the non-MTP AWQ baseline in the same benchmark setup as the grafted model, which is why these baseline values are a little higher than the earlier general serving sweep. Use these re-measured non-MTP rows for the MTP improvement percentages; the earlier 24.70 and 26.16 tok/s rows are from the first broader INT4 serving sweep, not the controlled graft comparison.

| Profile | Shape | Cold output tok/s | Warm output tok/s | Warm TPOT | Acceptance |
| --- | --- | --- | --- | --- | --- |
| AWQ non-MTP | 256->512 | 25.77 | 26.61-26.63, mean 26.62 | 36.51 ms | n/a |
| AWQ + MTP-1 | 256->512 | 26.96 | 37.29-41.79, mean 38.82 | 24.72 ms | 98.58% |
| AWQ non-MTP | 256->1024 | not run | 26.94-26.95, mean 26.95 | 36.58 ms | n/a |
| AWQ + MTP-1 | 256->1024 | not run | 37.81-38.08, mean 37.95 | 25.81 ms | 98.84% |

The first MTP request still pays first-shape JIT overhead. In the cold 256->512 MTP run, TTFT was 4.17 seconds and the log showed Triton JIT compilation for slot mapping, prefill metadata, EAGLE/MTP input preparation, and rejection sampling kernels. After that, TTFT returned to roughly 0.59 seconds and the steady decode path sat around 38-39 output tok/s.

The very high acceptance rates here are from these synthetic benchmark prompts. Real agent prompts and structured continuations may have lower acceptance, so the short-shape 41-46 percent gain should be treated as a measured benchmark result, not a guaranteed application-level speedup.

I then repeated the speculative-depth sweep with a stricter rule: one cold run plus ten warm runs, no discarded noisy samples, and prompt lengths from `256->512` up to `8192->512`. The server used `MAX_MODEL_LEN=9216`, `MAX_NUM_BATCHED_TOKENS=9216`, `MAX_NUM_SEQS=1`, `TP_SIZE=2`, `CPU_OFFLOAD_GB=170`, expert UVA offload, local NUMA binding, and FP8 MLA KV cache.
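
For a repeat-run sweep like this, each shape is usually reduced to a handful of summary statistics. The sketch below shows one way to compute them. The exact definitions used in the post are not stated, so the linear-interpolated percentiles, the population standard deviation in the coefficient of variation (CV), and the 60 percent acceptance floor as a parameter are assumptions. `summarize` and `percentile` are illustrative names.

```python
from statistics import mean, median, pstdev


def percentile(sorted_vals, q):
    """Linear-interpolated percentile of an ascending list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def summarize(tok_s, acceptance=None, floor=60.0):
    """Summarize one shape: all runs are kept, none are discarded as noise."""
    vals = sorted(tok_s)
    out = {
        "runs": len(vals),
        "median": median(vals),
        "min": vals[0],
        "max": vals[-1],
        "p10": percentile(vals, 10),
        "p90": percentile(vals, 90),
        "cv": pstdev(vals) / mean(vals),  # assumption: population std dev
    }
    if acceptance is not None:
        # Count runs whose acceptance rate (in percent) fell below the floor.
        out["sub_floor_runs"] = sum(a < floor for a in acceptance)
    return out


if __name__ == "__main__":
    # Made-up samples purely to exercise the function, not measured data.
    print(summarize([10, 20, 30, 40, 50], acceptance=[95, 59.9, 60, 40, 88]))
```

Reporting P10 and CV alongside the median is what makes an unstable profile visible: two profiles can share a median while having very different tails.
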
The practical comparison is MTP-3 versus MTP-4:

| Profile | Shape | Runs | Median output tok/s | Min | Max | P10 | P90 | CV | Median TPOT | Median acceptance | Sub-60 acceptance runs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AWQ non-MTP | 256->512 | 11 | 25.15 | 23.94 | 25.19 | 25.12 | 25.18 | 0.014 | 38.62 ms | n/a | 0 |
| AWQ non-MTP | 2048->512 | 11 | 24.03 | 24.01 | 24.05 | 24.02 | 24.05 | 0.001 | 39.31 ms | n/a | 0 |
| AWQ non-MTP | 4096->512 | 11 | 23.06 | 23.02 | 23.09 | 23.05 | 23.07 | 0.001 | 39.41 ms | n/a | 0 |
| AWQ non-MTP | 8192->512 | 11 | 21.36 | 21.24 | 21.38 | 21.33 | 21.37 | 0.002 | 39.46 ms | n/a | 0 |
| AWQ + MTP-3 | 256->512 | 11 | 47.27 | 34.50 | 55.06 | 36.35 | 52.09 | 0.136 | 20.01 ms | 92.16% | 1 |
| AWQ + MTP-3 | 2048->512 | 11 | 43.39 | 33.32 | 56.72 | 34.43 | 46.13 | 0.147 | 20.66 ms | 91.48% | 2 |
| AWQ + MTP-3 | 4096->512 | 11 | 42.97 | 40.37 | 48.33 | 40.39 | 46.46 | 0.061 | 19.23 ms | 96.95% | 0 |
| AWQ + MTP-3 | 8192->512 | 11 | 35.69 | 27.17 | 38.78 | 28.82 | 38.11 | 0.105 | 20.58 ms | 94.03% | 1 |
| AWQ + MTP-4 | 256->512 | 11 | 45.77 | 36.79 | 70.02 | 38.29 | 61.83 | 0.211 | 20.69 ms | 74.61% | 2 |
| AWQ + MTP-4 | 2048->512 | 11 | 46.87 | 32.31 | 63.55 | 35.86 | 57.28 | 0.196 | 18.96 ms | 84.83% | 2 |
| AWQ + MTP-4 | 4096->512 | 11 | 45.97 | 36.47 | 54.68 | 37.29 | 48.72 | 0.108 | 17.71 ms | 92.20% | 0 |
| AWQ + MTP-4 | 8192->512 | 11 | 29.58 | 22.77 | 43.13 | 27.12 | 42.02 | 0.204 | 26.19 ms | 56.37% | 6 |

This changes the AWQ story again. MTP-4 is not just “interesting but noisy”; it fails as a default. It has excellent best-case rows, including 70.02 output tok/s on one short synthetic prompt, but under longer prompts the tail is ugly. At `8192->512`, six of eleven MTP-4 runs fell below 60 percent acceptance, and the worst warm run dropped to 22.77 output tok/s, essentially back near non-MTP speed.

MTP-3 is not magic either. It had prompt-sensitive low-acceptance rows, including two sub-60 percent acceptance runs at `2048->512` and one at `8192->512`. But its lower tail is better, its coefficient of variation is lower, and it stays clearly above non-MTP in all tested batch-1 shapes.

For real use on this system, the launcher default is now the MTP-3 `stable` profile with `MAX_MODEL_LEN=635904`.
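
One way to encode that "judge the default by its tail" reasoning is to rank profiles on worst-case numbers only. The sketch below is a hypothetical formalization, not the author's selection procedure. The tuples are the P10, CV, and sub-60 acceptance counts from the MTP-3 and MTP-4 rows of the table above, and the ordering of the three criteria is an assumption.

```python
# (P10 output tok/s, CV, sub-60 acceptance runs) per shape, in table order:
# 256->512, 2048->512, 4096->512, 8192->512.
SWEEP = {
    "MTP-3": [(36.35, 0.136, 1), (34.43, 0.147, 2), (40.39, 0.061, 0), (28.82, 0.105, 1)],
    "MTP-4": [(38.29, 0.211, 2), (35.86, 0.196, 2), (37.29, 0.108, 0), (27.12, 0.204, 6)],
}


def tail_score(rows):
    """Collapse a profile to (worst P10, worst CV, total low-acceptance runs)."""
    worst_p10 = min(p10 for p10, _, _ in rows)
    worst_cv = max(cv for _, cv, _ in rows)
    low_runs = sum(n for _, _, n in rows)
    return worst_p10, worst_cv, low_runs


def pick_default(sweep):
    """Prefer fewest low-acceptance runs, then lowest CV, then highest P10."""
    def key(name):
        worst_p10, worst_cv, low_runs = tail_score(sweep[name])
        return (low_runs, worst_cv, -worst_p10)
    return min(sweep, key=key)


if __name__ == "__main__":
    for name, rows in SWEEP.items():
        print(name, tail_score(rows))
    print("default:", pick_default(SWEEP))
```

On this data MTP-3 wins on all three criteria, so the choice does not depend on the order in which they are applied.
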
The `9216` value in these tests is the benchmark/scheduler token budget used for the `8192->512` sweep, not the production context limit.

I also repeated the `2048->512` test with true concurrency, using `MAX_NUM_SEQS=4`. These are aggregate output throughput numbers:

| Profile | Shape | Concurrency | Runs | Median output tok/s | Min | Max | P10 | P90 | CV | Median TPOT | Median acceptance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AWQ + MTP-3 | 2048->512 | 2 | 11 | 47.92 | 41.87 | 60.64 | 42.39 | 58.85 | 0.129 | 34.65 ms | 80.33% |
| AWQ + MTP-3 | 2048->512 | 4 | 11 | 54.92 | 48.45 | 63.96 | 50.71 | 58.87 | 0.076 | 60.86 ms | 77.42% |
| AWQ + MTP-4 | 2048->512 | 2 | 11 | 50.50 | 35.08 | 67.81 | 41.22 | 63.48 | 0.186 | 32.31 ms | 81.02% |
| AWQ + MTP-4 | 2048->512 | 4 | 11 | 57.17 | 49.54 | 67.83 | 49.70 | 67.01 | 0.111 | 56.51 ms | 72.21% |

The concurrency result is a useful sanity check. MTP-4 still has higher headline medians, but the same tail problem remains. At concurrency 2 it had a warm run with only 46.24 percent acceptance and 35.08 aggregate output tok/s, below the MTP-3 P10. At concurrency 4 it was faster on median, but acceptance was lower and variance was higher. That is not the kind of repeatability I want in a default launcher.

I also ran a small fixed-prompt sanity check with four prompts: coding review, GH200 systems reasoning, blog summarization, and benchmark design. This is not as strong as the full synthetic sweep, because it is only two runs per profile, but it is useful for checking whether synthetic random-token acceptance is too optimistic:

| Profile | Runs | Median output tok/s | Min | Max | Median TPOT | Median acceptance |
| --- | --- | --- | --- | --- | --- | --- |
| AWQ + MTP-3 | 2 | 35.44 | 34.81 | 36.07 | 25.99 ms | 62.73% |
| AWQ + MTP-4 | 2 | 36.07 | 35.40 | 36.74 | 26.84 ms | 56.55% |

That result makes the caveat concrete. Real prompts drove acceptance much lower than the synthetic runs for both profiles. MTP-4 was only marginally faster on median output throughput and had lower median acceptance, so it still does not justify replacing MTP-3 as the default.

This is very different from the FP8 result.
FP8 MTP-1 was only a narrow batch-1 win, and it lost badly for concurrent token generation. The AWQ graft has a much better ratio: the draft layer is cheap enough, and accepted often enough, that MTP-3 roughly halves TPOT versus the non-MTP baseline in the controlled batch-1 tests.

The caveat is important: the base `cyankiwi` AWQ artifact still does not ship usable MTP weights, so everything above depends on a local graft plus local vLLM patches for mixed AWQ/FP8 loading. The delta repo makes the graft reproducible, but this is still a systems experiment, not an official merged model release.

## Context Capacity

With the speed settings locked down, the last thing was to optimise the context length. I tested the AWQ + FP8 MTP graft as a context-capacity profile with `MAX_NUM_SEQS=2`, `GPU_UTIL=0.90`, `CPU_OFFLOAD_GB=170`, `kv_cache_dtype=fp8_ds_mla`, and CUDA graph memory profiling enabled. By varying the context size, and using the remaining VRAM to triangulate, I was able to quickly optimise the launch flags:

| Setting | Result |
| --- | --- |
| MAX_MODEL_LEN | 635,904 tokens |
| MAX_NUM_SEQS | 2 |
| Reported available KV cache memory | 32.42 GiB |
| Reported GPU KV cache size | 635,904 tokens |
| Reported maximum concurrency at 635,904 tokens | 1.00x |

## Single GH200 Did Not Work Yet

I also tried to make the INT4 artifact run through vLLM on one GH200 module.

| Config | Result |
| --- | --- |
| `cpu_offload_gb=330`, `max_model_len=4096`, `gpu_util=0.90` | Model loaded, then KV init failed with `Available KV cache memory: -0.38 GiB` |
| `cpu_offload_gb=350`, `max_model_len=2048`, `gpu_util=0.95` | Worker died during startup before the detailed Python error was captured |

This vLLM + AWQ artifact path is close enough to the edge that I do not want to describe single-GH200 serving as supported. It may be fixable with a different offload path, a smaller quant, or a vLLM-side startup fix, but I do not have a clean result yet.

## The CPU/GGUF Result

The most hopeful follow-up was CPU serving. I *really* wanted to have GLM-5.2 do the slow heavy planning on CPU and have DeepSeek V4 Flash on GPU do the legwork.

Unsloth has a GLM-5.2 GGUF repo with llama.cpp examples and several quantization levels. The public size table lists:

| Quant | Listed size |
| --- | --- |
| `UD-IQ2_XXS` | 238 GB |
| `UD-Q3_K_M` | 343 GB |
| `UD-Q4_K_M` | 466 GB |
| `UD-Q5_K_M` | 561 GB |
| `Q8_0` | 801 GB |
| BF16 | 1.51 TB |

The dual Grace side has 960 GB of LPDDR5X. A 2-bit or 3-bit GGUF should fit entirely in CPU memory, and even Q4_K_M is plausible. If llama.cpp can run GLM-5.2 at a few tokens per second on CPU while leaving both Hoppers free, that unlocks the ultimate fast/slow combo from the intro on a single box:

| Role | Model | Hardware |
| --- | --- | --- |
| Fast worker | DeepSeek V4 Flash | dual Hoppers |
| Slow planner/reviewer | GLM-5.2 GGUF | Grace CPUs |

I started with `UD-IQ2_XXS`, a severely lobotomised model, because the question is *will this work*, not whether it’s smart.
The result is yes, but only with careful placement:

| Engine | Quant | Threads / NUMA | Prompt | Output | PP | TG |
| --- | --- | --- | --- | --- | --- | --- |
| llama.cpp 063d9c1 | UD-IQ2_XXS | node1 bind/membind, 72 threads | 256 | 128 | 9.65 tok/s | 3.13 tok/s |
| llama.cpp 063d9c1 | UD-IQ2_XXS | node1 bind/membind, 72 threads | 2048 | 128 | 3.87 tok/s | 3.62 tok/s |
| ik_llama.cpp 6c00e87 | UD-IQ2_XXS | node1 bind/membind, 72 threads | 256 | 128 | 51.54 tok/s | 3.65 tok/s |
| ik_llama.cpp 6c00e87 | UD-IQ2_XXS | node1 bind/membind, 72 threads | 2048 | 128 | 62.88 tok/s | 1.72 tok/s |

The memory footprint was about 234 GiB RSS. Both GPUs remained free.

The `ik_llama.cpp` result is worth separating from the serving conclusion. It is dramatically faster at prompt processing on this GGUF, and for a long-prompt batch it cut wall time from roughly eighteen minutes in my upstream llama.cpp run to under two minutes, but it did not improve the steady token stream. In the 2048-token prompt test, decode fell to 1.72 tok/s (*defo in the **useless** range*).

| Placement | Threads | Shape | PP | TG |
| --- | --- | --- | --- | --- |
| node0 bind/membind | 72 | 256->32 | 14.95 tok/s | 1.42 tok/s |
| node1 bind/membind | 72 | 256->32 | 13.45 tok/s | 4.30 tok/s |
| interleave 0,1 | 144 | 256->32 | 11.79 tok/s | 0.63 tok/s |
| default | 144 | 256->32 | 11.11 tok/s | 0.62 tok/s |

*For this GGUF and llama.cpp build, using both Grace CPUs was much worse than binding to node1.*

## Current Takeaways

The bandwidth guestimate turned out to be a useful ruler. The simple FP8 no-MTP ceiling suggested that a well-placed local-memory run should land around 15-18 output tok/s for a serialized batch-1 stream, with an *optimistic* two-module steady pipeline closer to 30-36 tok/s aggregate.
The measured FP8 non-MTP result, **about 20 output tok/s**, *is above the serialized estimate*: I speculate that the two GH200 modules get some cross-module pipeline overlap, landing between the no-overlap and ideal-overlap rows, or I have messed up the math.

The measured INT4 result is consistent with the same byte-rate story, just messier. Plain AWQ runs in the low-to-mid 20s output tok/s across the controlled `256->512` through `8192->512` sweep. That is better than FP8 in every comparable shape, but well short of the clean 2x the smaller byte stream might suggest, because AWQ/Marlin execution, dequantization, CUDA graph capture, vLLM scheduling, and MoE routing all eat into the savings.

The *real win* comes from the hacky graft: bolting on MTP-3 lifts the practical batch-1 default into the low 40s (43.39 tok/s median at `2048->512`) and reaches 54.92 aggregate output tok/s at concurrency 4. MTP-4 has faster best-case samples, but the acceptance collapses documented above keep it out of the default.

### Worst to Best

The footprint values are approximate model-weight footprints from the artifacts and checkpoint headers: about 1.51 TB for BF16, about 833 GiB for the FP8 artifact, and about 430 GiB for the AWQ artifact.

| Configuration | Approx footprint | Representative result | What changed |
| --- | --- | --- | --- |
| BF16 full weights | 1.51 TB | not run | Does not fit in 960 GB Grace memory |
| FP8, naive placement | ~833 GiB | 2.39 output tok/s | Cross-module transfers kill the run |
| FP8, strict local-NUMA offload | ~833 GiB | 20.31 output tok/s | Placement alone gives about an 8.5x speedup |
| FP8 + MTP-3, workload-tuned | ~833 GiB | 25.66 output tok/s | Speculation helps when the shape is right |
| AWQ INT4, plain | ~430 GiB | 24.03 output tok/s median at `2048->512` | Smaller stream and better base serving target |
| AWQ INT4 + grafted FP8 MTP head, MTP-3 stable | ~430 GiB | 43.39 tok/s single; 54.92 at concurrency 4 | Same base footprint; the gain comes from the graft, MTP-3, and high enough acceptance |

## Series Takeaway

Across the three posts, the useful deployment map is:

| Goal | Best answer from the series |
| --- | --- |
| Understand the box | Treat it as two fast GH200 modules joined by a much slower bridge |
| Fast local serving | DeepSeek V4 Flash Canada-Quant, MTP benchmarked, stable profile first |
| Largest vLLM model tested | GLM-5.2 INT4 with strict local-NUMA expert offload and experimental MTP graft |
| CPU-only huge model | GLM-5.2 IQ2 works, but only at low single-digit decode |
| Main hardware rule | Keep hot traffic local to each GH200 module |
