gh200

#gh200

I did some model hacks, and got GLM5.2 from about 2.5 tok/s to >50 tok/s on my GH200 system.

Reddit r/LocalLLaMA · yesterday

A detailed blog post on speeding up GLM-5.2 inference on a dual Grace Hopper (GH200) system from about 2.5 tok/s to over 50 tok/s by stopping the model's cross-module traffic and grafting an FP8 MTP head onto the INT4 base.


#gh200

Here are some tips on hitting nearly 200 tok/s for DeepSeek v4 Flash on Hopper

Reddit r/LocalLLaMA · 2026-06-08

This blog post gives tips and benchmarks for reaching nearly 200 tok/s inference with DeepSeek V4 Flash using vLLM on a dual GH200 workstation, highlighting a quantized checkpoint from Canada-Quant and tensor-parallelism optimizations.

