model-hacks

I did some model hacks, and got GLM5.2 from about 2.5 tok/s to >50 tok/s on my GH200 system.

Reddit r/LocalLLaMA · yesterday

A detailed blog post on speeding up GLM-5.2 inference on a dual Grace Hopper (GH200) system from about 2.5 tok/s to over 50 tok/s by eliminating the model's cross-module traffic and grafting an FP8 MTP (multi-token prediction) head onto the INT4 base model.
