Human Evaluation of GLM-5.2

Reddit r/LocalLLaMA Models

Summary

The author praises GLM-5.2, an MIT-licensed open-weights model, for its strong real-world performance and its high placement on Design Arena's human-voted leaderboard, claiming it rivals the best closed-source models, such as Claude.

I've seen plenty of benchmarks that put GLM-5.2 below many of the closed-source alternatives, but right at their heels. I thought to myself, the next version of GLM will totally be where the best frontier models are now. For the last few days I've been testing it on a real-world project, and it's basically goated in my view. I wish I could run it locally, but I've seen some madlads around here with the hardware that could.

Today I ran into Design Arena's leaderboard for the first time. This is what OpenRouter bases its benchmark numbers on, and it's based on human voting! You can plug in that doner kebab test there and vote on the most delicious-looking 🍢

In Game Dev, GLM-5.2 is one step below Fable 5. And in almost every category, GLM-5.2 is kicking tokens and taking names. In some of the tests, it's right below Fable, which for all intents and purposes is MIA.

Therefore GLM-5.2, the MIT open-weights model, is in my view equivalent to the best models Claude has today 😳👏 I think we just won.

So I guess most standardized benchmarks really don't reflect real-world performance anymore, either because they're based on old assumptions and expectations or simply because they're being blatantly gamed.
