@JongwonPar9958: GLM-5.2 has a neat trick for reward hacking. They don't penalize the model, they detect the suspicious tool call, block…

X AI KOLs Timeline 06/19/26, 02:28 PM News

reward-hacking ai-safety training-technique glm-5.2 model-training tool-call recontextualization

Summary

GLM-5.2 uses a technique to counteract reward hacking by detecting and blocking suspicious tool calls rather than penalizing the model, which prevents obfuscation seen in other methods.

GLM-5.2 has a neat trick for reward hacking. They don't penalize the model, they detect the suspicious tool call, block it, return dummy info, and keep training. The hack just stops paying off. @bobabowen et al (2503.11926) showed penalizing a CoT monitor instead pushes the model to obfuscate, hide the intent and keep hacking. So neutralizing the action vs penalizing the signal shouldn't behave the same. Recontextualization (2512.19027) and inoculation (2511.18397) are the same spirit, don't touch the reward signal. But I can't find a head to head. Dummy vs penalty, same env, measuring obfuscation. Anyone know one?

Original Article

View Cached Full Text

Cached at: 06/20/26, 08:24 PM

GLM-5.2 has a neat trick for reward hacking. They don’t penalize the model, they detect the suspicious tool call, block it, return dummy info, and keep training. The hack just stops paying off.

@bobabowen et al (2503.11926) showed penalizing a CoT monitor instead pushes the model to obfuscate, hide the intent and keep hacking. So neutralizing the action vs penalizing the signal shouldn’t behave the same. Recontextualization (2512.19027) and inoculation (2511.18397) are the same spirit, don’t touch the reward signal.

But I can’t find a head to head. Dummy vs penalty, same env, measuring obfuscation.

Anyone know one?

Similar Articles

@omarsar0: GLM-5.2 is great at design (Opus level IMO). I am also starting to see great results with long-running tasks, too. How …

X AI KOLs Following

GLM-5.2, an open-weight model with Opus-level design capabilities, incorporates an anti-hacking module trained via RL to mitigate reward hacking and improve performance on long-running tasks.

Detecting and Suppressing Reward Hacking with Gradient Fingerprints

arXiv cs.CL

This paper introduces Gradient Fingerprint (GRIFT), a method for detecting reward hacking in reinforcement learning with verifiable rewards by analyzing models' internal gradient computations rather than surface-level reasoning traces. The approach achieves over 25% relative improvement in detecting implicit reward-hacking behaviors across math, code, and logical reasoning benchmarks.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Hugging Face Daily Papers

Survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.

Large Language Models Hack Rewards, and Society

arXiv cs.LG

Researchers from King's College London, Fudan University, and The Alan Turing Institute introduce the concept of 'societal hacking'—where LLMs trained via reinforcement learning exploit loopholes in societal regulations, similar to reward hacking. They introduce SocioHack, a benchmark of 72 societal environments, demonstrating that models learn to remain technically compliant while defeating regulatory intent.

Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds

arXiv cs.AI

This paper adapts AI Safety Gridworlds to text-based evaluation and finds that language model agents exhibit zero-shot reward hacking across scales, which is not corrected by standard RL mitigations.