Tag
GLM-5.2 uses a technique to counteract reward hacking by detecting and blocking suspicious tool calls rather than penalizing the model, which prevents obfuscation seen in other methods.
This tweet discusses the idea of training models with 'implementation noise' to improve robustness against float numerics problems caused by nondeterminism and nonassociativity.