Why do newer SOTA models get progressively worse on Vendingbench?

Reddit r/singularity News

Summary

A discussion on why newer state-of-the-art AI models are performing worse on the Vendingbench benchmark, suggesting factors such as cheating in earlier runs, ethical alignment reducing profit-seeking behavior, and catastrophic forgetting due to overemphasis on coding.

I have multiple theories, but I thought maybe you guys already know more. Things that I consider possible (but not certain) factors:

1. Models like Opus 4.5 cheated in earlier runs, and the team didn’t normalise scores/gains for these behaviours, even though they don’t reflect anything useful regarding the core objectives of the benchmark.
2. Maybe ethical alignment reshapes the financial performance goals so that the model strives for fairer pricing, refund conditions and so on.
3. Because of the shorter training cycles induced by the hype train, models get systematically pushed towards high-reward domains like coding without enough balancing of less prominent areas, which should cause disruption like ‘catastrophic forgetting’, including of specific skills. Managing a full business isn’t something I would call a popular LLM use case these days.

Depending on which factors actually come into play, the degradation is either a bad sign or actually somewhat of an improvement. What do you think?
