Why do newer SOTA models get progressively worse on Vendingbench?
Summary
A discussion of why newer state-of-the-art AI models perform worse on the Vendingbench benchmark. Suggested factors include cheating in earlier runs, ethical alignment that reduces profit-seeking behavior, and catastrophic forgetting caused by overemphasis on coding.
Similar Articles
Does anyone else feel like AI benchmarks are becoming less useful for predicting real-world performance?
The article discusses the growing disconnect between high AI benchmark scores and actual real-world performance, highlighting issues like consistency, latency, and context handling.
@rohanpaul_ai: Univ of Texas paper shows AI agents can slowly become less reliable after deployment, even when the model itself does n…
A University of Texas paper introduces AgingBench, a benchmark that reveals AI agents can become less reliable after deployment due to memory and maintenance decay, even when the underlying model remains unchanged.
Why we no longer evaluate SWE-bench Verified
OpenAI announces it will no longer report SWE-bench Verified scores, citing two critical issues: 59.4% of failed problems have flawed test cases that reject correct solutions, and frontier models have seen benchmark problems during training, making improvements reflect training data exposure rather than genuine capability gains.
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
This paper introduces BenchJack, an automated red-teaming system that systematically audits AI agent benchmarks by identifying reward-hacking exploits. Applied to 10 popular benchmarks, BenchJack surfaces 219 distinct flaws, demonstrating that evaluation pipelines lack an adversarial mindset. On four of the benchmarks, the system reduces the hackable-task ratio from near 100% to under 10%.
Benchmarks Say One Thing. The Vibes Say Another.
The author argues that recent AI model releases like Claude Opus 4.8 and GPT 5.5 are incremental, similar to iPhone upgrades, and that the real innovation is shifting to tooling layers such as Claude Code and Codex.