OpenAI announces it will no longer report SWE-bench Verified scores, citing two critical issues. First, 59.4% of failed problems have flawed test cases that reject correct solutions. Second, frontier models have seen benchmark problems during training, so score improvements reflect training data exposure rather than genuine capability gains.
OpenAI introduces GDPval, a new evaluation framework that measures AI model performance on economically valuable, real-world tasks across 44 occupations in the 9 industries that contribute most to US GDP. The benchmark includes 1,320 specialized tasks based on actual professional work products, marking a shift from academic benchmarks toward more realistic occupational assessments.