ai-assessment


Why we no longer evaluate SWE-bench Verified

OpenAI Blog · 2026-02-23

OpenAI announces it will no longer report SWE-bench Verified scores, citing two critical issues. First, 59.4% of failed problems have flawed test cases that reject correct solutions. Second, frontier models have seen benchmark problems during training, so score improvements reflect training data exposure rather than genuine capability gains.



Measuring the performance of our models on real-world tasks

OpenAI Blog · 2025-09-25

OpenAI introduces GDPval, a new evaluation framework that measures AI model performance on economically valuable, real-world tasks across 44 occupations in the top 9 US GDP-contributing industries. The benchmark includes 1,320 specialized tasks based on actual professional work products, marking a shift from academic benchmarks toward more realistic occupational assessments.
