METR evaluated an early version of Claude Mythos Preview in March 2026 using its time-horizons task suite, estimating a 50% time horizon of at least 16 hours. That places the model at the upper end of what current benchmarks can measure, with caveats about the stability of estimates at longer time ranges.
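The 50% time horizon is the task length at which a model's estimated success rate falls to 50%, typically obtained by fitting a logistic curve of success against log task duration and solving for the crossover point. Below is a minimal sketch of that calculation; the run data is invented and this is not METR's actual pipeline.

```python
# Illustrative 50%-time-horizon estimate: fit a logistic curve of success
# probability against log task length, then solve for the length at which
# predicted success crosses 0.5. The runs below are made-up data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# (task length in minutes, 1 = success, 0 = failure) -- hypothetical runs
runs = [(2, 1), (5, 1), (15, 1), (30, 1), (60, 1),
        (120, 1), (240, 0), (480, 1), (960, 0), (1920, 0)]

X = np.log([[length] for length, _ in runs])   # log-duration feature
y = np.array([success for _, success in runs])

clf = LogisticRegression().fit(X, y)

# P(success) = 0.5 where the logit is zero: w * log(t) + b = 0
w, b = clf.coef_[0][0], clf.intercept_[0]
horizon_minutes = np.exp(-b / w)
print(f"Estimated 50% time horizon: {horizon_minutes / 60:.1f} hours")
```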
Stanford professor and CRFM director Percy Liang is highlighted for his influential work on AI model evaluation through HELM (Holistic Evaluation of Language Models).
OpenAI publishes a framework for business leaders on using AI evaluations (evals) to measure and improve AI system performance in organizational contexts, distinguishing between frontier evals for model development and contextual evals tailored to specific business workflows.
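In that framing, a contextual eval is just a set of workflow-specific test cases plus a grader, run against the model before and after changes. Here is a minimal sketch; the test cases, model choice, and string-match pass criterion are hypothetical, and only the chat-completions call mirrors the real OpenAI Python SDK.

```python
# Minimal sketch of a "contextual eval": score a model on cases drawn from
# one business workflow. Real evals would use richer graders (rubrics,
# model-based grading, human review) instead of keyword matching.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical workflow-specific cases: an input plus a keyword the
# answer must contain.
CASES = [
    {"input": "Summarize: Q3 revenue rose 12% on cloud growth.",
     "must_contain": "12%"},
    {"input": "Summarize: Churn fell to 3% after the pricing change.",
     "must_contain": "3%"},
]

def run_eval(model: str = "gpt-4o-mini") -> float:
    passed = 0
    for case in CASES:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["input"]}],
        )
        answer = response.choices[0].message.content or ""
        passed += case["must_contain"] in answer  # crude string-match grader
    return passed / len(CASES)

if __name__ == "__main__":
    print(f"pass rate: {run_eval():.0%}")
```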
OpenAI introduces IndQA, a benchmark of 2,278 questions across 12 Indian languages and 10 cultural domains, designed to evaluate AI models' understanding of culturally nuanced, reasoning-heavy tasks that existing benchmarks fail to capture. Created with 261 domain experts, IndQA addresses the saturation of multilingual benchmarks like MMMLU and focuses on real-world cultural comprehension rather than translation or multiple-choice tasks.
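Moving beyond multiple choice generally means grading free-form answers against expert-written rubrics. The sketch below shows the weighted-rubric scoring idea in isolation; the field names, weights, and example criteria are invented, and in practice a grader model (not a hardcoded flag) would judge whether each criterion is satisfied.

```python
# Sketch of rubric-based grading: each question carries expert-written
# criteria, and a response's score is the weighted share it satisfies.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # what a correct answer must cover
    weight: float      # expert-assigned importance
    satisfied: bool    # judged by a grader model in practice, not hardcoded

def score_response(criteria: list[Criterion]) -> float:
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.satisfied)
    return earned / total if total else 0.0

# Hypothetical grading of one answer to a culturally specific question
criteria = [
    Criterion("Names the correct festival", 2.0, True),
    Criterion("Explains its regional significance", 1.5, True),
    Criterion("Avoids conflating it with a similar festival", 1.0, False),
]
print(f"rubric score: {score_response(criteria):.2f}")  # 0.78
```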
OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research, covering 20 ICML 2024 papers decomposed into 8,316 gradable tasks. The best-performing model (Claude 3.5 Sonnet) achieves only a 21% replication score, below human PhD-level performance, highlighting current limitations in autonomous research capabilities.
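PaperBench aggregates those thousands of fine-grained requirements into a single replication score through a weighted rubric tree: leaves are pass/fail checks, and each internal node averages its children by weight. The toy tree and weights below are invented; only the aggregation scheme comes from the benchmark's description.

```python
# Weighted rubric-tree scoring: leaves are graded pass/fail, internal
# nodes take the weighted average of their children.
from dataclasses import dataclass, field

@dataclass
class Node:
    weight: float
    passed: bool | None = None          # set on leaves only
    children: list["Node"] = field(default_factory=list)

def score(node: Node) -> float:
    if not node.children:               # leaf: graded pass/fail
        return 1.0 if node.passed else 0.0
    total = sum(c.weight for c in node.children)
    return sum(c.weight * score(c) for c in node.children) / total

# Hypothetical paper: a code-development subtree plus an execution check
paper = Node(1.0, children=[
    Node(2.0, children=[Node(1.0, passed=True), Node(1.0, passed=False)]),
    Node(1.0, passed=True),
])
print(f"replication score: {score(paper):.2f}")  # 0.67
```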
OpenAI introduces SWE-Lancer, a benchmark of over 1,400 real-world freelance software engineering tasks from Upwork, collectively valued at $1 million USD, designed to evaluate AI model performance on practical engineering work and map model capabilities to economic value.
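The economic mapping is direct: each task carries the payout the freelance job actually offered, and a model's headline number is the total value of the tasks whose solutions pass verification. A toy version of that accounting, with invented tasks and prices:

```python
# Capability-to-dollars scoring: sum the payouts of tasks the model solved.
tasks = [
    {"id": "fix-login-redirect", "payout_usd": 250, "tests_passed": True},
    {"id": "migrate-payment-api", "payout_usd": 4000, "tests_passed": False},
    {"id": "add-dark-mode", "payout_usd": 500, "tests_passed": True},
]

earned = sum(t["payout_usd"] for t in tasks if t["tests_passed"])
available = sum(t["payout_usd"] for t in tasks)
print(f"earned ${earned:,} of ${available:,} ({earned / available:.0%})")
```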
Morgan Stanley has successfully deployed AI solutions powered by GPT-4 across its wealth management division, with over 98% of advisor teams using the internal AI Assistant chatbot. The deployment was enabled by a robust evaluation framework that tests AI performance on real-world use cases like document summarization and multilingual translation before production rollout.
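A pre-production evaluation framework of this kind typically acts as a gate: a candidate model must clear a minimum score on each use case before rollout. The sketch below illustrates that pattern; the use cases, scores, and thresholds are invented, not Morgan Stanley's.

```python
# Eval-gated deployment: every use case must meet its score threshold.
THRESHOLDS = {"document_summarization": 0.90, "translation": 0.85}

def gate(scores: dict[str, float]) -> bool:
    """Return True only if every use case meets its minimum score."""
    return all(scores.get(use_case, 0.0) >= minimum
               for use_case, minimum in THRESHOLDS.items())

candidate_scores = {"document_summarization": 0.93, "translation": 0.81}
print("deploy" if gate(candidate_scores) else "hold for review")  # hold
```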
Anthropic is donating its open-source AI alignment tool, Petri, to Meridian Labs and releasing version 3.0 with improved adaptability and realism for testing large language models.
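Tools like Petri automate alignment auditing with a three-role loop: an auditor model improvises probes from a seed scenario, the target model responds, and a judge model scores the transcript for concerning behavior. The sketch below shows that loop in generic form; every function here is a hypothetical stand-in, not Petri's actual API.

```python
# Generic automated-auditing loop: auditor probes, target responds,
# judge scores the resulting transcript.
import random

def auditor_model(seed: str, transcript: list) -> str:
    # Stand-in: a real auditor LLM would adapt each probe to the transcript.
    return f"[turn {len(transcript) + 1}] probing scenario: {seed}"

def target_model(probe: str) -> str:
    return f"response to: {probe}"  # stand-in for the model under audit

def judge_model(transcript: list) -> float:
    # Stand-in judge: a real one rates transcripts on rubric dimensions
    # such as deception or sycophancy, from 0.0 (benign) to 1.0 (concerning).
    return random.random()

def audit(seed: str, turns: int = 3) -> float:
    transcript = []
    for _ in range(turns):
        probe = auditor_model(seed, transcript)
        transcript.append((probe, target_model(probe)))
    return judge_model(transcript)

print(f"concern score: {audit('pressure the model to hide a test failure'):.2f}")
```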