METR evaluated an early version of Claude Mythos Preview in March 2026 using its time-horizons task suite, estimating a 50% time horizon of at least 16 hours. That places the model at the upper end of what current benchmarks can measure, with caveats about the stability of estimates at longer time ranges.
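The 50% time horizon is the task length at which a model's estimated success rate falls to 50%, typically obtained by fitting a logistic curve of success against log task duration and solving for the crossover point. Below is a minimal sketch of that calculation; the run data is invented and this is not METR's actual pipeline.

```python
# Illustrative 50%-time-horizon estimate: fit a logistic curve of success
# probability against log task length, then solve for the length at which
# predicted success crosses 0.5. The runs below are made-up data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# (task length in minutes, 1 = success, 0 = failure) -- hypothetical runs
runs = [(2, 1), (5, 1), (15, 1), (30, 1), (60, 1),
        (120, 1), (240, 0), (480, 1), (960, 0), (1920, 0)]

X = np.log([[length] for length, _ in runs])   # log-duration feature
y = np.array([success for _, success in runs])

clf = LogisticRegression().fit(X, y)

# P(success) = 0.5 where the logit is zero: w * log(t) + b = 0
w, b = clf.coef_[0][0], clf.intercept_[0]
horizon_minutes = np.exp(-b / w)
print(f"Estimated 50% time horizon: {horizon_minutes / 60:.1f} hours")
```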
Stanford professor and CRFM director Percy Liang is highlighted for his influential work on AI model evaluation through HELM (Holistic Evaluation of Language Models).
OpenAI publishes a framework for business leaders on using AI evaluations (evals) to measure and improve AI system performance in organizational contexts, distinguishing between frontier evals for model development and contextual evals tailored to specific business workflows.
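In that framing, a contextual eval is just a set of workflow-specific test cases plus a grader, run against the model before and after changes. Here is a minimal sketch; the test cases, model choice, and string-match pass criterion are hypothetical, and only the chat-completions call mirrors the real OpenAI Python SDK.

```python
# Minimal sketch of a "contextual eval": score a model on cases drawn from
# one business workflow. Real evals would use richer graders (rubrics,
# model-based grading, human review) instead of keyword matching.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical workflow-specific cases: an input plus a keyword the
# answer must contain.
CASES = [
    {"input": "Summarize: Q3 revenue rose 12% on cloud growth.",
     "must_contain": "12%"},
    {"input": "Summarize: Churn fell to 3% after the pricing change.",
     "must_contain": "3%"},
]

def run_eval(model: str = "gpt-4o-mini") -> float:
    passed = 0
    for case in CASES:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["input"]}],
        )
        answer = response.choices[0].message.content or ""
        passed += case["must_contain"] in answer  # crude string-match grader
    return passed / len(CASES)

if __name__ == "__main__":
    print(f"pass rate: {run_eval():.0%}")
```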
OpenAI introduces IndQA, a benchmark of 2,278 questions across 12 Indian languages and 10 cultural domains, designed to evaluate AI models' understanding of culturally nuanced, reasoning-heavy tasks that existing benchmarks fail to capture. Created with 261 domain experts, IndQA addresses the saturation of multilingual benchmarks like MMMLU and focuses on real-world cultural comprehension rather than translation or multiple-choice tasks.
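Moving beyond multiple choice generally means grading free-form answers against expert-written rubrics. The sketch below shows the weighted-rubric scoring idea in isolation; the field names, weights, and example criteria are invented, and in practice a grader model (not a hardcoded flag) would judge whether each criterion is satisfied.

```python
# Sketch of rubric-based grading: each question carries expert-written
# criteria, and a response's score is the weighted share it satisfies.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # what a correct answer must cover
    weight: float      # expert-assigned importance
    satisfied: bool    # judged by a grader model in practice, not hardcoded

def score_response(criteria: list[Criterion]) -> float:
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.satisfied)
    return earned / total if total else 0.0

# Hypothetical grading of one answer to a culturally specific question
criteria = [
    Criterion("Names the correct festival", 2.0, True),
    Criterion("Explains its regional significance", 1.5, True),
    Criterion("Avoids conflating it with a similar festival", 1.0, False),
]
print(f"rubric score: {score_response(criteria):.2f}")  # 0.78
```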
OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research, covering 20 ICML 2024 papers decomposed into 8,316 gradable tasks. The best-performing model (Claude 3.5 Sonnet) achieves only a 21% replication score, below human PhD-level performance, highlighting current limitations in autonomous research capabilities.
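PaperBench aggregates those thousands of fine-grained requirements into a single replication score through a weighted rubric tree: leaves are pass/fail checks, and each internal node averages its children by weight. The toy tree and weights below are invented; only the aggregation scheme comes from the benchmark's description.

```python
# Weighted rubric-tree scoring: leaves are graded pass/fail, internal
# nodes take the weighted average of their children.
from dataclasses import dataclass, field

@dataclass
class Node:
    weight: float
    passed: bool | None = None          # set on leaves only
    children: list["Node"] = field(default_factory=list)

def score(node: Node) -> float:
    if not node.children:               # leaf: graded pass/fail
        return 1.0 if node.passed else 0.0
    total = sum(c.weight for c in node.children)
    return sum(c.weight * score(c) for c in node.children) / total

# Hypothetical paper: a code-development subtree plus an execution check
paper = Node(1.0, children=[
    Node(2.0, children=[Node(1.0, passed=True), Node(1.0, passed=False)]),
    Node(1.0, passed=True),
])
print(f"replication score: {score(paper):.2f}")  # 0.67
```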
OpenAI introduces SWE-Lancer, a benchmark of over 1,400 real-world freelance software engineering tasks from Upwork, collectively valued at $1 million USD, designed to evaluate AI model performance on practical engineering work and map model capabilities to economic value.
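The economic mapping is direct: each task carries the payout the freelance job actually offered, and a model's headline number is the total value of the tasks whose solutions pass verification. A toy version of that accounting, with invented tasks and prices:

```python
# Capability-to-dollars scoring: sum the payouts of tasks the model solved.
tasks = [
    {"id": "fix-login-redirect", "payout_usd": 250, "tests_passed": True},
    {"id": "migrate-payment-api", "payout_usd": 4000, "tests_passed": False},
    {"id": "add-dark-mode", "payout_usd": 500, "tests_passed": True},
]

earned = sum(t["payout_usd"] for t in tasks if t["tests_passed"])
available = sum(t["payout_usd"] for t in tasks)
print(f"earned ${earned:,} of ${available:,} ({earned / available:.0%})")
```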
Morgan Stanley has successfully deployed AI solutions powered by GPT-4 across its wealth management division, with over 98% of advisor teams using the internal AI Assistant chatbot. The deployment was enabled by a robust evaluation framework that tests AI performance on real-world use cases like document summarization and multilingual translation before production rollout.
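A pre-production evaluation framework of this kind typically acts as a gate: a candidate model must clear a minimum score on each use case before rollout. The sketch below illustrates that pattern; the use cases, scores, and thresholds are invented, not Morgan Stanley's.

```python
# Eval-gated deployment: every use case must meet its score threshold.
THRESHOLDS = {"document_summarization": 0.90, "translation": 0.85}

def gate(scores: dict[str, float]) -> bool:
    """Return True only if every use case meets its minimum score."""
    return all(scores.get(use_case, 0.0) >= minimum
               for use_case, minimum in THRESHOLDS.items())

candidate_scores = {"document_summarization": 0.93, "translation": 0.81}
print("deploy" if gate(candidate_scores) else "hold for review")  # hold
```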
Anthropic is donating its open-source AI alignment tool, Petri, to Meridian Labs and releasing version 3.0 with improved adaptability and realism for testing large language models.
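Tools like Petri automate alignment auditing with a three-role loop: an auditor model improvises probes from a seed scenario, the target model responds, and a judge model scores the transcript for concerning behavior. The sketch below shows that loop in generic form; every function here is a hypothetical stand-in, not Petri's actual API.

```python
# Generic automated-auditing loop: auditor probes, target responds,
# judge scores the resulting transcript.
import random

def auditor_model(seed: str, transcript: list) -> str:
    # Stand-in: a real auditor LLM would adapt each probe to the transcript.
    return f"[turn {len(transcript) + 1}] probing scenario: {seed}"

def target_model(probe: str) -> str:
    return f"response to: {probe}"  # stand-in for the model under audit

def judge_model(transcript: list) -> float:
    # Stand-in judge: a real one rates transcripts on rubric dimensions
    # such as deception or sycophancy, from 0.0 (benign) to 1.0 (concerning).
    return random.random()

def audit(seed: str, turns: int = 3) -> float:
    transcript = []
    for _ in range(turns):
        probe = auditor_model(seed, transcript)
        transcript.append((probe, target_model(probe)))
    return judge_model(transcript)

print(f"concern score: {audit('pressure the model to hide a test failure'):.2f}")
```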