MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Summary
OpenAI introduces MLE-bench, a benchmark of 75 Kaggle ML competitions to evaluate AI agents on real-world ML engineering tasks. The best setup, o1-preview with AIDE scaffolding, achieves at least a Kaggle bronze medal in 16.9% of competitions.
Similar Articles
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
This paper introduces MLS-Bench, a benchmark designed to assess whether AI systems can invent generalizable and scalable machine learning methods rather than just performing engineering tuning.
@sherryyangML: Machine learning engineering (MLE) is the new agentic frontier. I'll be sharing our work on scaling RL for MLE agents a…
Two ICLR 2026 papers show how small RL-trained agents outperform frontier models on machine-learning engineering tasks and how MLE-Smith automatically scales MLE workloads.
SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
SkillLearnBench introduces the first benchmark for evaluating continual skill learning in LLM agents across 20 real-world tasks, revealing that no method dominates and scaling LLMs does not guarantee better skills.
@OkhayIea: Everyone's racing to build "AI scientists." So we asked a blunt question: Can today's best coding agents beat the publi…
Introduces NatureBench, a cross-disciplinary benchmark of 90 tasks from Nature papers to test AI coding agents, finding the best agent (Claude Opus 4.7) surpasses SOTA on only 17.8% of tasks and often succeeds by reducing science to supervised ML rather than genuine discovery.
MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs
MLUBench is a large-scale benchmark for lifelong unlearning in multimodal large language models (MLLMs), featuring 127 entities across 9 classes. The paper finds that existing unlearning methods suffer from cumulative degradation and proposes LUMoE to mitigate it, reporting significant improvements.