MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Summary
OpenAI introduces MLE-bench, a benchmark of 75 Kaggle ML competitions to evaluate AI agents on real-world ML engineering tasks. The best setup, o1-preview with AIDE scaffolding, achieves at least a Kaggle bronze medal in 16.9% of competitions.
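To make the headline number concrete, here is a minimal illustrative sketch (not the official MLE-bench grader) of how the "any medal" rate can be computed from per-competition outcomes; the CompetitionResult structure, medal labels, and competition names below are hypothetical placeholders.

```python
# Illustrative sketch: fraction of competitions where an agent reached at
# least a Kaggle bronze medal. All names and data here are hypothetical.
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class CompetitionResult:
    competition: str
    medal: Optional[str]  # "gold", "silver", "bronze", or None

def medal_rate(results: List[CompetitionResult]) -> float:
    """Return the share of competitions with at least a bronze medal."""
    medals = {"gold", "silver", "bronze"}
    earned = sum(1 for r in results if r.medal in medals)
    return earned / len(results)

# Hypothetical example: 3 medals out of 4 attempted competitions -> 75.0%
results = [
    CompetitionResult("competition-a", "bronze"),
    CompetitionResult("competition-b", None),
    CompetitionResult("competition-c", "silver"),
    CompetitionResult("competition-d", "gold"),
]
print(f"Medal rate: {medal_rate(results):.1%}")
```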
Similar Articles
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
This paper introduces MLS-Bench, a benchmark designed to assess whether AI systems can invent generalizable and scalable machine learning methods rather than just performing engineering tuning.
@sherryyangML: Machine learning engineering (MLE) is the new agentic frontier. I'll be sharing our work on scaling RL for MLE agents a…
Two ICLR 2026 papers show how small RL-trained agents outperform frontier models on machine learning engineering tasks and how MLE-Smith automatically scales MLE workloads.
SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
SkillLearnBench introduces the first benchmark for evaluating continual skill learning in LLM agents across 20 real-world tasks, revealing that no method dominates and scaling LLMs does not guarantee better skills.
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
AJ-Bench introduces a benchmark to evaluate Agent-as-a-Judge systems that interact with environments to verify agent behaviors across 155 tasks in search, data systems, and GUI domains.
@KLieret: You can evaluate on ProgramBench yourself: https://github.com/facebookresearch/ProgramBench/… We will open the leaderbo…
ProgramBench is a new benchmark that tests AI agents' ability to reconstruct a complete codebase from a compiled binary and its documentation. The leaderboard will open for submissions soon.