MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Summary
OpenAI introduces MLE-bench, a benchmark of 75 Kaggle ML competitions to evaluate AI agents on real-world ML engineering tasks. The best setup, o1-preview with AIDE scaffolding, achieves at least a Kaggle bronze medal in 16.9% of competitions.
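To make the headline number concrete, here is a minimal illustrative sketch (not the official MLE-bench grader) of how the "any medal" rate can be computed from per-competition outcomes; the CompetitionResult structure, medal labels, and competition names below are hypothetical placeholders.

```python
# Illustrative sketch: fraction of competitions where an agent reached at
# least a Kaggle bronze medal. All names and data here are hypothetical.
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class CompetitionResult:
    competition: str
    medal: Optional[str]  # "gold", "silver", "bronze", or None

def medal_rate(results: List[CompetitionResult]) -> float:
    """Return the share of competitions with at least a bronze medal."""
    medals = {"gold", "silver", "bronze"}
    earned = sum(1 for r in results if r.medal in medals)
    return earned / len(results)

# Hypothetical example: 3 medals out of 4 attempted competitions -> 75.0%
results = [
    CompetitionResult("competition-a", "bronze"),
    CompetitionResult("competition-b", None),
    CompetitionResult("competition-c", "silver"),
    CompetitionResult("competition-d", "gold"),
]
print(f"Medal rate: {medal_rate(results):.1%}")
```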
Similar Articles
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
This paper introduces MLS-Bench, a benchmark designed to assess whether AI systems can invent generalizable and scalable machine learning methods rather than just performing engineering tuning.
@sherryyangML: Machine learning engineering (MLE) is the new agentic frontier. I'll be sharing our work on scaling RL for MLE agents a…
Two ICLR 2026 papers show how small RL-trained agents outperform frontier models on machine learning engineering tasks and how MLE-Smith automatically scales MLE workloads.
SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks
SkillLearnBench introduces the first benchmark for evaluating continual skill learning in LLM agents across 20 real-world tasks, revealing that no method dominates and scaling LLMs does not guarantee better skills.
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
AJ-Bench introduces a benchmark to evaluate Agent-as-a-Judge systems that interact with environments to verify agent behaviors across 155 tasks in search, data systems, and GUI domains.
@KLieret: You can evaluate on ProgramBench yourself: https://github.com/facebookresearch/ProgramBench/… We will open the leaderbo…
ProgramBench is a new benchmark that tests AI agents' ability to reconstruct a complete codebase from a compiled binary and its documentation. The leaderboard will open for submissions soon.