MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

October 9, 2024
Authors: Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry
cs.AI

Abstract

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.
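The headline metric is the share of competitions in which an agent's submission scores at or above the bronze-medal cutoff implied by the competition's public leaderboard. The sketch below shows one simplified way such a medal check could be computed; the function name `medal_for_score`, the flat 10%/20%/40% percentile cutoffs, and the toy leaderboard are illustrative assumptions and do not reproduce the grading code in the mle-bench repository (Kaggle's real medal boundaries also depend on the number of competing teams).

```python
# Illustrative sketch only -- NOT the actual mle-bench grading logic.
# Rank a hypothetical agent score against a public leaderboard and map
# the resulting percentile to a medal tier using simplified cutoffs.

def medal_for_score(agent_score: float, leaderboard_scores: list[float],
                    higher_is_better: bool = True) -> str | None:
    """Return 'gold', 'silver', 'bronze', or None for the agent's score."""
    # Count leaderboard entries that strictly beat the agent.
    beaten_by = sum(
        (s > agent_score) if higher_is_better else (s < agent_score)
        for s in leaderboard_scores
    )
    rank = beaten_by + 1                      # 1-based rank among humans + agent
    n_teams = len(leaderboard_scores) + 1
    percentile = rank / n_teams
    if percentile <= 0.10:
        return "gold"
    if percentile <= 0.20:
        return "silver"
    if percentile <= 0.40:
        return "bronze"
    return None


# Example: an agent score of 0.91 against a small toy leaderboard (higher is better).
print(medal_for_score(0.91, [0.95, 0.93, 0.90, 0.88, 0.85, 0.80, 0.75, 0.70, 0.60, 0.50]))
# -> 'bronze' (rank 3 of 11, i.e. top ~27%)
```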
