
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

October 9, 2024
Authors: Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry
cs.AI

Abstract

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.
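To make the headline metric concrete, the sketch below illustrates one way an agent's final score could be placed on a Kaggle leaderboard and compared against a bronze-medal cutoff. This is not the official mle-bench grading code; the cutoff tiers are a simplified approximation of Kaggle's published medal rules, and all names here are illustrative.

```python
# Illustrative sketch only (not the mle-bench grader): estimate whether an
# agent's score would reach at least bronze-medal level on a competition's
# final leaderboard. Cutoff tiers roughly follow Kaggle's progression rules
# but are simplified for this example.

def bronze_cutoff(num_teams: int) -> int:
    """Worst (largest) 1-indexed rank that still earns a bronze medal."""
    if num_teams < 250:
        return max(1, int(num_teams * 0.40))  # roughly: top 40% of teams
    if num_teams < 1000:
        return 100                            # roughly: top 100 teams
    return max(1, int(num_teams * 0.10))      # roughly: top 10% of teams


def achieves_bronze(agent_score: float, leaderboard: list[float],
                    higher_is_better: bool = True) -> bool:
    """Insert the agent's score into the leaderboard and compare its
    implied rank to the bronze cutoff."""
    if higher_is_better:
        rank = 1 + sum(s > agent_score for s in leaderboard)
    else:
        rank = 1 + sum(s < agent_score for s in leaderboard)
    return rank <= bronze_cutoff(len(leaderboard))


if __name__ == "__main__":
    # Toy leaderboard of 10 teams scored by accuracy (higher is better).
    board = [0.91, 0.89, 0.88, 0.85, 0.84, 0.80, 0.79, 0.75, 0.70, 0.65]
    # Agent score 0.86 -> rank 4 of 10 -> within the top-40% cutoff -> True.
    print(achieves_bronze(0.86, board))
```

Applied per competition, a check of this kind yields the paper's headline statistic: the fraction of the 75 competitions in which an agent setup reaches at least bronze-medal level (16.9% for o1-preview with AIDE scaffolding).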
