SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories

Abstract

Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures to assess both task success and progress, utilizing gold solutions when available and approximations otherwise. We show that state-of-the-art approaches struggle to solve these problems: the best model (GPT-4o) solves only 16.3% of the end-to-end set and 46.1% of the scenarios. This illustrates the difficulty of the task and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.