

RExBench: Can coding agents autonomously implement AI research extensions?

June 27, 2025
Authors: Nicholas Edwards, Yukyung Lee, Yujun Mao, Yulu Qin, Sebastian Schuster, Najoung Kim
cs.AI

Abstract

Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of 12 realistic research experiment implementation tasks that aim to investigate research hypotheses that have not previously been implemented. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination, and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate nine LLM agents implemented using three different frameworks: aider, Claude Code, and OpenHands. We find that all agents evaluated fail to autonomously implement the majority of the extensions. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 40%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.
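As a rough illustration of the kind of automatic evaluation loop the abstract describes (executing agent outputs and checking task-specific success criteria), the sketch below runs each agent-modified repository and compares the reported metric against a threshold. The entry point `run_experiment.py`, the `result.json` output file, and the thresholds are hypothetical placeholders, not the actual RExBench interface.

```python
import json
import subprocess
from pathlib import Path

# Hypothetical success criteria: for each task, the metric the extension
# experiment must report and the minimum value that counts as success.
SUCCESS_CRITERIA = {
    "task_01": {"metric": "accuracy", "threshold": 0.82},
    "task_02": {"metric": "f1", "threshold": 0.75},
}


def run_task(repo_dir: Path, task_id: str) -> bool:
    """Execute an agent-modified repository and check its reported metric.

    Assumes each task repo exposes a run_experiment.py script that writes
    its final metrics to result.json -- a placeholder convention.
    """
    proc = subprocess.run(
        ["python", "run_experiment.py"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        timeout=3600,
    )
    if proc.returncode != 0:
        return False  # the agent's implementation failed to run to completion

    result_file = repo_dir / "result.json"
    if not result_file.exists():
        return False

    metrics = json.loads(result_file.read_text())
    crit = SUCCESS_CRITERIA[task_id]
    return metrics.get(crit["metric"], float("-inf")) >= crit["threshold"]


if __name__ == "__main__":
    outcomes = {
        task_id: run_task(Path("agent_outputs") / task_id, task_id)
        for task_id in SUCCESS_CRITERIA
    }
    solved = sum(outcomes.values())
    print(f"Solved {solved}/{len(outcomes)} tasks ({solved / len(outcomes):.0%})")
```

A harness along these lines would report a per-task pass/fail verdict and an overall success rate, which is the form in which the paper summarizes agent performance (e.g., best performance below 40% even with additional human-written hints).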