RExBench: Can coding agents autonomously implement AI research extensions?
June 27, 2025
Authors: Nicholas Edwards, Yukyung Lee, Yujun Mao, Yulu Qin, Sebastian Schuster, Najoung Kim
cs.AI
Abstract
Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that implementing research extensions is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of 12 realistic research experiment implementation tasks that aim to investigate research hypotheses that have not previously been implemented. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate nine LLM agents implemented using three different frameworks: aider, Claude Code, and OpenHands. We find that all evaluated agents fail to autonomously implement the majority of the extensions. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 40%. This indicates that current agents still fall short of handling realistic research extension tasks without substantial human guidance.
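
The abstract describes an execution-based evaluation: agent outputs are run, and the results are checked against expert-defined success criteria. The paper's actual harness is not shown here, but the following minimal Python sketch illustrates what such a check could look like, assuming each task ships with a run script and a file of expected metrics. All file names (run_experiment.sh, results.json, expected.json), the patch-based workflow, and the numeric tolerance are hypothetical illustrations, not the authors' implementation.

# Illustrative sketch only: apply an agent-generated patch to a clean
# copy of the task repository, execute the extension experiment, and
# compare the produced metrics against expert-defined expected values.
import json
import subprocess
from pathlib import Path

TOLERANCE = 1e-3  # assumed tolerance for matching expected metrics

def evaluate_task(task_dir: Path, agent_patch: Path) -> bool:
    # Apply the agent's diff to the task's codebase.
    subprocess.run(
        ["git", "-C", str(task_dir), "apply", str(agent_patch)],
        check=True,
    )
    # Execute the experiment; the harness only inspects its outputs,
    # not the agent's code changes directly.
    subprocess.run(["bash", str(task_dir / "run_experiment.sh")], check=True)
    produced = json.loads((task_dir / "results.json").read_text())
    expected = json.loads((task_dir / "expected.json").read_text())
    # Success criterion: every expected metric is reproduced within
    # tolerance (a missing metric counts as failure via KeyError guard).
    return all(
        k in produced and abs(produced[k] - v) <= TOLERANCE
        for k, v in expected.items()
    )

Because success is judged only by executing outputs and comparing results, a harness of this shape stays agnostic to how the agent wrote its code, which is also what makes the benchmark resistant to solutions memorized from training data.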