
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

December 14, 2025
Authors: Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, Weiran Shi, Zaiyuan Wang, Daoguang Zan, Chenchen Zhang, Xiaoxu Zhang, Qizhi Chen, Xianfu Cheng, Bo Deng, Qingshui Gu, Kai Hua, Juntao Lin, Pai Liu, Mingchen Li, Xuanguang Pan, Zifan Peng, Yujia Qin, Yong Shan, Zhewen Tan, Weihao Xie, Zihan Wang, Yishuo Yuan, Jiayu Zhang, Enduo Zhao, Yunfei Zhao, He Zhu, Chenyang Zou, Ming Ding, Jianpeng Jiao, Jiaheng Liu, Minghao Liu, Qian Liu, Chongyao Tao, Jian Yang, Tong Yang, Zhaoxiang Zhang, Xinjie Chen, Wenhao Huang, Ge Zhang
cs.AI

Abstract

Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.
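To make the evaluation protocol concrete, below is a minimal sketch of how a repository-level pass-rate harness of this kind might work: install the agent-generated repository into a clean environment, run a held-out test suite against it, and report the fraction of tests that pass. This is an illustrative assumption, not the NL2Repo Bench implementation; the function name `evaluate_repo`, the directory layout, and the pytest-summary parsing are all hypothetical, since the abstract specifies only "fully installable" and "test pass rate" as grading criteria.

```python
"""Minimal sketch of a repository-level pass-rate harness.

Hypothetical: NL2Repo Bench's real grading pipeline is not specified in
the abstract; `evaluate_repo`, the directory layout, and the summary
parsing below are illustrative assumptions.
"""
import re
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def evaluate_repo(repo_dir: Path, hidden_tests: Path) -> float:
    """Install an agent-generated repo into a fresh venv and return the
    fraction of hidden tests that pass (0.0 if the repo won't install)."""
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "venv"
        venv.create(env_dir, with_pip=True)
        bindir = "Scripts" if sys.platform == "win32" else "bin"
        py = env_dir / bindir / "python"

        # "Fully installable" is taken to mean `pip install <repo>` succeeds
        # from the repository root, so an uninstallable repo scores zero.
        install = subprocess.run(
            [str(py), "-m", "pip", "install", str(repo_dir), "pytest"],
            capture_output=True, text=True,
        )
        if install.returncode != 0:
            return 0.0

        # Run the hidden test suite; -q --tb=no reduces output to the
        # one-line summary, e.g. "3 failed, 7 passed in 1.02s".
        run = subprocess.run(
            [str(py), "-m", "pytest", str(hidden_tests), "-q", "--tb=no"],
            capture_output=True, text=True,
        )
        passed = sum(int(n) for n in re.findall(r"(\d+) passed", run.stdout))
        failed = sum(
            int(n) for n in re.findall(r"(\d+) (?:failed|errors?)", run.stdout)
        )
        total = passed + failed
        return passed / total if total else 0.0


if __name__ == "__main__":
    # Usage: python harness.py <generated_repo_dir> <hidden_tests_dir>
    rate = evaluate_repo(Path(sys.argv[1]), Path(sys.argv[2]))
    print(f"test pass rate: {rate:.1%}")
```

Isolating each run in a throwaway virtual environment matters here: it prevents a broken or malicious generated repository from polluting the grader's environment, and it makes the installability check itself part of the score rather than a precondition.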