NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
December 14, 2025
Authors: Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, Weiran Shi, Zaiyuan Wang, Daoguang Zan, Chenchen Zhang, Xiaoxu Zhang, Qizhi Chen, Xianfu Cheng, Bo Deng, Qingshui Gu, Kai Hua, Juntao Lin, Pai Liu, Mingchen Li, Xuanguang Pan, Zifan Peng, Yujia Qin, Yong Shan, Zhewen Tan, Weihao Xie, Zihan Wang, Yishuo Yuan, Jiayu Zhang, Enduo Zhao, Yunfei Zhao, He Zhu, Chenyang Zou, Ming Ding, Jianpeng Jiao, Jiaheng Liu, Minghao Liu, Qian Liu, Chongyao Tao, Jian Yang, Tong Yang, Zhaoxiang Zhang, Xinjie Chen, Wenhao Huang, Ge Zhang
cs.AI
Abstract
Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.
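For illustration of the headline metric only: the benchmark scores an agent by the average pass rate of a held-out test suite run against the repository it generates. The sketch below is an assumption about how such a harness could look, not the NL2Repo Bench implementation; the function `evaluate_repo` and the paths `generated_repo` and `hidden_tests` are hypothetical names introduced here.

```python
# Illustrative sketch, not the NL2Repo Bench harness: install a candidate
# repository, run a hidden pytest suite against it, and report the test
# pass rate. Paths and names below are assumptions for demonstration.
import json
import re
import subprocess
import sys
from pathlib import Path


def evaluate_repo(repo_dir: Path, test_dir: Path) -> float:
    """Install the generated library from `repo_dir`, run the hidden tests
    in `test_dir`, and return the fraction of tests that pass."""
    # A repository that cannot be installed scores zero.
    install = subprocess.run(
        [sys.executable, "-m", "pip", "install", "-e", str(repo_dir)],
        capture_output=True, text=True,
    )
    if install.returncode != 0:
        return 0.0

    # Run pytest quietly and parse its summary line, e.g. "3 passed, 2 failed".
    result = subprocess.run(
        [sys.executable, "-m", "pytest", str(test_dir), "-q", "--tb=no"],
        capture_output=True, text=True,
    )
    passed = sum(int(n) for n in re.findall(r"(\d+) passed", result.stdout))
    failed = sum(int(n) for n in re.findall(r"(\d+) (?:failed|errors?)", result.stdout))
    total = passed + failed
    return passed / total if total else 0.0


if __name__ == "__main__":
    rate = evaluate_repo(Path("generated_repo"), Path("hidden_tests"))
    print(json.dumps({"test_pass_rate": rate}))
```

Under this reading, an "average test pass rate below 40%" means that, averaged over benchmark tasks, fewer than 40% of the hidden tests pass against the repository each agent produces.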