SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
September 21, 2025
Authors: Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, Brad Kenstler
cs.AI
Abstract
We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-Bench [25] but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-Bench. SWE-Bench Pro contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into a public set, with open access to problems sourced from 11 repositories; a held-out set of 12 repositories; and a commercial set of 18 proprietary repositories for which we have formal partnership agreements with early-stage startups. Problems in the held-out and commercial sets are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. In our evaluation of widely used coding models under a unified scaffold, we observe that their performance on SWE-Bench Pro remains below 25% (Pass@1), with GPT-5 achieving the highest score to date at 23.3%. To better understand these limitations, we cluster the failure modes observed in the collected agent trajectories to more clearly characterize the error patterns exhibited by current models. Overall, SWE-Bench Pro provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents operating at a professional level.
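
The Pass@1 figures quoted above are per-task resolution rates with a single submitted patch. The sketch below is not the authors' evaluation harness; it is a minimal illustration, with hypothetical names (`TaskResult`, `resolved`, `pass_at_1`), of how such a score is typically aggregated: a task counts as resolved only if the one submitted patch makes the task's verification tests pass.

```python
# Minimal sketch (illustrative only, not the SWE-Bench Pro harness) of a
# Pass@1 aggregation: one attempt per task, score = fraction of tasks resolved.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    resolved: bool  # True if the single submitted patch passed the task's tests


def pass_at_1(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved on the first (and only) attempt."""
    if not results:
        return 0.0
    return sum(r.resolved for r in results) / len(results)


# Hypothetical example: 233 of 1,000 attempted tasks resolved -> 23.3% Pass@1,
# the same scale as the GPT-5 number reported in the abstract.
results = [TaskResult(f"task-{i}", i < 233) for i in range(1000)]
print(f"Pass@1 = {pass_at_1(results):.1%}")
```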