SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
September 21, 2025
Authors: Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, Brad Kenstler
cs.AI
Abstract
We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-Bench [25] but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-Bench. SWE-Bench Pro contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into a public set, with open access to problems sourced from 11 repositories; a held-out set of 12 repositories; and a commercial set of 18 proprietary repositories for which we have formal partnership agreements with early-stage startups. Problems in the held-out and commercial sets are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. In our evaluation of widely used coding models under a unified scaffold, we observe that their performance on SWE-Bench Pro remains below 25% (Pass@1), with GPT-5 achieving the highest score to date at 23.3%. To better understand these limitations, we cluster the failure modes observed in the collected agent trajectories to more clearly characterize the error patterns exhibited by current models. Overall, SWE-Bench Pro provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents operating at a professional level.
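
The Pass@1 figures quoted above are per-task resolution rates with a single submitted patch. The sketch below is not the authors' evaluation harness; it is a minimal illustration, with hypothetical names (`TaskResult`, `resolved`, `pass_at_1`), of how such a score is typically aggregated: a task counts as resolved only if the one submitted patch makes the task's verification tests pass.

```python
# Minimal sketch (illustrative only, not the SWE-Bench Pro harness) of a
# Pass@1 aggregation: one attempt per task, score = fraction of tasks resolved.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    resolved: bool  # True if the single submitted patch passed the task's tests


def pass_at_1(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved on the first (and only) attempt."""
    if not results:
        return 0.0
    return sum(r.resolved for r in results) / len(results)


# Hypothetical example: 233 of 1,000 attempted tasks resolved -> 23.3% Pass@1,
# the same scale as the GPT-5 number reported in the abstract.
results = [TaskResult(f"task-{i}", i < 233) for i in range(1000)]
print(f"Pass@1 = {pass_at_1(results):.1%}")
```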