SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

September 21, 2025
作者: Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, Brad Kenstler
cs.AI

Abstract

We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-BENCH [25], but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-BENCH. SWE-BENCH PRO contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into a public set with open access to problems sourced from 11 repositories, a held-out set of 12 repositories and a commercial set of 18 proprietary repositories where we have formal partnership agreements with early-stage startups. Problems in the held-out and the commercial set are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. In our evaluation of widely used coding models, under a unified scaffold, we observe that their performance on SWE-Bench PRO remains below 25% (Pass@1), with GPT-5 achieving the highest score to date at 23.3%. To better understand these limitations, we cluster the failure modes observed in the collected agent trajectories for a clearer characterization of the error patterns exhibited by current models. Overall, SWE-BENCH PRO provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents at a professional level.
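The abstract reports model performance as Pass@1. As a point of reference, the sketch below is a minimal, hypothetical Python illustration of how Pass@1 is commonly computed for a benchmark of this kind: one attempt per task, counted as resolved only if the generated patch passes the repository's tests. The `TaskResult` type and the example numbers are assumptions for illustration, not the paper's actual evaluation harness or data.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of a single attempt on one benchmark task (illustrative type)."""
    task_id: str
    resolved: bool  # True if the single attempted patch passed all tests


def pass_at_1(results: list[TaskResult]) -> float:
    """Fraction of tasks solved with exactly one attempt per task."""
    if not results:
        return 0.0
    return sum(r.resolved for r in results) / len(results)


# Illustrative arithmetic only: resolving ~435 of 1,865 tasks corresponds
# to a Pass@1 of roughly 23.3% (these counts are hypothetical).
example = [TaskResult(f"task-{i}", i < 435) for i in range(1865)]
print(f"Pass@1 = {pass_at_1(example):.1%}")  # ~23.3%
```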