SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

September 21, 2025
作者: Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, Brad Kenstler
cs.AI

Abstract

We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-BENCH [25], but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-BENCH. SWE-BENCH PRO contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into a public set with open access to problems sourced from 11 repositories, a held-out set of 12 repositories and a commercial set of 18 proprietary repositories where we have formal partnership agreements with early-stage startups. Problems in the held-out and the commercial set are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. In our evaluation of widely used coding models, under a unified scaffold, we observe that their performance on SWE-Bench PRO remains below 25% (Pass@1), with GPT-5 achieving the highest score to date at 23.3%. To better understand these limitations, we cluster the failure modes observed in the collected agent trajectories for a clearer characterization of the error patterns exhibited by current models. Overall, SWE-BENCH PRO provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents at a professional level.
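The abstract reports model performance as Pass@1. As a point of reference, the sketch below is a minimal, hypothetical Python illustration of how Pass@1 is commonly computed for a benchmark of this kind: one attempt per task, counted as resolved only if the generated patch passes the repository's tests. The `TaskResult` type and the example numbers are assumptions for illustration, not the paper's actual evaluation harness or data.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of a single attempt on one benchmark task (illustrative type)."""
    task_id: str
    resolved: bool  # True if the single attempted patch passed all tests


def pass_at_1(results: list[TaskResult]) -> float:
    """Fraction of tasks solved with exactly one attempt per task."""
    if not results:
        return 0.0
    return sum(r.resolved for r in results) / len(results)


# Illustrative arithmetic only: resolving ~435 of 1,865 tasks corresponds
# to a Pass@1 of roughly 23.3% (these counts are hypothetical).
example = [TaskResult(f"task-{i}", i < 435) for i in range(1865)]
print(f"Pass@1 = {pass_at_1(example):.1%}")  # ~23.3%
```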