SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Abstract

We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds on the best practices of SWE-Bench [25] but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-Bench. SWE-Bench Pro contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into three sets: a public set, with open access to problems sourced from 11 repositories; a held-out set of 12 repositories; and a commercial set of 18 proprietary repositories, for which we have formal partnership agreements with early-stage startups. Problems in the held-out and commercial sets are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may take a professional software engineer hours to days to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. In our evaluation of widely used coding models under a unified scaffold, performance on SWE-Bench Pro remains below 25% (Pass@1), with GPT-5 achieving the highest score to date at 23.3%. To better understand these limitations, we cluster the failure modes observed in the collected agent trajectories, giving a clearer characterization of the error patterns exhibited by current models. Overall, SWE-Bench Pro provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents at a professional level.

