SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

Abstract

Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on Python-based bug fixes. We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests across 11 languages. SWE-Bench++ turns GitHub pull requests (PRs) into reproducible, execution-based tasks via four stages: programmatic sourcing, environment synthesis, test oracle extraction, and quality assurance. A final hint-guided trajectory synthesis step converts instances that strong models fail on into training trajectories. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages. On a 1,782-instance subset of this benchmark, today's strongest models reach the following pass@10 scores: claude-sonnet-4.5 achieves 36.20%, gpt-5-2025-08-07 achieves 34.57%, gemini/gemini-2.5-pro achieves 24.92%, and gpt-4o achieves 16.89%. We further demonstrate the utility of our dataset by showing that fine-tuning on SWE-Bench++ instances yields measurable improvements on the SWE-bench Multilingual benchmark. SWE-Bench++ provides a scalable, multilingual benchmark for evaluating and improving repository-level code generation.

