SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

Abstract

Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline that addresses these challenges by integrating three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction; it employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need to write custom parsers by hand. Finally, we automate the fail2pass validation process using these reliable exit-code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances. For example, with GPT-4.1-mini, SWE-Builder constructs 269 valid instances at a cost of $0.045 per instance, while with Gemini-2.5-flash it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy when compared against manual inspection, and that our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.
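The exit-code-based grading and fail2pass validation described in the abstract can be illustrated with a minimal sketch. This is not the SWE-Factory implementation. It assumes the common convention that a test command exits with code 0 on success and with a non-zero code on failure, and the names `RunResult`, `run_tests`, and `is_fail2pass` are hypothetical, introduced only for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one test run, reduced to its process exit code."""
    exit_code: int

    @property
    def passed(self) -> bool:
        # Assumption: exit code 0 means the tests passed, anything else means failure.
        return self.exit_code == 0


def run_tests(command: list[str]) -> RunResult:
    """Run a test command and grade it by exit code only, without parsing its logs."""
    completed = subprocess.run(command, capture_output=True)
    return RunResult(exit_code=completed.returncode)


def is_fail2pass(before_patch: RunResult, after_patch: RunResult) -> bool:
    """A task instance is fail2pass if tests fail before the fix and pass after it."""
    return (not before_patch.passed) and after_patch.passed


if __name__ == "__main__":
    # Stand-ins for real test commands: one exits with 1, the other with 0.
    failing = run_tests([sys.executable, "-c", "import sys; sys.exit(1)"])
    passing = run_tests([sys.executable, "-c", "import sys; sys.exit(0)"])
    print(is_fail2pass(failing, passing))
```

Because the grade depends only on the exit code, the same check applies to any test runner or programming language that follows this convention, which is why no custom per-repository log parser is needed in this sketch.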
