SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
June 12, 2025
作者: Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, Zibin Zheng
cs.AI
Abstract
Constructing large-scale datasets for the GitHub issue resolution task is
crucial for both training and evaluating the software engineering capabilities
of Large Language Models (LLMs). However, the traditional process for creating
such benchmarks is notoriously challenging and labor-intensive, particularly in
the stages of setting up evaluation environments, grading test outcomes, and
validating task instances. In this paper, we propose SWE-Factory, an automated
pipeline that addresses these challenges through three core automated
components. First, we introduce SWE-Builder, a multi-agent system that
automates evaluation environment construction: four specialized agents work in
a collaborative, iterative loop, supported by an environment memory pool that
improves efficiency. Second, we introduce a standardized, exit-code-based
grading method that eliminates the need for manually writing custom parsers.
Finally, we automate the fail2pass validation process using these reliable exit
code signals. Experiments on 671 issues across four programming languages show
that our pipeline can effectively construct valid task instances; for example,
with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per
instance, while with Gemini-2.5-flash, it achieves comparable performance at
the lowest cost of $0.024 per instance. We also demonstrate that our
exit-code-based grading achieves 100% accuracy compared to manual inspection,
and our automated fail2pass validation reaches a precision of 0.92 and a recall
of 1.00. We hope our automated pipeline will accelerate the collection of
large-scale, high-quality GitHub issue resolution datasets for both training
and evaluation. Our code and datasets are released at
https://github.com/DeepSoftwareAnalytics/swe-factory.
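
The environment memory pool mentioned in the abstract can be pictured as a cache of known-good setups: once a working environment is built for one task instance, later instances from the same repository can reuse it instead of rebuilding from scratch. The sketch below is a minimal illustration under assumed names (`EnvironmentMemoryPool`, a `repo@version` key, and Dockerfile/eval-script artifacts are hypothetical); the paper's actual pooling and matching logic may differ.

```python
class EnvironmentMemoryPool:
    """Minimal sketch of an environment memory pool (assumed design).

    Successful setups (e.g. a Dockerfile plus an eval script) are cached
    under a repository/version key so that later, similar task instances
    can start from a known-good environment.
    """

    def __init__(self) -> None:
        self._pool: dict[str, dict[str, str]] = {}

    def store(self, repo: str, version: str, artifacts: dict[str, str]) -> None:
        # Remember the setup artifacts that worked for this repo version.
        self._pool[f"{repo}@{version}"] = artifacts

    def retrieve(self, repo: str, version: str) -> dict[str, str] | None:
        # Exact-key reuse here; a real system might also match nearby versions.
        return self._pool.get(f"{repo}@{version}")
```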
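Exit-code-based grading replaces per-framework log parsers with the test runner's exit status. A minimal Python sketch, assuming the usual runner convention that exit code 0 means every test passed (`grade_by_exit_code`, `test_cmd`, and the PASS/FAIL labels are illustrative, not the project's actual interface):

```python
import subprocess

def grade_by_exit_code(test_cmd: list[str], timeout: int = 1800) -> str:
    """Grade a test run purely by the runner's exit status (sketch)."""
    try:
        result = subprocess.run(test_cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "FAIL"  # a hung test run is graded as a failure
    # Most test runners exit 0 iff every selected test passed.
    return "PASS" if result.returncode == 0 else "FAIL"
```

Because the signal is the exit code rather than framework-specific log output, the same grader works unchanged across languages and test frameworks, which is what removes the need for hand-written parsers.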
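The fail2pass criterion builds on the same signal: a task instance is valid only if its tests fail on the buggy snapshot and pass once the gold patch is applied. A hedged sketch reusing the exit-code check above (`fail2pass_valid`, `gold_patch`, and the git-based apply/revert are assumptions, not the paper's exact procedure):

```python
import subprocess

def tests_pass(test_cmd: list[str], repo_dir: str) -> bool:
    """True iff the test command exits 0 (the exit-code grading above)."""
    return subprocess.run(test_cmd, cwd=repo_dir, capture_output=True).returncode == 0

def fail2pass_valid(repo_dir: str, gold_patch: str, test_cmd: list[str]) -> bool:
    """Sketch of fail2pass validation for one candidate task instance."""
    # 1) The tests must FAIL on the unfixed snapshot of the repository.
    fails_before = not tests_pass(test_cmd, repo_dir)
    # 2) After applying the developer's gold patch, the same tests must PASS.
    subprocess.run(["git", "apply", gold_patch], cwd=repo_dir, check=True)
    passes_after = tests_pass(test_cmd, repo_dir)
    # Restore the repository for any later checks.
    subprocess.run(["git", "apply", "-R", gold_patch], cwd=repo_dir, check=True)
    return fails_before and passes_after
```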