SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

June 12, 2025
Authors: Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, Zibin Zheng
cs.AI

Abstract

Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline that addresses these challenges through three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction: four specialized agents work in a collaborative, iterative loop, backed by an environment memory pool that improves efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need to manually write custom parsers. Finally, we automate the fail2pass validation process using these reliable exit-code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy against manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope this automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.
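
The two grading mechanisms named in the abstract are simple to picture in code. Below is a minimal, hypothetical sketch in Python of exit-code-based grading and fail2pass validation; the names (TaskInstance, tests_pass, fail2pass_valid, run_tests.sh) are illustrative assumptions, not the SWE-Factory API, and the sketch assumes each task ships a standardized test command whose exit code alone encodes pass or fail.

```python
# Minimal, hypothetical sketch (not the SWE-Factory API): exit-code-based
# grading and fail2pass validation for a single task instance.
import subprocess
from dataclasses import dataclass

@dataclass
class TaskInstance:
    repo_dir: str        # repository checked out at the issue's base commit
    test_cmd: list[str]  # standardized test command, e.g. ["bash", "run_tests.sh"]
    gold_patch: str      # path to the reference patch that resolves the issue

def tests_pass(task: TaskInstance) -> bool:
    # Exit-code-based grading: exit code 0 means the test suite passed,
    # so no repository-specific log parser is needed.
    result = subprocess.run(task.test_cmd, cwd=task.repo_dir)
    return result.returncode == 0

def fail2pass_valid(task: TaskInstance) -> bool:
    # A valid instance must fail before the gold patch and pass after it.
    if tests_pass(task):
        return False  # tests already pass, so the issue is not exercised
    subprocess.run(["git", "apply", task.gold_patch],
                   cwd=task.repo_dir, check=True)
    return tests_pass(task)
```

In the full pipeline these checks would run inside the evaluation environment that SWE-Builder constructs, but the validity criterion is the same: tests fail before the gold patch and pass after it.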