SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

Abstract

Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline that addresses these challenges by integrating three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction. It employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to improve efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need to write custom parsers by hand. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances. For example, with GPT-4.1-mini, SWE-Builder constructs 269 valid instances at a cost of $0.045 per instance, while with Gemini-2.5-flash it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy when compared against manual inspection, and that our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.
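The exit-code-based grading and fail2pass validation mentioned in the abstract can be illustrated with a minimal sketch. This is not the SWE-Factory implementation: the names `GradeResult`, `grade_by_exit_code`, and `is_fail2pass` are hypothetical, and the sketch assumes only that a test command signals success with exit code 0 and failure with any other code, and that an instance counts as fail2pass when its tests fail before the fix is applied and pass afterwards.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class GradeResult:
    """Outcome of one test run, derived only from the process exit code."""
    exit_code: int
    passed: bool


def grade_by_exit_code(cmd: list[str]) -> GradeResult:
    """Run a test command and grade it without parsing its log output.

    Assumption: exit code 0 means the tests passed; any other code means failure.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return GradeResult(exit_code=proc.returncode, passed=proc.returncode == 0)


def is_fail2pass(before_fix: GradeResult, after_fix: GradeResult) -> bool:
    """Return True when tests fail before the fix and pass after it."""
    return (not before_fix.passed) and after_fix.passed


if __name__ == "__main__":
    # Stand-in commands that simulate a failing and a passing test run.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    passing = [sys.executable, "-c", "import sys; sys.exit(0)"]

    before = grade_by_exit_code(failing)
    after = grade_by_exit_code(passing)
    print("fail2pass:", is_fail2pass(before, after))
```

Because the grade depends only on the exit code, the same check would apply to any test runner that follows the zero-on-success convention, which is what removes the need for a log parser per project in this sketch.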
