Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Abstract

Issue resolving is the task of modifying a codebase to produce a patch that addresses a given issue. However, existing benchmarks such as SWE-bench focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce Multi-SWE-bench, a multilingual issue-resolving benchmark covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances, carefully annotated from 2,456 candidates by 68 expert annotators, so that the benchmark provides accurate and reliable evaluation. Using Multi-SWE-bench, we evaluate a series of state-of-the-art models with three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch Multi-SWE-RL, an open-source community aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, to encourage the open-source community to continuously contribute to and expand the dataset. We envision Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.