Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Abstract

Issue resolving is the task of modifying a codebase to produce a patch that addresses a given issue. However, existing benchmarks such as SWE-bench focus almost exclusively on Python, which makes them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce Multi-SWE-bench, a multilingual issue-resolving benchmark covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains 1,632 high-quality instances, carefully curated from 2,456 candidates by 68 expert annotators, so that the benchmark provides an accurate and reliable evaluation. Using Multi-SWE-bench, we evaluate a series of state-of-the-art models with three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch Multi-SWE-RL, an open-source community aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release 4,723 well-structured instances spanning seven programming languages, laying a foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline along with detailed tutorials, encouraging the open-source community to continuously contribute to and expand the dataset. We envision Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.