SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

Abstract

The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) for enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and follow-up work focus primarily on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data, that is, the record of a software project's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified, a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported to date for medium-sized (<100B) LLMs, and it is even comparable to leading proprietary LLMs such as GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL also develops generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks (function coding, library use, code reasoning, mathematics, and general language understanding), whereas a supervised-finetuning baseline leads to performance degradation on average. Overall, SWE-RL opens up a new direction for improving the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
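
The rule-based reward mentioned above can be illustrated with a small sketch. The abstract only says that the reward is a similarity score between the ground-truth and the LLM-generated solution; it does not say how that score is computed. The code below assumes, purely for illustration, that both solutions are patch strings and that similarity is measured with `difflib.SequenceMatcher` from the Python standard library. The function name `similarity_reward` and the sample patches are hypothetical.

```python
import difflib


def similarity_reward(reference_patch: str, generated_patch: str) -> float:
    """Return a similarity score in [0, 1] between two patch strings.

    Assumption: character-level sequence similarity stands in for the
    "similarity score" named in the abstract; the actual metric may differ.
    """
    matcher = difflib.SequenceMatcher(
        None, reference_patch, generated_patch, autojunk=False
    )
    return matcher.ratio()


# Hypothetical patches, written in a simplified diff-like form.
reference = "-    return a - b\n+    return a + b\n"
near_miss = "-    return a - b\n+    return b + a\n"

print(similarity_reward(reference, reference))  # identical patches
print(similarity_reward(reference, near_miss))  # close but not identical
```

A score of this kind needs no test execution or learned reward model, which may be what makes it lightweight enough to apply to large volumes of software evolution data. A near-miss patch receives partial credit rather than zero.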
