SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
February 25, 2025
作者: Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, Sida I. Wang
cs.AI
Abstract
The recent DeepSeek-R1 release has demonstrated the immense potential of
reinforcement learning (RL) in enhancing the general reasoning capabilities of
large language models (LLMs). While DeepSeek-R1 and other follow-up work
primarily focus on applying RL to competitive coding and math problems, this
paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for
real-world software engineering. Leveraging a lightweight rule-based reward
(e.g., the similarity score between ground-truth and LLM-generated solutions),
SWE-RL enables LLMs to autonomously recover a developer's reasoning processes
and solutions by learning from extensive open-source software evolution data --
the record of a software's entire lifecycle, including its code snapshots, code
changes, and events such as issues and pull requests. Trained on top of Llama
3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve
rate on SWE-bench Verified -- a human-verified collection of real-world GitHub
issues. To our knowledge, this is the best performance reported for
medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs
like GPT-4o. Surprisingly, despite performing RL solely on software evolution
data, Llama3-SWE-RL exhibits emergent generalized reasoning skills. For
example, it shows improved results on five out-of-domain tasks, namely,
function coding, library use, code reasoning, mathematics, and general language
understanding, whereas a supervised-finetuning baseline leads to performance
degradation on average. Overall, SWE-RL opens up a new direction to
improve the reasoning capabilities of LLMs through reinforcement learning on
massive software engineering data.
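To make the "lightweight rule-based reward" concrete, the sketch below scores a model-generated patch against the ground-truth patch with a continuous similarity measure, as the abstract describes. This is a minimal illustration rather than the paper's exact implementation: the choice of Python's difflib.SequenceMatcher, the function name similarity_reward, and the example patches are all assumptions for illustration.

```python
import difflib

def similarity_reward(oracle_patch: str, predicted_patch: str) -> float:
    """Rule-based reward sketch (hypothetical): similarity in [0.0, 1.0]
    between the ground-truth (oracle) patch and the LLM-generated patch,
    where 1.0 means an exact match."""
    return difflib.SequenceMatcher(None, oracle_patch, predicted_patch).ratio()

# Example: a near-miss fix earns partial credit instead of a zero reward,
# giving the RL policy a dense learning signal.
oracle = "-    return a - b\n+    return a + b\n"
predicted = "-    return a - b\n+    return b + a\n"
print(f"reward = {similarity_reward(oracle, predicted):.2f}")
```

A continuous score like this rewards partially correct patches, which plausibly allows the policy to learn from massive issue-and-pull-request data without executing tests for every rollout.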