SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

Abstract

The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data: the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified, a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, and it is comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has also developed generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline leads to performance degradation on average. Overall, SWE-RL opens up a new direction for improving the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
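
The rule-based reward mentioned in the abstract (a similarity score between the ground-truth and the LLM-generated solution) can be illustrated with a minimal sketch. The abstract does not say how the similarity is computed, so this example assumes Python's standard difflib.SequenceMatcher as the measure; the function name similarity_reward and the sample patches are hypothetical and serve only as illustration.

```python
from difflib import SequenceMatcher


def similarity_reward(reference_patch: str, generated_patch: str) -> float:
    """Return a similarity score in [0, 1] between two patch strings.

    Assumption: SequenceMatcher.ratio() stands in for the similarity
    measure, which the abstract leaves unspecified.
    """
    return SequenceMatcher(None, reference_patch, generated_patch).ratio()


# Hypothetical patches, invented for this illustration.
reference = "-    return a - b\n+    return a + b\n"
exact = reference
wrong = "-    return a - b\n+    return a * b\n"

print(similarity_reward(reference, exact))  # identical patches score highest
print(similarity_reward(reference, wrong))  # a near miss scores slightly lower
```

A reward of this kind needs no test execution or learned reward model, which is what makes it lightweight: it only compares text against the known developer solution.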
