SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning

Abstract

Mainstream issue-resolving frameworks rely predominantly on commercial models, which leads to high costs and privacy concerns. Existing training approaches for issue resolving generalize poorly and fail to fully leverage open-source development resources. We propose Subtask-oriented Reinforced Fine-Tuning (SoRFT), a novel training approach that enhances the issue-resolving capability of large language models (LLMs). SoRFT decomposes issue resolving into structured subtasks: file localization, function localization, line localization, and code edit generation. It consists of two training stages: (1) rejection-sampled supervised fine-tuning, in which Chain-of-Thought (CoT) data is filtered against the ground truth before the LLM is fine-tuned on it, and (2) rule-based reinforcement learning, which uses Proximal Policy Optimization (PPO) with ground-truth-based rewards. We evaluate SoRFT-trained models on SWE-Bench Verified and SWE-Bench Lite and achieve state-of-the-art (SOTA) performance among open-source models (e.g., SoRFT-Qwen-7B resolves 21.4% of issues on SWE-Bench Verified). The experimental results show that SoRFT significantly improves issue-resolving performance, improves model generalization, and provides a cost-efficient alternative to commercial models.
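Both training stages described above hinge on scoring a model output against the ground truth: stage (1) keeps only CoT samples whose answers agree with it, and stage (2) turns that agreement into a reward for PPO. The abstract does not specify the scoring rule, so the sketch below is illustrative only. It assumes a localization subtask (e.g., file localization) whose answer is a set of locations, and it uses an F1 score between predicted and ground-truth locations as a stand-in rule-based reward. The names `Sample`, `localization_reward`, and `rejection_sample` are hypothetical and are not taken from SoRFT.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical container for one sampled model output."""
    subtask: str        # e.g. "file", "function", or "line" localization
    cot: str            # the chain-of-thought text produced by the model
    answer: set[str]    # predicted locations, e.g. file paths


def localization_reward(predicted: set[str], ground_truth: set[str]) -> float:
    """Assumed rule-based reward: F1 overlap between predicted and true locations.

    This is a stand-in scoring rule, not necessarily the reward SoRFT uses.
    """
    if not predicted or not ground_truth:
        return 0.0
    hits = len(predicted & ground_truth)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


def rejection_sample(samples: list[Sample], ground_truth: set[str],
                     threshold: float = 1.0) -> list[Sample]:
    """Keep only samples whose answer scores at least `threshold` against the ground truth.

    With the default threshold of 1.0, only exact matches survive and become SFT data.
    """
    return [s for s in samples
            if localization_reward(s.answer, ground_truth) >= threshold]


if __name__ == "__main__":
    truth = {"src/app.py"}
    candidates = [
        Sample("file", "The traceback points to app.py ...", {"src/app.py"}),
        Sample("file", "Could be app.py or utils.py ...", {"src/app.py", "src/utils.py"}),
        Sample("file", "Probably a config issue ...", {"setup.cfg"}),
    ]
    for c in candidates:
        print(sorted(c.answer), round(localization_reward(c.answer, truth), 3))
    print(len(rejection_sample(candidates, truth)))
```

In this sketch the same function serves both stages: a strict threshold turns it into a rejection-sampling filter for stage (1), while its graded value (partial credit for over-broad predictions) could serve as the scalar reward fed to a PPO trainer in stage (2).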
