SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning

Abstract

Mainstream issue-resolving frameworks rely predominantly on commercial models, which leads to high costs and privacy concerns. Existing training approaches for issue resolving generalize poorly and fail to fully leverage open-source development resources. We propose Subtask-oriented Reinforced Fine-Tuning (SoRFT), a novel training approach that enhances the issue-resolving capability of large language models (LLMs). SoRFT decomposes issue resolving into structured subtasks: file localization, function localization, line localization, and code edit generation. It consists of two training stages: (1) rejection-sampled supervised fine-tuning, in which Chain of Thought (CoT) data is filtered using the ground truth before the LLM is fine-tuned, and (2) rule-based reinforcement learning, which uses PPO with ground-truth-based rewards. We evaluate SoRFT-trained models on SWE-Bench Verified and SWE-Bench Lite, where they achieve state-of-the-art (SOTA) performance among open-source models (e.g., SoRFT-Qwen-7B resolves 21.4% of the issues on SWE-Bench Verified). The experimental results demonstrate that SoRFT significantly enhances issue-resolving performance, improves model generalization, and provides a cost-efficient alternative to commercial models.
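To make the two training stages more concrete, the following is a minimal sketch of how a ground-truth-based score could serve both as the rejection-sampling filter and as a rule-based reward for the file localization subtask. The abstract does not specify the scoring rule, so the F1-style overlap, the function names `overlap_reward` and `rejection_sample`, the `threshold` parameter, and the sample dictionary layout are all illustrative assumptions rather than the authors' implementation.

```python
from typing import Iterable, List, Set


def overlap_reward(predicted: Iterable[str], ground_truth: Iterable[str]) -> float:
    """F1-style overlap between predicted and ground-truth items.

    Assumption: the real reward rule is not given in the abstract; this is
    only one plausible rule-based score in the range [0, 1].
    """
    pred: Set[str] = set(predicted)
    gold: Set[str] = set(ground_truth)
    if not pred or not gold:
        return 0.0
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def rejection_sample(
    samples: List[dict], ground_truth: Iterable[str], threshold: float = 0.0
) -> List[dict]:
    """Keep only CoT samples whose final answer scores above the threshold."""
    gold = list(ground_truth)
    return [s for s in samples if overlap_reward(s["answer"], gold) > threshold]


# Hypothetical sampled outputs for the file localization subtask.
samples = [
    {"cot": "...", "answer": ["src/parser.py", "src/utils.py"]},
    {"cot": "...", "answer": ["docs/index.md"]},
]
gold_files = ["src/parser.py"]

# Stage 1 (sketch): filter CoT data against the ground truth before SFT.
kept = rejection_sample(samples, gold_files)

# Stage 2 (sketch): the same score could be fed to PPO as a scalar reward.
rewards = [overlap_reward(s["answer"], gold_files) for s in samples]
```

In this sketch the same scoring function is reused across both stages, which is one simple way to keep the filter and the reward consistent; the other subtasks (function localization, line localization, code edit generation) would need their own comparison rules.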

