On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

Abstract

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often risk disrupting established model patterns and overfitting to expert data. To address this, we present a novel investigation into a unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of how off-policy expert data influences training at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration. It then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
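
The dual-control mechanism described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: the linear decay schedule, the default values `mu_start` and `mu_end`, the token weight `p * (1 - p)`, and all function names are hypothetical and are not taken from the released implementation. The sketch only shows the general shape of the idea: a global coefficient `mu` shifts weight from the SFT term to the RL term over training, and a per-token weight scales how much each expert token contributes.

```python
import math


def global_coefficient(step, total_steps, mu_start=0.9, mu_end=0.05):
    """Global SFT weight mu, decayed linearly over training.

    The linear schedule and the default endpoints are assumptions
    chosen for illustration only.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return mu_start + (mu_end - mu_start) * frac


def token_weight(p):
    """Hypothetical token-wise weight for an expert token.

    p is the probability the current policy assigns to the expert token.
    p * (1 - p) is small when the policy is already confident (p near 1)
    and when the token is very unlikely under the policy (p near 0).
    """
    return p * (1.0 - p)


def weighted_sft_loss(expert_token_probs):
    """Mean token-weighted negative log-likelihood over expert tokens."""
    terms = [-token_weight(p) * math.log(p) for p in expert_token_probs]
    return sum(terms) / len(terms)


def mixed_loss(rl_loss, expert_token_probs, step, total_steps):
    """Combine an on-policy RL loss with the weighted SFT auxiliary loss."""
    mu = global_coefficient(step, total_steps)
    return (1.0 - mu) * rl_loss + mu * weighted_sft_loss(expert_token_probs)


if __name__ == "__main__":
    probs = [0.9, 0.5, 0.1]  # made-up policy probabilities of expert tokens
    for step in (0, 50, 100):
        print(step, mixed_loss(1.0, probs, step, 100))
```

Under these assumptions, early steps are dominated by the imitation term and late steps by the RL term, while the token weight keeps any single expert token from dominating the update.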
