
On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

August 15, 2025
作者: Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou
cs.AI

Abstract

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
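The abstract describes CHORD's dual-control mechanism only at a high level, so the following is a minimal, hypothetical sketch of how such an objective might look: a global coefficient mu interpolates between an on-policy RL surrogate and a token-wise weighted SFT term over expert data. The function name, tensor shapes, and the illustrative weight p * (1 - p) are assumptions for exposition, not the paper's definitions; the released implementation at the linked repository contains the actual objective.

```python
import torch

def chord_style_loss(policy_logps, advantages, expert_logps, mu):
    """Hypothetical sketch of a CHORD-style objective (not the official one).

    policy_logps : log-probs of on-policy sampled tokens, shape [B_on, T_on]
    advantages   : per-sequence advantages for on-policy data, shape [B_on, 1]
    expert_logps : log-probs the current policy assigns to off-policy expert
                   tokens, shape [B_off, T_off]
    mu           : global coefficient in [0, 1], typically annealed from
                   imitation (mu near 1) toward exploration (mu near 0)
    """
    # On-policy policy-gradient surrogate (REINFORCE-style).
    rl_loss = -(advantages * policy_logps).mean()

    # Token-wise weight on expert tokens. Illustrative choice p * (1 - p):
    # down-weights tokens the policy already masters or finds implausible.
    expert_probs = expert_logps.detach().exp()
    token_weight = expert_probs * (1.0 - expert_probs)

    # Weighted SFT (negative log-likelihood) term over expert tokens.
    sft_loss = -(token_weight * expert_logps).sum() / token_weight.sum().clamp(min=1e-8)

    # Global coefficient blends off-policy imitation with on-policy exploration.
    return (1.0 - mu) * rl_loss + mu * sft_loss
```

In this sketch, annealing mu across training steps would reproduce the holistic transition the abstract describes, while the token-wise weight provides the granular control that limits disruption from off-policy data.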