On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
August 15, 2025
Authors: Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou
cs.AI
Abstract
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two
prominent post-training paradigms for refining the capabilities and aligning
the behavior of Large Language Models (LLMs). Existing approaches that
integrate SFT and RL often face the risk of disrupting established model
patterns and inducing overfitting to expert data. To address this, we present a
novel investigation into the unified view of SFT and RL through an off-policy
versus on-policy lens. We propose CHORD, a framework for the Controllable
Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic
Weighting, which reframes SFT not as a separate stage but as a dynamically
weighted auxiliary objective within the on-policy RL process. Based on an
analysis of off-policy expert data's influence at both holistic and granular
levels, we incorporate a dual-control mechanism in CHORD. Specifically, the
framework first employs a global coefficient to holistically guide the
transition from off-policy imitation to on-policy exploration, and then applies
a token-wise weighting function that enables granular learning from expert
tokens, which preserves on-policy exploration and mitigates disruption from
off-policy data. We conduct extensive experiments on widely used benchmarks,
providing empirical evidence that CHORD achieves a stable and efficient
learning process. By effectively harmonizing off-policy expert data with
on-policy exploration, CHORD demonstrates significant improvements over
baselines. We release the implementation at
https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to
inspire further research.
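The abstract describes a dual-control mechanism: a global coefficient that steers the overall balance between off-policy imitation and on-policy exploration, and a token-wise weighting function that modulates how much each expert token contributes. The following PyTorch sketch is a minimal illustration of what such an objective could look like; the function names (chord_loss, token_weight, mu_schedule), the linear decay schedule, and the specific token weight w = p(1 - p) are illustrative assumptions, not the paper's exact formulation (see the linked repository for the authors' implementation).

```python
# Illustrative sketch only: blends an on-policy RL loss with a token-weighted
# SFT term on off-policy expert data, controlled by a global coefficient `mu`.
import torch


def token_weight(expert_token_probs: torch.Tensor) -> torch.Tensor:
    # Assumed token-wise weighting: emphasize expert tokens the policy is
    # uncertain about (w = p * (1 - p)), down-weighting tokens it already
    # predicts confidently or assigns near-zero probability.
    return expert_token_probs * (1.0 - expert_token_probs)


def chord_loss(rl_loss: torch.Tensor,
               expert_logits: torch.Tensor,   # [B, T, V] policy logits on expert sequences
               expert_tokens: torch.Tensor,   # [B, T] expert token ids (long)
               expert_mask: torch.Tensor,     # [B, T] 1 for valid tokens, 0 for padding
               mu: float) -> torch.Tensor:
    # Token-level log-likelihood of the expert tokens under the current policy.
    log_probs = torch.log_softmax(expert_logits, dim=-1)
    token_logp = log_probs.gather(-1, expert_tokens.unsqueeze(-1)).squeeze(-1)
    probs = token_logp.exp().detach()

    # Granular control: per-token weights on the auxiliary SFT term.
    weights = token_weight(probs) * expert_mask
    sft_loss = -(weights * token_logp).sum() / weights.sum().clamp_min(1.0)

    # Holistic control: `mu` interpolates between off-policy imitation
    # (mu close to 1) and on-policy exploration (mu close to 0).
    return (1.0 - mu) * rl_loss + mu * sft_loss


def mu_schedule(step: int, total_steps: int,
                mu_max: float = 0.9, mu_min: float = 0.05) -> float:
    # Assumed linear decay of the global coefficient over training, so that
    # learning transitions from expert imitation toward on-policy exploration.
    frac = min(step / max(total_steps, 1), 1.0)
    return mu_max + frac * (mu_min - mu_max)
```

In this sketch, rl_loss stands in for whatever on-policy policy-gradient objective is being optimized, and chord_loss simply adds the weighted expert term to it; the key idea mirrored from the abstract is that the SFT signal is an auxiliary, dynamically weighted objective inside the RL step rather than a separate training stage.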