On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

Abstract

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
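
The dual-control mechanism described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the linear decay schedule for the global coefficient (`global_mu`), the token-wise weight `p * (1 - p)` (`token_weight`), and the convex combination of the two losses (`mixed_loss`) are hypothetical choices, not the released implementation.

```python
import math


def global_mu(step, total_steps, mu_start=0.9, mu_end=0.05):
    """Hypothetical global coefficient: decays linearly from mu_start to mu_end.

    A large value early favors off-policy imitation of expert data;
    a small value later favors on-policy exploration.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return mu_start + (mu_end - mu_start) * frac


def token_weight(p):
    """Hypothetical token-wise weight for an expert token with policy probability p.

    p * (1 - p) is small both for tokens the policy already predicts (p near 1)
    and for tokens it finds very unlikely (p near 0), so neither dominates.
    """
    return p * (1.0 - p)


def weighted_sft_loss(expert_token_probs):
    """Token-weighted negative log-likelihood over expert tokens.

    Each probability must lie in (0, 1]. In a gradient-based framework the
    weight would typically be treated as a constant (no gradient through it).
    """
    losses = [-token_weight(p) * math.log(p) for p in expert_token_probs]
    return sum(losses) / len(losses)


def mixed_loss(rl_loss, expert_token_probs, step, total_steps):
    """Assumed combination: (1 - mu) * on-policy RL loss + mu * weighted SFT loss."""
    mu = global_mu(step, total_steps)
    return (1.0 - mu) * rl_loss + mu * weighted_sft_loss(expert_token_probs)


if __name__ == "__main__":
    probs = [0.9, 0.5, 0.1]  # made-up policy probabilities of three expert tokens
    for step in (0, 50, 100):
        print(step, round(global_mu(step, 100), 3),
              round(mixed_loss(1.0, probs, step, 100), 4))
```

In this sketch the SFT term acts as an auxiliary objective inside the RL update rather than a separate stage: as `step` grows, `global_mu` shrinks and the on-policy loss dominates, while `token_weight` limits how strongly any single expert token can pull on the policy.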

