DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Abstract

Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning (RL) handles multi-turn dynamics effectively but is prohibitively expensive, because it must generate full correction trajectories at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To address this, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization in three steps: it samples offline interaction trajectories from a fixed reference policy, derives return-based importance weights, and optimizes the policy via weighted SFT on the resulting dataset. Empirically, DRIFT matches or exceeds the performance of multi-turn RL baselines while retaining the training efficiency and simplicity of standard SFT. Code is available at https://github.com/2020-qqtcg/DRIFT.
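The three steps named in the abstract (offline sampling, return-based weighting, weighted SFT) can be illustrated with a minimal sketch. It is not taken from the DRIFT repository. It assumes the textbook closed form for KL-regularized RL, in which the optimal policy is proportional to the reference policy times exp(R / beta), so that trajectories sampled from the reference policy receive weights proportional to exp(R / beta). The function names, the temperature `beta`, the normalization over a group of trajectories, and the example numbers are all hypothetical; the weighting DRIFT actually uses may differ.

```python
import math


def importance_weights(returns, beta=1.0):
    """Self-normalized weights w_i proportional to exp(R_i / beta).

    Assumption: weights follow the standard KL-regularized RL closed form and
    are normalized over one group of trajectories from the reference policy.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if not returns:
        raise ValueError("returns must be non-empty")
    # Subtract the maximum return before exponentiating for numerical stability.
    max_return = max(returns)
    exps = [math.exp((r - max_return) / beta) for r in returns]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sft_loss(log_probs, weights):
    """Weighted negative log-likelihood: -sum_i w_i * log pi_theta(y_i | x)."""
    if len(log_probs) != len(weights):
        raise ValueError("log_probs and weights must have the same length")
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Hypothetical data: three offline trajectories for one prompt, with their
# returns and the current policy's sequence log-probabilities.
returns = [1.0, 0.0, 0.5]
log_probs = [-2.0, -1.0, -3.0]

weights = importance_weights(returns, beta=0.5)
loss = weighted_sft_loss(log_probs, weights)
```

In this sketch the trajectories and their weights are computed once, before training, so each update only needs the policy's log-probabilities on fixed data, which is what keeps the cost close to that of ordinary SFT. A smaller `beta` concentrates the weight on the highest-return trajectories, while a larger `beta` approaches uniform weighting, that is, plain SFT on the sampled data.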