DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Abstract

Large language models are increasingly deployed in multi-turn interactive settings where users or environments iteratively provide lightweight feedback. Optimizing such behavior presents a sharp practical dilemma: online reinforcement learning (RL) handles multi-turn dynamics effectively but is prohibitively expensive, because every update requires generating full correction trajectories, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To address this, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, DRIFT matches or exceeds the performance of multi-turn RL baselines while retaining the training efficiency and simplicity of standard SFT. Code is available at https://github.com/2020-qqtcg/DRIFT.
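
To make the "importance-weighted supervised learning" step concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard closed-form result that the KL-regularized optimum reweights the reference policy by exp(R / beta), so trajectories sampled from the reference policy receive weights proportional to exp(R / beta). The function names, the temperature `beta`, the batch-level self-normalization, and the toy numbers are all illustrative assumptions; the abstract does not specify the exact form of the weights.

```python
import math


def importance_weights(returns, beta=1.0):
    """Self-normalized weights w_i proportional to exp(R_i / beta).

    Assumption: exponential return weighting with temperature beta;
    the paper's exact weighting scheme may differ.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Subtract the max return before exponentiating for numerical stability.
    max_return = max(returns)
    unnormalized = [math.exp((r - max_return) / beta) for r in returns]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_sft_loss(trajectory_logprobs, weights):
    """Weighted negative log-likelihood over offline trajectories.

    trajectory_logprobs[i] is log pi_theta(trajectory_i), i.e. the sum of
    token log-probabilities of the model's turns in trajectory i.
    """
    if len(trajectory_logprobs) != len(weights):
        raise ValueError("need one weight per trajectory")
    return -sum(w * lp for w, lp in zip(weights, trajectory_logprobs))


# Toy offline batch, as if sampled once from a fixed reference policy.
returns = [0.0, 1.0, 1.0]          # hypothetical per-trajectory returns
logprobs = [-12.0, -9.5, -11.0]    # hypothetical log pi_theta values
weights = importance_weights(returns, beta=0.5)
loss = weighted_sft_loss(logprobs, weights)
```

In this sketch a smaller `beta` concentrates the weight on high-return trajectories, while a very large `beta` approaches uniform weights, which recovers plain SFT on the offline data. Because the rollouts are collected once from the reference policy, the loss can be minimized with an ordinary SFT training loop.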