DRIFT: Efficient Multi-Turn Optimization with Decoupled Rollouts and Importance-Weighted Fine-Tuning

Abstract

Large language models are increasingly deployed in multi-turn interactive settings where users or environments iteratively provide lightweight feedback. Optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning handles multi-turn dynamics effectively but is prohibitively expensive because it must generate full correction trajectories at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To address this, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.
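The weighting step described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the weights, so the sketch assumes the standard closed-form solution of a KL-regularized objective, in which the optimal policy is proportional to the reference policy times exp(R / beta). Under that assumption, trajectories sampled from the reference policy receive self-normalized weights proportional to exp(R / beta). The function names, the temperature `beta`, and the toy numbers are hypothetical and are not taken from the DRIFT codebase.

```python
import math


def importance_weights(returns, beta=1.0):
    """Self-normalized weights w_i proportional to exp(R_i / beta).

    Assumption: this exponential form follows from the generic
    KL-regularized RL solution; DRIFT's actual weights may differ.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    scaled = [r / beta for r in returns]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sft_loss(seq_logprobs, weights):
    """Weighted negative log-likelihood: -sum_i w_i * log pi_theta(y_i | x_i)."""
    if len(seq_logprobs) != len(weights):
        raise ValueError("seq_logprobs and weights must have equal length")
    return -sum(w * lp for w, lp in zip(weights, seq_logprobs))


# Toy data (hypothetical): three offline trajectories sampled from the
# reference policy, with their returns and their log-probabilities under
# the policy being trained.
returns = [1.0, 0.0, 0.0]
seq_logprobs = [-2.0, -1.0, -3.0]

weights = importance_weights(returns, beta=1.0)
loss = weighted_sft_loss(seq_logprobs, weights)
```

In this sketch a smaller `beta` concentrates the weight on high-return trajectories, while a large `beta` approaches uniform weights, which recovers plain SFT on the offline data. Because the weights depend only on the fixed offline dataset, they can be computed once before training begins.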