DRIFT: Efficient Multi-Turn Optimization via Decoupled Rollouts and Importance-Weighted Fine-Tuning

Abstract

Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning handles multi-turn dynamics effectively but is prohibitively expensive, because every update requires generating full correction trajectories, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To resolve this dilemma, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization in three steps: it samples offline interaction trajectories from a fixed reference policy, derives return-based importance weights, and optimizes the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.
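
As an illustration of the weighting step, the following is a minimal sketch and not the released implementation. It assumes the weights take the standard exponentiated-return form that arises from KL-regularized objectives, w_i proportional to exp(R_i / beta), self-normalized over trajectories sampled from the reference policy. The function names, the temperature `beta`, and the toy numbers are hypothetical; the abstract does not specify the exact weighting scheme DRIFT uses.

```python
import math

def importance_weights(returns, beta=1.0):
    """Self-normalized weights w_i proportional to exp(R_i / beta).

    Assumed form (not taken from the paper): the usual closed-form
    solution of a KL-regularized objective with temperature beta.
    """
    m = max(returns)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / beta) for r in returns]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sft_loss(log_probs, weights):
    """Weighted negative log-likelihood over offline trajectories.

    log_probs[i] is the trainable policy's log-probability of trajectory i,
    which was sampled once from the fixed reference policy.
    """
    return -sum(w * lp for w, lp in zip(weights, log_probs))

# Toy data: three offline trajectories with hypothetical returns and log-probs.
returns = [1.0, 0.0, 0.5]
log_probs = [-2.0, -1.0, -3.0]

weights = importance_weights(returns, beta=0.5)
loss = weighted_sft_loss(log_probs, weights)
```

Under this assumption, a smaller `beta` concentrates weight on high-return trajectories, while a large `beta` approaches uniform weighting, which recovers ordinary SFT on the sampled data.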