DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

Abstract

Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they either rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce DRIFT (Dissatisfaction-Refined Iterative preFerence Training), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on the real-world WildFeedback dataset and the synthetic UltraFeedback dataset achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. At larger scales, the improvements are particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids gradient degeneration. These results show that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal. The code and data are available at https://github.com/cacayaya/DRIFT.git.
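
The pairing idea described in the abstract (DSAT responses as fixed negatives, positives drawn fresh from the current policy) can be illustrated with a minimal sketch. This is not the authors' implementation: the record fields `prompt` and `dsat_response`, the helper `sample_from_policy`, and the use of a DPO-style loss with `beta=0.1` are all assumptions made for illustration. The abstract does not state the training objective; a DPO-style loss is used here only because iterative DPO is named as a baseline.

```python
import math
from typing import Callable, Dict, List


def build_drift_pairs(
    dsat_records: List[Dict[str, str]],
    sample_from_policy: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each logged DSAT response (rejected) with a fresh policy sample (chosen).

    `dsat_records` and `sample_from_policy` are hypothetical names; the field
    layout is an assumption, not the format used in the DRIFT repository.
    """
    pairs = []
    for record in dsat_records:
        prompt = record["prompt"]
        pairs.append(
            {
                "prompt": prompt,
                # Positive: sampled from the current (evolving) policy.
                "chosen": sample_from_policy(prompt),
                # Negative: the response the user was dissatisfied with.
                "rejected": record["dsat_response"],
            }
        )
    return pairs


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """DPO-style loss for one pair: -log(sigmoid(margin)), computed stably."""
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In an iterative setup, `build_drift_pairs` would be called again at each round with the updated policy, so the chosen responses change while the DSAT negatives stay fixed.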
