DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning
September 27, 2025
Authors: Yifan Wang, Bolian Li, Junlin Wu, Zhaoxuan Tan, Zheli Liu, Ruqi Zhang, Ananth Grama, Qingkai Zeng
cs.AI
Abstract
Real-world large language model deployments (e.g., conversational AI systems,
code generation assistants) naturally generate abundant implicit user
dissatisfaction (DSAT) signals, as users iterate toward better answers through
refinements, corrections, and expressed preferences, while explicit
satisfaction (SAT) feedback is scarce. Existing preference learning approaches
are poorly aligned with this data profile, as they rely on costly human
annotations or assume plentiful positive responses. In this paper, we introduce
DRIFT (Dissatisfaction-Refined Iterative
preFerence Training), which anchors training on real-world
DSAT signals and samples positives dynamically from the evolving policy.
Empirically, DRIFT models trained on real-world WildFeedback datasets
and synthetic UltraFeedback datasets achieve up to +6.23% (7B) /
+7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B)
on AlpacaEval2 win rate over base models, outperforming strong baseline methods
such as iterative DPO and SPIN. At larger scales, the improvements are
particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on
WildBench. Further analysis shows that DRIFT also preserves exploratory
capacity, yielding more diverse high-reward solutions rather than collapsing to
narrow subsets. Theoretically, we demonstrate that this design preserves
preference margins and avoids gradient degeneration. These results show
that DRIFT is an effective and scalable recipe for real-world post-training
that leverages the most abundant and informative signal. The code and data are
available at https://github.com/cacayaya/DRIFT.git.
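
To make the training recipe described above concrete, below is a minimal sketch of one DRIFT-style round, assuming a standard DPO objective: each logged DSAT response serves as a fixed negative, while the positive is re-sampled from the current (evolving) policy. This is not the authors' exact implementation (see the linked repository); `generate` and `reward_fn` are hypothetical placeholders for on-policy sampling and candidate scoring.

```python
# Minimal sketch of a DRIFT-style iteration (illustrative, not the official code).
# Negatives come from logged user-dissatisfaction (DSAT) turns; positives are
# re-sampled from the evolving policy each round, then optimized with DPO.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos, policy_logp_neg,
             ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss on per-example sequence log-probabilities."""
    pos_margin = policy_logp_pos - ref_logp_pos
    neg_margin = policy_logp_neg - ref_logp_neg
    return -F.logsigmoid(beta * (pos_margin - neg_margin)).mean()

def build_drift_pairs(prompts, dsat_responses, generate, reward_fn, k=4):
    """Anchor each preference pair on a real DSAT negative; take the best of
    k on-policy samples as this round's positive. `generate` and `reward_fn`
    are hypothetical interfaces for the policy and a scoring model."""
    pairs = []
    for prompt, neg in zip(prompts, dsat_responses):
        candidates = [generate(prompt) for _ in range(k)]
        pos = max(candidates, key=lambda c: reward_fn(prompt, c))
        pairs.append((prompt, pos, neg))
    return pairs

# Toy check of the loss on random log-probabilities.
if __name__ == "__main__":
    lp = torch.randn(8)
    print(dpo_loss(lp + 1.0, lp - 1.0, lp, lp).item())
```

Because the positive side of each pair is refreshed from the current policy rather than taken from a fixed SAT set, the pair keeps a nonzero preference margin as the policy improves, which is the property the abstract's theoretical claim about avoiding gradient degeneration refers to.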