DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

September 27, 2025
Authors: Yifan Wang, Bolian Li, Junlin Wu, Zhaoxuan Tan, Zheli Liu, Ruqi Zhang, Ananth Grama, Qingkai Zeng
cs.AI

Abstract

Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce DRIFT (Dissatisfaction-Refined Iterative preFerence Training), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on the real-world WildFeedback dataset and the synthetic UltraFeedback dataset achieve gains of up to +6.23% (7B) / +7.61% (14B) in WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) in AlpacaEval2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. At larger scales, the improvements are particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we show that this design preserves preference margins and avoids gradient degeneration. These results establish DRIFT as an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal. The code and data are available at https://github.com/cacayaya/DRIFT.git.
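To make the training recipe concrete, below is a minimal Python sketch of one DRIFT-style iteration, based only on the abstract's description: logged DSAT responses serve as fixed negatives, positives are sampled from the evolving policy, and the pairs feed a DPO-style update. The helpers `generate`, `score`, and `dpo_update` are hypothetical stand-ins for the deployed model, a preference judge, and the optimizer step; this is not the authors' implementation.

```python
# Minimal sketch of one DRIFT-style iteration (illustrative only, not the authors' code).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DsatExample:
    prompt: str    # user request from deployment logs
    rejected: str  # response the user was dissatisfied with (DSAT signal)


def drift_iteration(
    dsat_data: List[DsatExample],
    generate: Callable[[str, int], List[str]],          # current policy: prompt -> k candidates
    score: Callable[[str, str], float],                  # judge: (prompt, response) -> scalar
    dpo_update: Callable[[List[Tuple[str, str, str]]], None],  # step over (prompt, chosen, rejected)
    k: int = 4,
) -> None:
    """Anchor negatives on logged DSAT responses; sample positives from the evolving policy."""
    pairs: List[Tuple[str, str, str]] = []
    for ex in dsat_data:
        candidates = generate(ex.prompt, k)                       # on-policy samples
        chosen = max(candidates, key=lambda c: score(ex.prompt, c))
        # Keep the pair only if the sampled positive actually beats the DSAT response,
        # so the preference margin stays meaningful.
        if score(ex.prompt, chosen) > score(ex.prompt, ex.rejected):
            pairs.append((ex.prompt, chosen, ex.rejected))
    dpo_update(pairs)                                             # one preference-training step
```

Because positives are re-sampled from the current policy at every iteration while the DSAT negatives stay fixed, the contrast between chosen and rejected responses is refreshed as the model improves, which is the property the abstract credits for preserving preference margins.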