WildReward: Learning Reward Models from In-the-Wild Human Interactions
February 9, 2026
Authors: Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Lei Hou, Juanzi Li
cs.AI
Abstract
Reward models (RMs) are crucial for training large language models (LLMs), yet they typically rely on large-scale human-annotated preference pairs. With the widespread deployment of LLMs, in-the-wild interactions have emerged as a rich source of implicit reward signals. This raises the question: can we develop reward models directly from in-the-wild interactions? In this work, we explore this possibility by adopting WildChat as an interaction source and proposing a pipeline that extracts reliable human feedback, yielding 186k high-quality instances used to train WildReward via ordinal regression directly on user feedback, without preference pairs. Extensive experiments demonstrate that WildReward matches or surpasses conventional reward models while offering improved calibration and cross-sample consistency. We also observe that WildReward benefits directly from user diversity: more users yield stronger reward models. Finally, we apply WildReward to online DPO training and observe significant improvements across various tasks. Code and data are released at https://github.com/THU-KEG/WildReward.
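To illustrate the idea of training a reward model on scalar user feedback with ordinal regression rather than preference pairs, the sketch below fits a reward head with a cumulative-link (threshold) loss over ordered feedback levels. This is a minimal sketch, not the released WildReward code; the `OrdinalRewardHead` class, the `encode` helper, the choice of K = 5 feedback levels, and the hidden size are illustrative assumptions.

```python
# Minimal sketch: ordinal regression (cumulative-link model) on ordered user-feedback
# levels, producing a scalar reward per response. Not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 5  # assumed number of ordered user-feedback levels (e.g., 1-5 ratings)

class OrdinalRewardHead(nn.Module):
    """Scores a response representation with a scalar reward and fits it to
    ordered feedback levels via a cumulative-link (threshold) ordinal loss."""

    def __init__(self, hidden_size: int, num_levels: int = K):
        super().__init__()
        self.reward = nn.Linear(hidden_size, 1)        # scalar reward r(x, y)
        self.base = nn.Parameter(torch.tensor(0.0))    # first threshold
        # Positive increments keep the remaining thresholds strictly ordered.
        self.delta = nn.Parameter(torch.zeros(num_levels - 2))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.reward(hidden).squeeze(-1)         # (batch,)

    def thresholds(self) -> torch.Tensor:
        upper = self.base + torch.cumsum(F.softplus(self.delta), dim=0)
        return torch.cat([self.base.view(1), upper])   # (K-1,) increasing cutpoints

    def ordinal_nll(self, hidden: torch.Tensor, level: torch.Tensor) -> torch.Tensor:
        """Negative log-likelihood of integer feedback levels in [0, K-1]."""
        r = self.forward(hidden)                                              # (batch,)
        cdf = torch.sigmoid(self.thresholds().unsqueeze(0) - r.unsqueeze(1))  # P(y <= k)
        ones = torch.ones_like(r).unsqueeze(1)
        zeros = torch.zeros_like(r).unsqueeze(1)
        probs = torch.cat([cdf, ones], 1) - torch.cat([zeros, cdf], 1)        # (batch, K)
        return -torch.log(probs.gather(1, level.unsqueeze(1)).clamp_min(1e-8)).mean()

# Hypothetical usage with an LM encoder returning last-token hidden states and
# integer feedback levels (dtype long) extracted from user interactions:
# head = OrdinalRewardHead(hidden_size=4096)
# loss = head.ordinal_nll(encode(prompts_and_responses), feedback_levels)
# loss.backward()
```

The cumulative-link formulation keeps the thresholds ordered and still yields a single scalar reward per response, which is what downstream uses such as online DPO sampling require.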