WildReward: Learning Reward Models from In-the-Wild Human Interactions
February 9, 2026
Authors: Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Lei Hou, Juanzi Li
cs.AI
Abstract
Reward models (RMs) are crucial for training large language models (LLMs), yet they typically rely on large-scale human-annotated preference pairs. With the widespread deployment of LLMs, in-the-wild interactions have emerged as a rich source of implicit reward signals. This raises the question: Can we develop reward models directly from in-the-wild interactions? In this work, we explore this possibility by adopting WildChat as an interaction source and proposing a pipeline to extract reliable human feedback, yielding 186k high-quality instances. On these instances, we train WildReward via ordinal regression directly on user feedback, without preference pairs. Extensive experiments demonstrate that WildReward matches or even surpasses conventional reward models, while offering improved calibration and cross-sample consistency. We also observe that WildReward benefits directly from user diversity, with more users yielding stronger reward models. Finally, we apply WildReward to online DPO training and observe significant improvements across various tasks. Code and data are released at https://github.com/THU-KEG/WildReward.
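The abstract describes training the reward model with ordinal regression on graded user feedback rather than a pairwise preference loss. The snippet below is a minimal, hypothetical sketch of one common way to do this (a CORAL-style cumulative-threshold head in PyTorch); the hidden size, number of feedback levels, pooling, and all names are illustrative assumptions and not the authors' released implementation.

```python
# Hypothetical sketch: a scalar reward head trained with ordinal regression
# on K-level user-feedback ratings (no preference pairs). Not the authors'
# code; hidden size, K=5 levels, and threshold parameterization are assumed.
import torch
import torch.nn as nn

class OrdinalRewardHead(nn.Module):
    """Maps a pooled LLM hidden state to a scalar reward and compares it
    against K-1 learnable thresholds (cumulative-link / CORAL-style)."""
    def __init__(self, hidden_size: int, num_levels: int = 5):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)                 # scalar reward r(x, y)
        self.thresholds = nn.Parameter(torch.arange(num_levels - 1).float())

    def forward(self, pooled_hidden: torch.Tensor) -> torch.Tensor:
        return self.scorer(pooled_hidden).squeeze(-1)           # shape: (batch,)

    def loss(self, reward: torch.Tensor, level: torch.Tensor) -> torch.Tensor:
        # Binary targets: for each threshold k, is the true level > k?
        targets = (level.unsqueeze(1) > torch.arange(
            self.thresholds.numel(), device=level.device)).float()
        # Logits: reward minus each threshold, broadcast to (batch, K-1).
        logits = reward.unsqueeze(1) - self.thresholds
        return nn.functional.binary_cross_entropy_with_logits(logits, targets)

# Usage sketch: pooled hidden states from a backbone LLM and 0..4 ratings.
head = OrdinalRewardHead(hidden_size=4096, num_levels=5)
hidden = torch.randn(8, 4096)                # stand-in for pooled features
ratings = torch.randint(0, 5, (8,))          # user-feedback levels
loss = head.loss(head(hidden), ratings)
loss.backward()
```

The design choice this illustrates is that a single scalar score, thresholded into ordered feedback levels, can be learned directly from per-response ratings, so no second response is needed to form a training pair.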