

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

July 2, 2025
作者: Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou
cs.AI

Abstract

Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.
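The two-stage human-AI pipeline described above (humans verify a subset of annotations, then a large language model auto-curates the remainder under that human guidance) can be pictured roughly as follows. This is a minimal illustrative sketch and not the authors' released pipeline; `llm_judge`, `human_review`, and the agreement threshold are hypothetical placeholders for an LLM-based verifier, a human annotation step, and a routing rule.

```python
# Minimal sketch of a human-in-the-loop curation pass over preference pairs.
# `llm_judge` and `human_review` are hypothetical stand-ins; they are not part
# of any released code for this paper.
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    verified: bool = False  # set True once a human has confirmed the label

def curate(
    pairs: Iterable[PreferencePair],
    llm_judge: Callable[[PreferencePair], float],    # agreement score in [0, 1]
    human_review: Callable[[PreferencePair], bool],  # True if the label is kept
    threshold: float = 0.9,
) -> List[PreferencePair]:
    """Humans verify the ambiguous slice; the LLM auto-curates the rest."""
    kept: List[PreferencePair] = []
    for pair in pairs:
        if llm_judge(pair) >= threshold:
            # High-confidence pairs are accepted automatically (AI scalability).
            kept.append(pair)
        elif human_review(pair):
            # Ambiguous pairs are escalated to human annotators (annotation quality).
            pair.verified = True
            kept.append(pair)
        # Pairs rejected by both paths are dropped from the training mixture.
    return kept
```

The design point the abstract emphasizes is the division of labor: the expensive human step is spent only where the automatic judge is uncertain, which is what allows curation to scale to tens of millions of pairs.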
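For context on how reward models are typically trained on such (chosen, rejected) preference pairs, the sketch below shows the standard Bradley-Terry pairwise objective in PyTorch. The paper does not state that this exact formulation is used; the scores and function name here are illustrative only.

```python
# Standard Bradley-Terry pairwise objective commonly used to train reward models
# on preference pairs; an illustration, not the paper's exact training recipe.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen response outranks the rejected one:
    L = -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scores a reward head might assign to a batch of three preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.1])
r_rejected = torch.tensor([0.4, 0.5, 1.0])
loss = pairwise_reward_loss(r_chosen, r_rejected)
print(loss)  # scalar loss; lower when chosen responses score above rejected ones
```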