Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
July 2, 2025
Authors: Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou
cs.AI
Abstract
Despite the critical role of reward models (RMs) in reinforcement learning
from human feedback (RLHF), current state-of-the-art open RMs perform poorly on
most existing evaluation benchmarks, failing to capture the spectrum of nuanced
and sophisticated human preferences. Even approaches that incorporate advanced
training techniques have not yielded meaningful performance improvements. We
hypothesize that this brittleness stems primarily from limitations in
preference datasets, which are often narrowly scoped, synthetically labeled, or
lack rigorous quality control. To address these challenges, we present a
large-scale preference dataset comprising 40 million preference pairs, named
SynPref-40M. To enable data curation at scale, we design a human-AI synergistic
two-stage pipeline that leverages the complementary strengths of human
annotation quality and AI scalability. In this pipeline, humans provide
verified annotations, while large language models perform automatic curation
based on human guidance. Training on this preference mixture, we introduce
Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B
parameters, trained on a carefully curated subset of 26 million preference
pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile
across a wide range of capabilities, including alignment with human
preferences, objective correctness, safety, resistance to stylistic biases, and
best-of-N scaling, achieving state-of-the-art performance across seven major
reward model benchmarks. Ablation studies confirm that the effectiveness of our
approach stems not only from data scale but also from high-quality curation.
The Skywork-Reward-V2 series represents substantial progress in open reward
models, highlighting the untapped potential of existing preference datasets and
demonstrating how human-AI curation synergy can unlock significantly higher
data quality.
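
The abstract lists best-of-N scaling among the evaluated capabilities. As a rough illustration of how a sequence-classification reward model is typically applied for best-of-N selection at inference time, the sketch below scores each candidate response and keeps the highest-scoring one. The model ID, loading arguments, and chat-template usage are assumptions for illustration and may differ from the released checkpoints' documented usage.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative model ID; check the released Skywork-Reward-V2 collection for exact names.
MODEL_ID = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    num_labels=1,  # a single scalar reward head (assumed setup)
)
model.eval()


def score(prompt: str, response: str) -> float:
    """Score one (prompt, response) pair with the reward model."""
    conv = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    input_ids = tokenizer.apply_chat_template(
        conv, tokenize=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        return model(input_ids).logits[0][0].item()


def best_of_n(prompt: str, candidates: list[str]) -> str:
    """Best-of-N selection: return the candidate with the highest reward score."""
    return max(candidates, key=lambda c: score(prompt, c))
```

In a best-of-N evaluation, the candidates would be N samples drawn from a policy model for the same prompt; stronger reward models should pick better responses more often as N grows.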