
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

July 2, 2025
作者: Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou
cs.AI

Abstract

Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.
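The abstract only summarizes the two-stage human-AI pipeline at a high level: large language models curate automatically under human guidance, and humans verify the results. The sketch below is a minimal, hypothetical illustration of such a loop; `llm_judge`, `human_verify`, the keep-heuristic, and the 0.8 audit threshold are assumptions for illustration, not components described in the paper.

```python
# Minimal, hypothetical sketch of a two-stage human-AI curation loop.
# `llm_judge`, `human_verify`, the keep-heuristic, and the 0.8 audit threshold
# are illustrative assumptions, not components described in the paper.
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response preferred by the original annotation
    rejected: str  # response dis-preferred by the original annotation


def llm_judge(pair: PreferencePair, guidance: str) -> bool:
    """Stand-in for an LLM that re-checks, under human-written guidance,
    whether `chosen` really should beat `rejected`; True keeps the pair."""
    return len(pair.chosen.strip()) > 0 and pair.chosen != pair.rejected


def human_verify(sample: list[PreferencePair]) -> float:
    """Stand-in for human annotators auditing a small sample of the
    LLM-curated pairs; returns the measured agreement rate."""
    return 0.9  # placeholder value


def curate(pairs: list[PreferencePair], guidance: str, rounds: int = 2) -> list[PreferencePair]:
    kept = list(pairs)
    for _ in range(rounds):
        # Stage 1: automatic curation by the LLM under the current guidance.
        kept = [p for p in kept if llm_judge(p, guidance)]
        if not kept:
            break
        # Stage 2: humans audit a small random sample; low agreement would
        # trigger a revision of the guidance before the next pass.
        audit = random.sample(kept, min(8, len(kept)))
        if human_verify(audit) < 0.8:
            guidance += "; tightened after human audit"
    return kept
```

The point of the loop structure is that human effort is spent only on small audits and on refining the guidance, while the LLM applies that guidance across tens of millions of pairs.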
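The abstract also states that the Skywork-Reward-V2 models are trained on the curated preference pairs but does not spell out the training objective. The snippet below shows the standard Bradley-Terry pairwise loss commonly used for reward models, offered as an assumed example rather than the authors' confirmed recipe.

```python
# Assumed example: the standard Bradley-Terry pairwise objective often used to
# train reward models on (chosen, rejected) pairs. The abstract does not state
# the exact loss Skywork-Reward-V2 uses, so this is illustrative only.
import torch
import torch.nn.functional as F


def pairwise_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()


# Example with scalar rewards for a batch of four preference pairs.
loss = pairwise_rm_loss(torch.tensor([1.2, 0.3, 0.8, 2.0]),
                        torch.tensor([0.4, 0.5, -0.1, 1.1]))
print(loss.item())
```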