HelpSteer2: Open-source dataset for training top-performing reward models
June 12, 2024
Authors: Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, Oleksii Kuchaiev
cs.AI
Abstract
High-quality preference datasets are essential for training reward models
that can effectively guide large language models (LLMs) in generating
high-quality responses aligned with human preferences. As LLMs become stronger
and better aligned, permissively licensed preference datasets, such as Open
Assistant, HH-RLHF, and HelpSteer, need to be updated to remain effective for
reward modeling. Methods that distil preference data from proprietary LLMs such
as GPT-4 have restrictions on commercial usage imposed by model providers. To
improve upon both generated responses and attribute labeling quality, we
release HelpSteer2, a permissively licensed preference dataset (CC-BY-4.0).
Using a powerful internal base model trained on HelpSteer2, we are able to
achieve the SOTA score (92.0%) on Reward-Bench's primary dataset, outperforming
currently listed open and proprietary models, as of June 12th, 2024. Notably,
HelpSteer2 consists of only ten thousand response pairs, an order of magnitude
fewer than existing preference datasets (e.g., HH-RLHF), which makes it highly
efficient for training reward models. Our extensive experiments demonstrate
that reward models trained with HelpSteer2 are effective in aligning LLMs. In
particular, we propose SteerLM 2.0, a model alignment approach that can
effectively make use of the rich multi-attribute scores predicted by our reward
models. HelpSteer2 is available at
https://huggingface.co/datasets/nvidia/HelpSteer2 and code is available at
https://github.com/NVIDIA/NeMo-Aligner.
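For reference, the released dataset can be loaded directly from the Hugging Face Hub with the `datasets` library. The snippet below is a minimal loading sketch, not part of the paper's code release; the column names (`prompt`, `response`, and the five attribute scores carried over from the original HelpSteer schema: helpfulness, correctness, coherence, complexity, verbosity) are assumptions that should be verified against the dataset card at the URL above.

```python
from datasets import load_dataset

# Pull the training split of HelpSteer2 from the Hugging Face Hub.
ds = load_dataset("nvidia/HelpSteer2", split="train")
print(ds)  # row count and column names

# Inspect one annotated response. The attribute columns below are assumed
# to follow the original HelpSteer schema; check the dataset card if a
# key is missing.
row = ds[0]
print(row.get("prompt", "")[:200])
for attr in ("helpfulness", "correctness", "coherence", "complexity", "verbosity"):
    print(attr, row.get(attr))
```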