HelpSteer2: Open-source dataset for training top-performing reward models

Abstract

High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, permissively licensed preference datasets, such as Open Assistant, HH-RLHF, and HelpSteer, need to be updated to remain effective for reward modeling. Methods that distill preference data from proprietary LLMs such as GPT-4 are subject to restrictions on commercial usage imposed by model providers. To improve upon both generated responses and attribute labeling quality, we release HelpSteer2, a permissively licensed preference dataset (CC-BY-4.0). Using a powerful internal base model trained on HelpSteer2, we achieve the state-of-the-art (SOTA) score of 92.0% on Reward-Bench's primary dataset, outperforming the open and proprietary models listed as of June 12th, 2024. Notably, HelpSteer2 consists of only ten thousand response pairs, an order of magnitude fewer than existing preference datasets (e.g., HH-RLHF), which makes it highly efficient for training reward models. Our extensive experiments demonstrate that reward models trained with HelpSteer2 are effective in aligning LLMs. In particular, we propose SteerLM 2.0, a model alignment approach that can effectively make use of the rich multi-attribute scores predicted by our reward models. HelpSteer2 is available at https://huggingface.co/datasets/nvidia/HelpSteer2 and code is available at https://github.com/NVIDIA/NeMo-Aligner.
