

HelpSteer2-Preference: Complementing Ratings with Preferences

October 2, 2024
Authors: Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, Yi Dong
cs.AI

Abstract

Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than the other, when adequately matched for data. This is primarily because these approaches require data collected in different (but incompatible) formats, meaning that adequately matched data is not available in existing public datasets. To tackle this problem, we release preference annotations (designed for Bradley-Terry training) to complement existing ratings (designed for Regression style training) in the HelpSteer2 dataset. To improve data interpretability, preference annotations are accompanied with human-written justifications. Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. Based on insights derived from such a comparison, we propose a novel approach to combine Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on RewardBench, emerging top of more than 140 reward models as of 1 Oct 2024. We also demonstrate the effectiveness of this reward model at aligning models to follow instructions in RLHF. We open-source this dataset (CC-BY-4.0 license) at https://huggingface.co/datasets/nvidia/HelpSteer2 and openly release the trained Reward Model at https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward
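To make the two paradigms contrasted in the abstract concrete, below is a minimal sketch (not the paper's exact training code) of the loss functions typically used for each: a Bradley-Terry objective over chosen/rejected response pairs, and a regression objective over absolute human ratings such as the HelpSteer2 helpfulness scores. Tensor names and the toy usage are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style: push the reward of the preferred (chosen) response
    above the rejected one via a log-sigmoid on their margin."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


def regression_loss(predicted_rating: torch.Tensor,
                    human_rating: torch.Tensor) -> torch.Tensor:
    """Regression style: fit the scalar reward head to absolute human ratings
    (e.g. 0-4 helpfulness scores) with mean squared error."""
    return F.mse_loss(predicted_rating, human_rating)


# Toy usage: random scalars standing in for reward-head outputs and labels.
if __name__ == "__main__":
    r_chosen, r_rejected = torch.randn(8), torch.randn(8)
    pred_rating = torch.randn(8)
    gold_rating = torch.randint(0, 5, (8,)).float()
    print("BT loss:", bradley_terry_loss(r_chosen, r_rejected).item())
    print("Regression loss:", regression_loss(pred_rating, gold_rating).item())
```

The paper's contribution is a dataset that supports both objectives on the same prompts and responses, enabling a matched comparison and a combined reward-modeling recipe; the sketch above only illustrates the standard forms of the two individual losses.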

