Axiomatic Preference Modeling for Longform Question Answering
December 2, 2023
Authors: Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul Bennett
cs.AI
Abstract
The remarkable abilities of large language models (LLMs) like GPT-4 partially
stem from post-training processes like Reinforcement Learning from Human
Feedback (RLHF) involving human preferences encoded in a reward model. However,
these reward models (RMs) often lack direct knowledge of why, or under what
principles, the preference annotations were made. In this study, we identify
principles that guide RMs to better align with human preferences, and then
develop an axiomatic framework to generate a rich variety of preference signals
to uphold them. We use these axiomatic signals to train a model for scoring
answers to longform questions. Our approach yields a Preference Model with only
about 220M parameters that agrees with gold human-annotated preference labels
more often than GPT-4. The contributions of this work include: training a
standalone preference model that can score human- and LLM-generated answers on
the same scale; developing an axiomatic framework for generating training data
pairs tailored to certain principles; and showing that a small amount of
axiomatic signals can help small models outperform GPT-4 in preference scoring.
We release our model on Hugging Face:
https://huggingface.co/corbyrosset/axiomatic_preference_model
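
As a rough illustration of how such a standalone preference model might be used to score answers to a longform question, the sketch below loads the released checkpoint with the Hugging Face transformers library. The use of AutoModelForSequenceClassification, the question/answer pair formatting, and the single-logit score head are assumptions made for illustration, not documented details of the released model.

```python
# Minimal sketch: scoring (question, answer) pairs with the released preference model.
# Assumptions (not confirmed by the abstract): the checkpoint loads as a cross-encoder
# sequence classifier with a single-logit score head, and the question and answer are
# supplied as a standard text pair.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "corbyrosset/axiomatic_preference_model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def preference_score(question: str, answer: str) -> float:
    """Return a scalar preference score for an answer to a longform question."""
    inputs = tokenizer(question, answer, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()

# Human- and LLM-generated answers can be scored on the same scale and compared directly.
q = "Why is the sky blue?"
score_a = preference_score(q, "Rayleigh scattering preferentially scatters shorter wavelengths of sunlight...")
score_b = preference_score(q, "Because it reflects the ocean.")
print(score_a > score_b)
```

Because both answers are scored independently on the same scale, the model can rank any mix of human and LLM responses rather than only judging pairs side by side.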