

Axiomatic Preference Modeling for Longform Question Answering

December 2, 2023
Authors: Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul Bennett
cs.AI

Abstract

The remarkable abilities of large language models (LLMs) like GPT-4 partially stem from post-training processes like Reinforcement Learning from Human Feedback (RLHF) involving human preferences encoded in a reward model. However, these reward models (RMs) often lack direct knowledge of why, or under what principles, the preference annotations were made. In this study, we identify principles that guide RMs to better align with human preferences, and then develop an axiomatic framework to generate a rich variety of preference signals to uphold them. We use these axiomatic signals to train a model for scoring answers to longform questions. Our approach yields a Preference Model with only about 220M parameters that agrees with gold human-annotated preference labels more often than GPT-4. The contributions of this work include: training a standalone preference model that can score human- and LLM-generated answers on the same scale; developing an axiomatic framework for generating training data pairs tailored to certain principles; and showing that a small amount of axiomatic signals can help small models outperform GPT-4 in preference scoring. We release our model on huggingface: https://huggingface.co/corbyrosset/axiomatic_preference_model
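The abstract describes a standalone preference model that scores human- and LLM-generated answers to longform questions on a single scale. Below is a minimal usage sketch for the released checkpoint, assuming it is compatible with the standard Hugging Face `transformers` sequence-classification interface and returns a single scalar logit per (question, answer) pair; the actual input encoding and model head used by the authors may differ, and the helper `score_answer` and the example questions are hypothetical.

```python
# Hypothetical sketch: load the released preference model and score answers.
# Assumes a sequence-classification head with one regression output; the
# paper's exact input format for (question, answer) pairs is not specified here.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "corbyrosset/axiomatic_preference_model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def score_answer(question: str, answer: str) -> float:
    """Return a scalar preference score for an answer to a longform question."""
    # Encode the question and answer as a text pair, as a sketch of the input format.
    inputs = tokenizer(question, answer, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()

question = "Why is the sky blue?"
human_answer = (
    "Sunlight scatters off air molecules; shorter (blue) wavelengths "
    "scatter more strongly, so the sky appears blue."
)
llm_answer = "The sky is blue because it reflects the color of the ocean."

# A higher score should indicate a preferred answer; both human- and
# LLM-generated answers are scored on the same scale, per the abstract.
print(score_answer(question, human_answer))
print(score_answer(question, llm_answer))
```

If the checkpoint exposes a different head or input convention, only the encoding inside `score_answer` would need to change; the scoring loop itself stays the same.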