Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
June 2, 2023
Authors: Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi
cs.AI
Abstract
Language models (LMs) often exhibit undesirable text generation behaviors,
including generating false, toxic, or irrelevant outputs. Reinforcement
learning from human feedback (RLHF) - where human preference judgments on LM
outputs are transformed into a learning signal - has recently shown promise in
addressing these issues. However, such holistic feedback conveys limited
information on long text outputs; it does not indicate which aspects of the
outputs influenced user preference; e.g., which parts contain what type(s) of
errors. In this paper, we use fine-grained human feedback (e.g., which sentence
is false, which sub-sentence is irrelevant) as an explicit training signal. We
introduce Fine-Grained RLHF, a framework that enables training and learning
from reward functions that are fine-grained in two respects: (1) density,
providing a reward after every segment (e.g., a sentence) is generated; and (2)
incorporating multiple reward models associated with different feedback types
(e.g., factual incorrectness, irrelevance, and information incompleteness). We
conduct experiments on detoxification and long-form question answering to
illustrate how learning with such reward functions leads to improved
performance, supported by both automatic and human evaluation. Additionally, we
show that LM behaviors can be customized using different combinations of
fine-grained reward models. We release all data, collected human feedback, and
code at https://FineGrainedRLHF.github.io.
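
To make the two fine-grained axes concrete, the sketch below combines several segment-level reward models into one dense reward per sentence. It is a minimal illustration only: the class and function names, the mixing weights, and the toy keyword scoring rules are hypothetical and do not reproduce the released Fine-Grained RLHF implementation.

```python
# Minimal sketch: dense, multi-type rewards for a generated response.
# Assumes hypothetical segment-level reward models; in practice each
# scoring function would be a learned reward model, not a keyword rule.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FineGrainedReward:
    name: str                                    # e.g. "relevance", "factuality"
    weight: float                                # mixing coefficient for this feedback type
    score_segment: Callable[[str, str], float]   # (prompt, segment) -> reward


def segment_rewards(prompt: str, segments: List[str],
                    reward_models: List[FineGrainedReward]) -> List[float]:
    """Return one combined reward per generated segment (e.g. per sentence).

    Every reward model scores each segment as it is generated (density),
    and the per-segment reward is a weighted sum across feedback types.
    """
    combined = []
    for seg in segments:
        r = sum(rm.weight * rm.score_segment(prompt, seg) for rm in reward_models)
        combined.append(r)
    return combined


if __name__ == "__main__":
    # Toy reward models standing in for learned classifiers.
    rms = [
        FineGrainedReward("relevance", 0.5,
                          lambda q, s: 1.0 if "feedback" in s.lower() else -1.0),
        FineGrainedReward("toxicity", 0.5,
                          lambda q, s: -1.0 if "stupid" in s.lower() else 0.0),
    ]
    prompt = "Why use fine-grained feedback?"
    segs = ["Fine-grained feedback localizes errors.", "That question is stupid."]
    print(segment_rewards(prompt, segs, rms))  # [0.5, -1.0]
```

A per-segment weighted sum mirrors the abstract's two ideas: density (one reward after each sentence rather than one holistic score for the whole output) and multiple reward models (one term per feedback type, whose weights can be varied to customize LM behavior).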