

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

June 2, 2023
作者: Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi
cs.AI

Abstract

Language models (LMs) often exhibit undesirable text generation behaviors, including generating false, toxic, or irrelevant outputs. Reinforcement learning from human feedback (RLHF) - where human preference judgments on LM outputs are transformed into a learning signal - has recently shown promise in addressing these issues. However, such holistic feedback conveys limited information on long text outputs; it does not indicate which aspects of the outputs influenced user preference; e.g., which parts contain what type(s) of errors. In this paper, we use fine-grained human feedback (e.g., which sentence is false, which sub-sentence is irrelevant) as an explicit training signal. We introduce Fine-Grained RLHF, a framework that enables training and learning from reward functions that are fine-grained in two respects: (1) density, providing a reward after every segment (e.g., a sentence) is generated; and (2) incorporating multiple reward models associated with different feedback types (e.g., factual incorrectness, irrelevance, and information incompleteness). We conduct experiments on detoxification and long-form question answering to illustrate how learning with such reward functions leads to improved performance, supported by both automatic and human evaluation. Additionally, we show that LM behaviors can be customized using different combinations of fine-grained reward models. We release all data, collected human feedback, and code at https://FineGrainedRLHF.github.io.
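To make the two kinds of fine-grainedness in the abstract concrete, below is a minimal sketch (not the released FineGrainedRLHF code) of how a dense, multi-model reward could be assembled: each generated segment receives one combined score from several reward models, mixed with user-chosen weights. The class name `FineGrainedReward`, the toy reward functions, and the weight values are all illustrative assumptions, standing in for the trained factuality/relevance/completeness reward models described in the paper.

```python
# Sketch: dense, multi-model reward combination for fine-grained RLHF.
# Names, reward models, and weights are illustrative, not the paper's code.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FineGrainedReward:
    # Each reward model scores one segment (e.g., a sentence) given the context.
    reward_models: List[Callable[[str, str], float]]  # (context, segment) -> score
    weights: List[float]                               # one weight per reward model

    def segment_rewards(self, context: str, segments: List[str]) -> List[float]:
        """Return a dense reward: one combined score after every segment."""
        rewards = []
        running_context = context
        for seg in segments:
            score = sum(
                w * rm(running_context, seg)
                for rm, w in zip(self.reward_models, self.weights)
            )
            rewards.append(score)
            running_context += " " + seg  # later segments are scored in context
        return rewards


# Toy stand-ins for trained reward models (e.g., relevance and factuality).
def toy_relevance(context: str, segment: str) -> float:
    return 1.0 if segment.strip() else 0.0


def toy_factuality(context: str, segment: str) -> float:
    return 0.5  # placeholder; a real model would score factual consistency


fg = FineGrainedReward(reward_models=[toy_relevance, toy_factuality],
                       weights=[0.3, 0.7])
print(fg.segment_rewards("Q: ...", ["First sentence.", "Second sentence."]))
```

Because the per-segment score is a weighted sum over reward models, adjusting the weights is one natural way to realize the customization the abstract mentions, e.g., trading off relevance against completeness.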