Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

Abstract

Language models (LMs) often exhibit undesirable text generation behaviors, including generating false, toxic, or irrelevant outputs. Reinforcement learning from human feedback (RLHF), in which human preference judgments on LM outputs are transformed into a learning signal, has recently shown promise in addressing these issues. However, such holistic feedback conveys limited information about long text outputs: it does not indicate which aspects of an output influenced user preference, e.g., which parts contain which type(s) of errors. In this paper, we use fine-grained human feedback (e.g., which sentence is false, which sub-sentence is irrelevant) as an explicit training signal. We introduce Fine-Grained RLHF, a framework that enables training and learning from reward functions that are fine-grained in two respects: (1) density, providing a reward after every segment (e.g., a sentence) is generated; and (2) incorporating multiple reward models associated with different feedback types (e.g., factual incorrectness, irrelevance, and information incompleteness). We conduct experiments on detoxification and long-form question answering to illustrate how learning with such reward functions leads to improved performance, supported by both automatic and human evaluation. Additionally, we show that LM behaviors can be customized using different combinations of fine-grained reward models. We release all data, collected human feedback, and code at https://FineGrainedRLHF.github.io.
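To make the two notions of granularity concrete, the following is a minimal sketch and not the released implementation. It assumes one shared segmentation for all feedback types, and it uses hypothetical keyword rules (toy_factuality, toy_relevance) in place of learned reward models; the names dense_rewards and weights are likewise illustrative. Each segment receives its own reward, computed as a weighted sum over the per-type reward models.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type: a reward model maps one text segment to a scalar score.
RewardModel = Callable[[str], float]


def dense_rewards(
    segments: Sequence[str],
    reward_models: Dict[str, RewardModel],
    weights: Dict[str, float],
) -> List[float]:
    """Return one combined reward per segment (weighted sum over feedback types)."""
    rewards = []
    for segment in segments:
        total = sum(
            weights[name] * model(segment)
            for name, model in reward_models.items()
        )
        rewards.append(total)
    return rewards


# Toy stand-ins for learned reward models (assumption: simple keyword rules).
def toy_factuality(segment: str) -> float:
    return -1.0 if "flat" in segment else 1.0


def toy_relevance(segment: str) -> float:
    return -1.0 if "weather" in segment else 1.0


segments = [
    "The Earth orbits the Sun.",
    "The Earth is flat.",
    "The weather is nice today.",
]
models = {"factuality": toy_factuality, "relevance": toy_relevance}
weights = {"factuality": 1.0, "relevance": 0.5}

rewards = dense_rewards(segments, models, weights)
print(rewards)  # [1.5, -0.5, 0.5]
```

A holistic reward would collapse these three values into a single score for the whole output, hiding which sentence caused the penalty. Changing the entries of weights is one simple way to picture how different combinations of reward models could steer the LM toward different behaviors.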
