

Stabilizing RLHF through Advantage Model and Selective Rehearsal

September 18, 2023
Authors: Baolin Peng, Linfeng Song, Ye Tian, Lifeng Jin, Haitao Mi, Dong Yu
cs.AI

Abstract

Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences using RLHF remains a significant challenge. This challenge is characterized by various instabilities, such as reward hacking and catastrophic forgetting. In this technical report, we propose two innovations to stabilize RLHF training: 1) Advantage Model, which directly models the advantage score, i.e., the extra reward relative to the expected reward, and regulates score distributions across tasks to prevent reward hacking; 2) Selective Rehearsal, which mitigates catastrophic forgetting by strategically selecting data for PPO training and knowledge rehearsal. Our experimental analysis on public and proprietary datasets reveals that the proposed methods not only increase stability in RLHF training but also achieve higher reward scores and win rates.
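
The abstract's description of the Advantage Model suggests a simple intuition: score a response by how much its reward exceeds the expected reward, and keep those scores on a comparable scale across tasks. The sketch below is a minimal, hypothetical illustration of that intuition only; the function name `advantage_scores`, the per-task mean baseline, and the standard-deviation normalization are assumptions made for illustration, not the paper's actual formulation.

```python
from typing import Dict, List

def advantage_scores(rewards_by_task: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Toy advantage-style score: subtract each task's mean reward (a stand-in
    for the expected reward) and rescale by the task's standard deviation so
    that score distributions stay comparable across tasks.
    NOTE: an assumption-laden sketch, not the paper's Advantage Model."""
    scores: Dict[str, List[float]] = {}
    for task, rewards in rewards_by_task.items():
        mean = sum(rewards) / len(rewards)                   # expected-reward proxy
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        std = var ** 0.5
        if std == 0.0:
            std = 1.0                                        # avoid division by zero
        scores[task] = [(r - mean) / std for r in rewards]   # regulated advantage score
    return scores

# Example: two tasks whose raw reward scales differ by an order of magnitude
# end up with advantage scores on the same scale.
if __name__ == "__main__":
    raw = {
        "summarization": [1.0, 2.0, 3.0],
        "open_qa": [10.0, 20.0, 30.0],
    }
    print(advantage_scores(raw))
```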