Stabilizing RLHF through Advantage Model and Selective Rehearsal

September 18, 2023
Authors: Baolin Peng, Linfeng Song, Ye Tian, Lifeng Jin, Haitao Mi, Dong Yu
cs.AI

Abstract

Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences using RLHF remains a significant challenge. This challenge is characterized by various instabilities, such as reward hacking and catastrophic forgetting. In this technical report, we propose two innovations to stabilize RLHF training: 1) Advantage Model, which directly models the advantage score, i.e., the extra reward compared to the expected reward, and regulates score distributions across tasks to prevent reward hacking; and 2) Selective Rehearsal, which mitigates catastrophic forgetting by strategically selecting data for PPO training and knowledge rehearsal. Our experimental analysis on public and proprietary datasets reveals that the proposed methods not only increase stability in RLHF training but also achieve higher reward scores and win rates.
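To make the two ideas in the abstract concrete, below is a minimal Python sketch, not the authors' implementation. It assumes a per-task mean reward as the "expected reward" baseline, per-task standardization as the "regulation" of score distributions, and a simple top-fraction filter plus mixed-in earlier examples for selective rehearsal; the function names, the `keep_fraction` parameter, and the rehearsal pool are illustrative assumptions.

```python
# Illustrative sketch of "advantage score" and "selective rehearsal" as described
# in the abstract. Baselines, normalization, and selection rules are assumptions,
# not the paper's actual method.
from collections import defaultdict
import random


def advantage_scores(samples):
    """Score each (task, reward) pair as reward minus a per-task expected reward,
    then standardize within the task so score distributions are comparable
    across tasks."""
    by_task = defaultdict(list)
    for task, reward in samples:
        by_task[task].append(reward)

    stats = {}
    for task, rewards in by_task.items():
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        stats[task] = (mean, max(var ** 0.5, 1e-6))

    return [
        (task, (reward - stats[task][0]) / stats[task][1])
        for task, reward in samples
    ]


def selective_rehearsal(ppo_pool, sft_pool, keep_fraction=0.5, rehearse_k=2):
    """Keep only the highest-scoring PPO examples and mix in a small batch of
    earlier SFT examples so previously learned behavior is rehearsed."""
    ranked = sorted(ppo_pool, key=lambda x: x[1], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]
    rehearsed = random.sample(sft_pool, min(rehearse_k, len(sft_pool)))
    return kept, rehearsed


if __name__ == "__main__":
    samples = [("qa", 1.2), ("qa", 0.4), ("summarize", 3.0), ("summarize", 2.5)]
    scored = advantage_scores(samples)
    kept, rehearsed = selective_rehearsal(
        scored, ["sft_example_1", "sft_example_2", "sft_example_3"]
    )
    print(scored, kept, rehearsed, sep="\n")
```

The per-task standardization is what keeps one task's raw reward scale from dominating the PPO signal, which is the failure mode the abstract associates with reward hacking; the rehearsed SFT examples address the catastrophic-forgetting side.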