Stabilizing RLHF through Advantage Model and Selective Rehearsal

Abstract

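Large language models (LLMs) have transformed natural language processing, but using RLHF (Reinforcement Learning from Human Feedback) to align them with human values and preferences remains a major challenge. This challenge is marked by instabilities such as reward hacking and catastrophic forgetting. This technical report proposes two methods for stabilizing RLHF training: 1) **Advantage Model**: this model directly models the advantage score, that is, the extra reward relative to the expected reward, and regulates the score distribution across tasks to prevent reward hacking. 2) **Selective Rehearsal**: this method mitigates catastrophic forgetting by strategically selecting data for PPO (Proximal Policy Optimization) training and knowledge rehearsal. Experiments on public and proprietary datasets show that the proposed methods not only make RLHF training more stable but also achieve higher reward scores and win rates.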

English

Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences using RLHF remains a significant challenge. This challenge is characterized by various instabilities, such as reward hacking and catastrophic forgetting. In this technical report, we propose two innovations to stabilize RLHF training: 1) Advantage Model, which directly models the advantage score, i.e., the extra reward relative to the expected reward, and regulates score distributions across tasks to prevent reward hacking. 2) Selective Rehearsal, which mitigates catastrophic forgetting by strategically selecting data for PPO training and knowledge rehearsal. Our experimental analysis on public and proprietary datasets reveals that the proposed methods not only increase stability in RLHF training but also achieve higher reward scores and win rates.
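
The notion of an advantage score as "extra reward relative to the expected reward" can be illustrated with a minimal sketch. This is not the report's Advantage Model, which is a trained model. As an assumption for illustration, the expected reward of each task is approximated here by the empirical mean of sampled rewards for that task, and the function name `advantage_scores` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def advantage_scores(samples):
    """Return (task, advantage) pairs for a list of (task, reward) pairs.

    Assumption: the expected reward of a task is estimated by the mean
    reward observed for that task. The advantage is reward minus that mean.
    """
    rewards_by_task = defaultdict(list)
    for task, reward in samples:
        rewards_by_task[task].append(reward)

    # Per-task baseline standing in for the expected reward.
    expected = {task: mean(rs) for task, rs in rewards_by_task.items()}

    return [(task, reward - expected[task]) for task, reward in samples]


# Hypothetical rewards: the "code" task has a much larger raw reward scale.
samples = [("qa", 2.0), ("qa", 4.0), ("code", 10.0), ("code", 14.0)]
scores = advantage_scores(samples)
```

Subtracting a per-task baseline centers every task's scores at zero, so a task with a large raw reward scale no longer dominates simply because of its scale. That is one simple way to read "regulating score distributions across tasks".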
