WARM: On the Benefits of Weight Averaged Reward Models

Abstract

Aligning large language models (LLMs) with human preferences through reinforcement learning from human feedback (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward model (RM) to achieve seemingly high rewards without meeting the underlying objectives. We identify two primary challenges when designing RMs to mitigate reward hacking: distribution shifts during the RL process and inconsistencies in human preferences. As a solution, we propose Weight Averaged Reward Models (WARM): first fine-tune multiple RMs, then average them in the weight space. This strategy follows the observation that fine-tuned weights remain linearly mode connected when they share the same pre-training. By averaging weights, WARM improves efficiency compared to the traditional ensembling of predictions, while improving reliability under distribution shifts and robustness to preference inconsistencies. Our experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy RL fine-tuned with a single RM.
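The contrast between averaging weights and ensembling predictions can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the toy two-layer reward model, the hypothetical `fake_finetune` helper (small random perturbations standing in for separate fine-tuning runs from one shared initialization), and the choice of a uniform average.

```python
import numpy as np


def average_weights(param_dicts):
    """Uniformly average parameter dicts that share the same keys and shapes."""
    keys = param_dicts[0].keys()
    return {k: np.mean([p[k] for p in param_dicts], axis=0) for k in keys}


def reward(params, features):
    """Toy reward model: one tanh hidden layer followed by a linear scalar head."""
    hidden = np.tanh(features @ params["w1"] + params["b1"])
    return hidden @ params["w2"]


def fake_finetune(params, seed):
    """Hypothetical stand-in for a fine-tuning run: perturb the shared init slightly."""
    r = np.random.default_rng(seed)
    return {k: v + 0.01 * r.normal(size=v.shape) for k, v in params.items()}


rng = np.random.default_rng(0)

# Stand-in for a shared pre-trained initialization (shapes are arbitrary).
pretrained = {
    "w1": rng.normal(size=(4, 8)),
    "b1": np.zeros(8),
    "w2": rng.normal(size=8),
}

# Several reward models fine-tuned from the same starting point.
reward_models = [fake_finetune(pretrained, seed) for seed in range(1, 4)]

# Weight averaging: merge once, then score with a single forward pass.
warm_params = average_weights(reward_models)

features = rng.normal(size=(5, 4))
warm_scores = reward(warm_params, features)

# Prediction ensembling: one forward pass per member, then average the scores.
ensemble_scores = np.mean([reward(p, features) for p in reward_models], axis=0)
```

In this sketch the merged model needs one forward pass per input while the ensemble needs one per member, which is the efficiency argument made in the abstract. The two sets of scores stay close here only because the toy members sit near a common initialization; the sketch says nothing about robustness or reliability, which the abstract supports with experiments.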
