WARP: On the Benefits of Weight Averaged Rewarded Policies

Abstract

Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to receive high rewards from a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates KL regularization, which forces the policy to remain close to its supervised fine-tuned initialization but hinders reward optimization. To tackle this trade-off between KL and reward, this paper introduces a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new, enhanced one. Third, it linearly interpolates between this merged model and the initialization to recover features from pre-training. The procedure is then applied iteratively, with each iteration's final model serving as an advanced initialization for the next. This progressively refines the KL-reward Pareto front and achieves superior rewards at fixed KL. Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
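The three weight-space merges named in the abstract can be illustrated on toy weight vectors. The following is a minimal sketch, not the authors' implementation: the function names, the parameters mu, lam and eta, and the use of flat Python lists in place of model parameters are all hypothetical. It also assumes that the spherical interpolation acts on each policy's difference from the shared initialization, which the abstract does not state.

```python
import math


def ema_update(anchor, policy, mu=0.01):
    """Stage 1: move the KL anchor a fraction `mu` toward the current policy."""
    return [(1.0 - mu) * a + mu * p for a, p in zip(anchor, policy)]


def slerp(init, theta_a, theta_b, lam=0.5):
    """Stage 2: spherically interpolate two independently fine-tuned policies.

    Assumption (not stated in the abstract): the interpolation is applied to
    the differences from the shared initialization `init`.
    """
    delta_a = [x - i for x, i in zip(theta_a, init)]
    delta_b = [x - i for x, i in zip(theta_b, init)]
    norm_a = math.sqrt(sum(d * d for d in delta_a))
    norm_b = math.sqrt(sum(d * d for d in delta_b))
    # Fall back to linear coefficients when the angle is degenerate.
    coef_a, coef_b = 1.0 - lam, lam
    if norm_a > 0.0 and norm_b > 0.0:
        dot = sum(x * y for x, y in zip(delta_a, delta_b))
        cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
        omega = math.acos(cos_omega)  # angle between the two differences
        sin_omega = math.sin(omega)
        if abs(sin_omega) > 1e-8:
            coef_a = math.sin((1.0 - lam) * omega) / sin_omega
            coef_b = math.sin(lam * omega) / sin_omega
    return [i + coef_a * da + coef_b * db
            for i, da, db in zip(init, delta_a, delta_b)]


def interpolate_towards_init(init, merged, eta=0.5):
    """Stage 3: linear interpolation; eta=0 returns init, eta=1 returns merged."""
    return [(1.0 - eta) * i + eta * m for i, m in zip(init, merged)]


# Toy usage: two "policies" fine-tuned from the same initialization.
init = [0.0, 0.0]
policy_a = [1.0, 0.0]
policy_b = [0.0, 1.0]
merged = slerp(init, policy_a, policy_b, lam=0.5)
next_init = interpolate_towards_init(init, merged, eta=0.3)
```

In this sketch, next_init would play the role of the advanced initialization for the following iteration, and eta controls how far the result moves away from the original initialization.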
