Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

Abstract

Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs), but it is often a complex and brittle process. In the classical RLHF framework, a reward model is first trained to represent human preferences, and an online reinforcement learning (RL) algorithm then uses it to optimize the LLM. A prominent issue with such methods is reward over-optimization, or reward hacking, where performance as measured by the learned proxy reward model increases while true quality plateaus or even deteriorates. Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization have emerged as alternatives to the classical RLHF pipeline by circumventing the reward modeling phase. However, although DAAs do not use a separate proxy reward model, they still commonly deteriorate from over-optimization. While the so-called reward hacking phenomenon is not well-defined for DAAs, we still uncover similar trends: at higher KL budgets, DAAs exhibit degradation patterns similar to those of their classical RLHF counterparts. In particular, we find that DAA methods deteriorate not only across a wide range of KL budgets but also often before even a single epoch of the dataset is completed. Through extensive empirical experimentation, this work formulates and formalizes the reward over-optimization or hacking problem for DAAs and explores its consequences across objectives, training regimes, and model scales.
