Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
June 5, 2024
Authors: Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, Scott Niekum
cs.AI
Abstract
Reinforcement Learning from Human Feedback (RLHF) has been crucial to the
recent success of Large Language Models (LLMs), however, it is often a complex
and brittle process. In the classical RLHF framework, a reward model is first
trained to represent human preferences, which is in turn used by an online
reinforcement learning (RL) algorithm to optimize the LLM. A prominent issue
with such methods is reward over-optimization or reward hacking,
where performance as measured by the learned proxy reward model increases, but
true quality plateaus or even deteriorates. Direct Alignment Algorithms (DAAs)
like Direct Preference Optimization have emerged as alternatives to the
classical RLHF pipeline by circumventing the reward modeling phase. However,
although DAAs do not use a separate proxy reward model, they still commonly
deteriorate from over-optimization. While the so-called reward hacking
phenomenon is not well-defined for DAAs, we still uncover similar trends: at
higher KL budgets, DAA algorithms exhibit similar degradation patterns to their
classic RLHF counterparts. In particular, we find that DAA methods deteriorate
not only across a wide range of KL budgets but also often before even a single
epoch of the dataset is completed. Through extensive empirical experimentation,
this work formulates and formalizes the reward over-optimization or hacking
problem for DAAs and explores its consequences across objectives, training
regimes, and model scales.
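To ground the terminology, here is a brief sketch in standard notation (the symbols below are conventional and are not taken from this page): in classical RLHF, the policy pi_theta is optimized against a learned proxy reward r_phi under a KL penalty toward a reference policy pi_ref, and the "KL budget" is the divergence from pi_ref that the optimized policy is allowed to spend; DAAs such as DPO skip the explicit reward model and fit the policy directly to preference pairs (x, y_w, y_l).

\[
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\!\left[ r_\phi(x, y) \right] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \right]
\]

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]

In both formulations, beta trades off reward (or preference) fit against divergence from pi_ref; a larger KL budget corresponds to weaker regularization, which is the regime where the abstract reports the strongest degradation for both DAAs and classical RLHF.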