
A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

October 17, 2024
Authors: Hui Yuan, Yifan Zeng, Yue Wu, Huazheng Wang, Mengdi Wang, Liu Leqi
cs.AI

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for language model (LM) alignment. At its core, RLHF uses a margin-based loss for preference optimization, specifying ideal LM behavior only by the difference between preferred and dispreferred responses. In this paper, we identify a common pitfall of margin-based methods -- the under-specification of ideal LM behavior on preferred and dispreferred responses individually, which leads to two unintended consequences as the margin increases: (1) The probability of dispreferred (e.g., unsafe) responses may increase, resulting in potential safety alignment failures. (2) The probability of preferred responses may decrease, even when those responses are ideal. We demystify the reasons behind these problematic behaviors: margin-based losses couple the change in the preferred probability to the gradient of the dispreferred one, and vice versa, often preventing the preferred probability from increasing while the dispreferred one decreases, and thus causing a synchronized increase or decrease in both probabilities. We term this effect, inherent in margin-based objectives, gradient entanglement. Formally, we derive conditions for general margin-based alignment objectives under which gradient entanglement becomes concerning: the inner product of the gradients of preferred and dispreferred log-probabilities is large relative to the individual gradient norms. We theoretically investigate why such inner products can be large when aligning language models and empirically validate our findings. Empirical implications of our framework extend to explaining important differences in the training dynamics of various preference optimization algorithms, and suggesting potential algorithm designs to mitigate the under-specification issue of margin-based methods and thereby improving language model alignment.
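The entanglement condition described above can be inspected numerically. Below is a minimal, hedged sketch of such a diagnostic: it uses a toy linear "policy", synthetic token sequences, and a simplified DPO-style margin loss with the reference-model term dropped, all of which are illustrative assumptions rather than the paper's exact setup (the helper names `sequence_logprob` and `flat_grad` and the value of `beta` are likewise hypothetical).

```python
# Hedged sketch: measuring gradient-entanglement diagnostics for a
# margin-based preference loss. Toy policy and data; not the paper's setup.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy "policy": a linear language model over a tiny vocabulary.
vocab, dim = 16, 8
policy = torch.nn.Linear(dim, vocab)

def sequence_logprob(features, tokens):
    """Sum of per-token log-probabilities under the toy policy."""
    logits = policy(features)                      # (T, vocab)
    logps = F.log_softmax(logits, dim=-1)
    return logps[torch.arange(tokens.numel()), tokens].sum()

def flat_grad(scalar):
    """Gradient of a scalar w.r.t. all policy parameters, flattened."""
    grads = torch.autograd.grad(scalar, list(policy.parameters()),
                                retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

# One synthetic preference pair (shared prompt, different continuations).
T = 5
feats = torch.randn(T, dim)
y_w = torch.randint(vocab, (T,))   # preferred tokens
y_l = torch.randint(vocab, (T,))   # dispreferred tokens

logp_w = sequence_logprob(feats, y_w)
logp_l = sequence_logprob(feats, y_l)

# DPO-style margin loss (reference-model term omitted for brevity); its
# parameter gradient is proportional to -(g_w - g_l).
beta = 0.1
loss = -F.logsigmoid(beta * (logp_w - logp_l))

g_w = flat_grad(logp_w)    # gradient of log-prob of preferred response
g_l = flat_grad(logp_l)    # gradient of log-prob of dispreferred response

# Entanglement diagnostic: how large is <g_w, g_l> relative to the norms?
inner = torch.dot(g_w, g_l)
cos = inner / (g_w.norm() * g_l.norm())
print(f"loss={loss.item():.4f}  <g_w,g_l>={inner.item():.4f}  "
      f"|g_w|={g_w.norm().item():.4f}  |g_l|={g_l.norm().item():.4f}  "
      f"cos={cos.item():.4f}")

# When the inner product is large relative to the individual gradient norms
# (cosine near 1), a step along g_w - g_l tends to move log p(y_w|x) and
# log p(y_l|x) in the same direction -- the synchronized increase or
# decrease the abstract describes.
```

Tracking this cosine over training batches is one plausible way to observe the coupled dynamics the abstract refers to; the choice of a summed sequence log-probability and the omitted reference model are simplifications made only to keep the sketch self-contained.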
