Understanding the performance gap between online and offline alignment algorithms
May 14, 2024
Authors: Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, Will Dabney
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) is the canonical framework
for large language model alignment. However, the rising popularity of offline
alignment algorithms challenges the need for on-policy sampling in RLHF. Within
the context of reward over-optimization, we start with an opening set of
experiments that demonstrate the clear advantage of online methods over offline
methods. This prompts us to investigate the causes of the performance
discrepancy through a series of carefully designed experimental ablations. We
show empirically that hypotheses such as offline data coverage and data quality
cannot by themselves convincingly explain the performance difference. We also
find that while offline algorithms train policies that become good at pairwise
classification, those policies are worse at generation; meanwhile, the policies
trained by online algorithms are good at generation but worse at pairwise
classification. This hints at a unique interplay between discriminative and
generative capabilities, one that is greatly impacted by the sampling process.
Lastly, we observe that the performance discrepancy persists for both
contrastive and non-contrastive loss functions, and does not appear to be
addressed by simply scaling up policy networks. Taken together, our study sheds
light on the pivotal role of on-policy sampling in AI alignment, and hints at
certain fundamental challenges of offline alignment algorithms.
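To make the "contrastive loss" family mentioned above concrete, the following is a minimal sketch of a DPO-style offline contrastive objective on a single preference pair. It is an illustration of the general loss shape, not the paper's implementation; the function name, argument names, and the `beta` default are assumptions for this sketch. The key point it shows is that the loss depends only on log-probabilities of fixed (offline) chosen/rejected responses, with no on-policy sampling involved.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def contrastive_offline_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             beta=0.1):
    """DPO-style contrastive loss on one fixed preference pair.

    logp_* are the policy's log-probabilities of the chosen and rejected
    responses; ref_logp_* are the same quantities under a frozen
    reference policy. `beta` scales the implicit reward margin.
    (Names and default are illustrative, not from the paper.)
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the chosen response over the rejected one.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Logistic loss: minimized by pushing the margin up, i.e. by
    # improving pairwise classification of the offline pair.
    return -math.log(sigmoid(margin))
```

With a zero margin the loss is log 2, and it decreases as the policy assigns relatively more probability to the chosen response, which is exactly the "good at pairwise classification" behavior the abstract attributes to offline-trained policies.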