Understanding the performance gap between online and offline alignment algorithms
May 14, 2024
Authors: Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, Will Dabney
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, the rising popularity of offline alignment algorithms challenges the need for on-policy sampling in RLHF. Within the context of reward over-optimization, we start with an opening set of experiments that demonstrate the clear advantage of online methods over offline methods. This prompts us to investigate the causes of the performance discrepancy through a series of carefully designed experimental ablations. We show empirically that hypotheses such as offline data coverage and data quality by themselves cannot convincingly explain the performance difference. We also find that while offline algorithms train policies to become good at pairwise classification, they are worse at generation; meanwhile, policies trained by online algorithms are good at generation but worse at pairwise classification. This hints at a unique interplay between discriminative and generative capabilities, which is greatly impacted by the sampling process. Lastly, we observe that the performance discrepancy persists for both contrastive and non-contrastive loss functions, and appears not to be addressed by simply scaling up policy networks. Taken together, our study sheds light on the pivotal role of on-policy sampling in AI alignment, and hints at certain fundamental challenges of offline alignment algorithms.
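
To make the online/offline distinction concrete, the following is a minimal PyTorch sketch (not taken from the paper) of the two kinds of training signal being compared: an offline contrastive objective in the style of DPO, computed on a fixed preference dataset, versus an on-policy update in which completions are freshly sampled from the current policy and scored by a reward model. All function names, the beta value, and the simplified policy-gradient form are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F

def offline_contrastive_loss(policy_chosen_logps, policy_rejected_logps,
                             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # DPO-style contrastive loss on a fixed (offline) preference dataset.
    # Inputs are summed log-probabilities of the chosen/rejected completions
    # under the trained policy and a frozen reference policy.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the preferred completion above the rejected one.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

def online_policy_gradient_loss(policy_logps, rewards):
    # On-policy signal: completions are sampled from the current policy,
    # scored by a reward model, and reinforced in proportion to their reward.
    # (RLHF methods typically also add a KL penalty to a reference policy; omitted here.)
    advantages = rewards - rewards.mean()  # simple baseline for variance reduction
    return -(advantages.detach() * policy_logps).mean()

The structural difference the paper probes lies in where the data comes from: the offline loss never samples from the policy being trained, whereas the online loss requires fresh on-policy generations at every step.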