Understanding the performance gap between online and offline alignment algorithms
May 14, 2024
Authors: Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, Will Dabney
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, the rising popularity of offline alignment algorithms challenges the need for on-policy sampling in RLHF. Within the context of reward over-optimization, we start with an opening set of experiments that demonstrate the clear advantage of online methods over offline methods. This prompts us to investigate the causes of the performance discrepancy through a series of carefully designed experimental ablations. We show empirically that hypotheses such as offline data coverage and data quality by themselves cannot convincingly explain the performance difference. We also find that while offline algorithms train policies to become good at pairwise classification, they are worse at generation; meanwhile, policies trained by online algorithms are good at generation but worse at pairwise classification. This hints at a unique interplay between discriminative and generative capabilities, which is greatly impacted by the sampling process. Lastly, we observe that the performance discrepancy persists for both contrastive and non-contrastive loss functions, and appears not to be addressed by simply scaling up policy networks. Taken together, our study sheds light on the pivotal role of on-policy sampling in AI alignment, and hints at certain fundamental challenges of offline alignment algorithms.
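
To make the online/offline distinction concrete, the following is a minimal PyTorch sketch (not taken from the paper) of the two kinds of training signal being compared: an offline contrastive objective in the style of DPO, computed on a fixed preference dataset, versus an on-policy update in which completions are freshly sampled from the current policy and scored by a reward model. All function names, the beta value, and the simplified policy-gradient form are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F

def offline_contrastive_loss(policy_chosen_logps, policy_rejected_logps,
                             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # DPO-style contrastive loss on a fixed (offline) preference dataset.
    # Inputs are summed log-probabilities of the chosen/rejected completions
    # under the trained policy and a frozen reference policy.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the preferred completion above the rejected one.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

def online_policy_gradient_loss(policy_logps, rewards):
    # On-policy signal: completions are sampled from the current policy,
    # scored by a reward model, and reinforced in proportion to their reward.
    # (RLHF methods typically also add a KL penalty to a reference policy; omitted here.)
    advantages = rewards - rewards.mean()  # simple baseline for variance reduction
    return -(advantages.detach() * policy_logps).mean()

The structural difference the paper probes lies in where the data comes from: the offline loss never samples from the policy being trained, whereas the online loss requires fresh on-policy generations at every step.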