
Self-Improving Robust Preference Optimization

June 3, 2024
Authors: Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, Mohammad Gheshlaghi Azar
cs.AI

Abstract

Both online and offline RLHF methods, such as PPO and DPO, have been extremely successful in aligning AI with human preferences. Despite their success, existing methods suffer from a fundamental problem: their optimal solution is highly task-dependent (i.e., not robust to out-of-distribution (OOD) tasks). Here we address this challenge by proposing Self-Improving Robust Preference Optimization (SRPO), a practical and mathematically principled offline RLHF framework that is completely robust to changes in the task. The key idea of SRPO is to cast the problem of learning from human preferences as a self-improvement process, which can be mathematically expressed as a min-max objective that jointly optimizes the self-improvement policy and the generative policy in an adversarial fashion. The solution to this optimization problem is independent of the training task and is therefore robust to changes in it. We then show that this objective can be re-expressed as a non-adversarial offline loss that can be optimized at scale with standard supervised optimization techniques, without any need for a reward model or online inference. We demonstrate the effectiveness of SRPO in terms of AI Win-Rate (WR) against human (GOLD) completions. In particular, when evaluated on the OOD XSUM dataset, SRPO outperforms the celebrated DPO by a clear margin of 15% after 5 self-revisions, achieving a WR of 90%.
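
To make the self-improvement framing concrete, below is a minimal, runnable sketch of the inference-time self-revision loop the abstract refers to ("after 5 self-revisions"): a policy first produces a draft and is then repeatedly conditioned on its own previous output to revise it. This is an illustration under assumptions, not the paper's implementation: the generic policy interface, the names self_revise and toy_policy, and the parameter num_revisions are all hypothetical, and the SRPO training objective (the min-max formulation and its non-adversarial offline form) is not reproduced here.

```python
# Hypothetical sketch of an iterative self-revision loop: the trained policy
# produces an initial completion, then repeatedly revises its own previous
# output. Names and the policy interface are illustrative assumptions.
from typing import Callable

# A policy maps (task prompt, previous completion) -> revised completion.
# An empty previous completion requests an initial draft.
Policy = Callable[[str, str], str]


def self_revise(policy: Policy, prompt: str, num_revisions: int = 5) -> list[str]:
    """Generate a draft, then apply `num_revisions` self-revisions.

    Returns the full revision trajectory; the last element is the final output.
    """
    completions = [policy(prompt, "")]  # initial draft
    for _ in range(num_revisions):
        # Condition the policy on the prompt and its own latest completion.
        completions.append(policy(prompt, completions[-1]))
    return completions


if __name__ == "__main__":
    # Toy stand-in policy so the sketch runs end to end; a real policy would be
    # an LLM conditioned on the prompt and the completion to be improved.
    def toy_policy(prompt: str, previous: str) -> str:
        return (previous + " [revised]") if previous else f"Draft answer to: {prompt}"

    for step, text in enumerate(self_revise(toy_policy, "Summarize the article.")):
        print(step, text)
```

The loop only illustrates how repeated self-revision is applied at evaluation time; how the revising policy is trained offline from preference data is the subject of the paper itself.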

