

RePO: ReLU-based Preference Optimization

March 10, 2025
Authors: Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang
cs.AI

Abstract

Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with a single hyperparameter beta, subsequent methods like SimPO reintroduce complexity through dual parameters (beta, gamma). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates beta via two advances: (1) retaining SimPO's reference-free margins but removing beta through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case (beta → ∞), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models while requiring only one hyperparameter to tune.
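
The abstract pins down the shape of the objective: SimPO-style length-normalized log-probabilities act as reference-free implicit rewards, and the logistic weighting is replaced by a ReLU hinge on the reward margin, leaving the target margin gamma as the only hyperparameter. The sketch below illustrates that reading in PyTorch; it is an assumption-based illustration rather than the authors' released implementation, and the function names, batch averaging, and default gamma are hypothetical.

```python
import torch
import torch.nn.functional as F


def sequence_avg_logprob(logits: torch.Tensor,
                         labels: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
    """Length-normalized log-probability of a response (SimPO-style implicit reward).

    logits: (batch, seq_len, vocab) policy logits
    labels: (batch, seq_len) target token ids
    mask:   (batch, seq_len) 1.0 for response tokens, 0.0 for prompt/padding
    """
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)


def repo_loss(chosen_avg_logp: torch.Tensor,
              rejected_avg_logp: torch.Tensor,
              gamma: float = 1.0) -> torch.Tensor:
    """ReLU-based max-margin preference loss (hypothetical sketch).

    Pairs whose reference-free margin already exceeds gamma incur zero loss and
    zero gradient, which is how trivial pairs are filtered out; the remaining
    pairs are pushed until the chosen response outscores the rejected one by at
    least gamma.
    """
    margin = chosen_avg_logp - rejected_avg_logp
    return F.relu(gamma - margin).mean()


# Illustrative usage inside a training step (tensor names are placeholders):
# chosen   = sequence_avg_logprob(policy_logits_chosen,   labels_chosen,   mask_chosen)
# rejected = sequence_avg_logprob(policy_logits_rejected, labels_rejected, mask_rejected)
# loss     = repo_loss(chosen, rejected, gamma=1.0)
```

Because the hinge is flat once the margin clears gamma, the per-pair gradient weight is effectively a 0/1 indicator, matching the abstract's characterization of RePO as the beta → ∞ limit in which SimPO's logistic weighting collapses to binary thresholding.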

