
SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization

February 2, 2026
作者: Maksim Afanasyev, Illarion Iov
cs.AI

Abstract

Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs). Recent approaches have streamlined the alignment process by deriving implicit reward functions, yet they often suffer from a critical objective mismatch: optimizing the relative margin between chosen and rejected responses does not guarantee preservation of the chosen response's absolute likelihood. This can lead to "unlearning", where the model degrades the probability of high-quality outputs to satisfy margin constraints, and to "formatting collapse" caused by over-penalization of rejected sequences. In this work, we introduce SLIME (Stabilized Likelihood Implicit Margin Enforcement), a reference-free alignment objective designed to decouple preference learning from generation quality. SLIME incorporates a three-pronged objective: (1) an anchoring term that maximizes the likelihood of preferred responses; (2) a stabilizing penalty that prevents the probabilities of rejected tokens from collapsing to zero; and (3) a dual-margin mechanism that combines hard and soft constraints for precise boundary shaping. Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines while maintaining higher generation stability.
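
The abstract does not give SLIME's exact formulation, so the following is only a minimal PyTorch sketch of how the three described components (anchoring term, stabilizing penalty, dual hard/soft margin) might be combined in a reference-free loss. The function name, hyperparameters (hard_margin, soft_margin_scale, floor_logp, the weights), and the specific hinge/logistic forms are illustrative assumptions, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F


def slime_style_loss(chosen_logps: torch.Tensor,
                     rejected_logps: torch.Tensor,
                     hard_margin: float = 1.0,
                     soft_margin_scale: float = 0.1,
                     anchor_weight: float = 1.0,
                     stabilizer_weight: float = 0.1,
                     floor_logp: float = -20.0) -> torch.Tensor:
    """Hypothetical SLIME-style objective over per-response log-probabilities.

    chosen_logps / rejected_logps: summed (or length-normalized) log-probabilities
    of the chosen and rejected responses under the policy being trained
    (no reference model is used, matching the reference-free setting).
    """
    # (1) Anchoring term: directly push up the likelihood of preferred
    #     responses, independent of the margin, to counteract "unlearning".
    anchor = -chosen_logps.mean()

    # (2) Stabilizing penalty: keep rejected log-probs above a floor so their
    #     probabilities do not collapse to zero ("formatting collapse").
    stabilizer = F.relu(floor_logp - rejected_logps).mean()

    # (3) Dual-margin mechanism: a hard hinge constraint on the chosen-minus-
    #     rejected gap plus a soft logistic term for smoother boundary shaping.
    margin = chosen_logps - rejected_logps
    hard_term = F.relu(hard_margin - margin).mean()
    soft_term = -F.logsigmoid(soft_margin_scale * margin).mean()

    return anchor_weight * anchor + stabilizer_weight * stabilizer + hard_term + soft_term
```

In this sketch the anchoring and stabilizing terms act on absolute likelihoods while only the dual-margin terms act on the relative gap, which is one plausible way to realize the decoupling of preference learning from generation quality that the abstract describes.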