SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization
February 2, 2026
Authors: Maksim Afanasyev, Illarion Iov
cs.AI
Abstract
Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs). Recent approaches have streamlined the alignment pipeline by deriving implicit reward functions, yet they often suffer from a critical objective mismatch: optimizing the relative margin between chosen and rejected responses does not guarantee that the absolute likelihood of the chosen response is preserved. This can lead to "unlearning", where the model degrades the probability of high-quality outputs to satisfy margin constraints, and to "formatting collapse" caused by over-penalizing rejected sequences. In this work, we introduce SLIME (Stabilized Likelihood Implicit Margin Enforcement), a reference-free alignment objective designed to decouple preference learning from generation quality. SLIME incorporates a three-pronged objective: (1) an anchoring term that maximizes the likelihood of preferred responses; (2) a stabilizing penalty that prevents the probabilities of rejected tokens from collapsing to zero; and (3) a dual-margin mechanism that combines hard and soft constraints for precise boundary shaping. Our results demonstrate that SLIME outperforms state-of-the-art baselines while maintaining higher generation stability.
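The abstract does not give the exact loss, so the following is a minimal sketch of how a three-term objective of this shape could be composed in PyTorch. Everything here is an assumption: the function name slime_loss, the hyperparameters (soft_margin, hard_margin, anchor_weight, stab_weight, floor_logp), and the use of length-normalized sequence log-probabilities are illustrative choices, not details from the paper.

```python
import torch
import torch.nn.functional as F

def slime_loss(logp_chosen: torch.Tensor,
               logp_rejected: torch.Tensor,
               soft_margin: float = 1.0,
               hard_margin: float = 3.0,
               anchor_weight: float = 1.0,
               stab_weight: float = 0.1,
               floor_logp: float = -8.0) -> torch.Tensor:
    """Hypothetical SLIME-style objective (all names/weights are assumptions).

    logp_chosen / logp_rejected: length-normalized sequence log-probabilities
    under the policy, shape (batch,). Reference-free: no reference-model term.
    """
    # (1) Anchoring term: directly maximize the likelihood of the preferred
    #     response, so margin optimization cannot silently erode it.
    anchor = -logp_chosen

    # (2) Stabilizing penalty: keep rejected log-probs from collapsing toward
    #     -inf by penalizing only the portion that falls below a floor.
    stabilize = F.relu(floor_logp - logp_rejected)

    # (3) Dual-margin mechanism:
    #     soft constraint -- a smooth logistic loss on the chosen/rejected gap;
    #     hard constraint -- a hinge that is active only inside the hard margin.
    margin = logp_chosen - logp_rejected
    soft = -F.logsigmoid(margin - soft_margin)
    hard = F.relu(hard_margin - margin)

    return (soft + hard
            + anchor_weight * anchor
            + stab_weight * stabilize).mean()
```

In this reading, the anchor and stabilizer bound the absolute likelihoods from below while the two margin terms shape only the relative boundary, which is one plausible way to realize the decoupling the abstract describes.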