The Differences Between Direct Alignment Algorithms are a Blur
February 3, 2025
Authors: Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov
cs.AI
Abstract
Direct Alignment Algorithms (DAAs) simplify language model alignment by
replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement
Learning from Human Feedback (RLHF) with direct policy optimization. DAAs can
be classified by their ranking losses (pairwise vs. pointwise), by the rewards
used in those losses (e.g., likelihood ratios of policy and reference policy,
or odds ratios), or by whether a Supervised Fine-Tuning (SFT) phase is required
(two-stage vs. one-stage). We first show that one-stage methods underperform
two-stage methods. To address this, we incorporate an explicit SFT phase and
introduce the beta parameter, controlling the strength of preference
optimization, into single-stage ORPO and ASFT. These modifications improve
their performance in Alpaca Eval 2 by +3.46 (ORPO) and +8.27 (ASFT),
matching two-stage methods like DPO. Further analysis reveals that the key
factor is whether the approach uses pairwise or pointwise objectives, rather
than the specific implicit reward or loss function. These results highlight the
importance of careful evaluation to avoid premature claims of performance gains
or overall superiority in alignment algorithms.
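
To make the pairwise vs. pointwise distinction and the role of the beta parameter concrete, below is a minimal PyTorch sketch. It is not the authors' implementation: the function names, the toy log-probabilities, and the particular pointwise form are illustrative assumptions, not the exact ORPO or ASFT losses. It contrasts a DPO-style pairwise objective, which scores the margin between a chosen and a rejected response, with a pointwise objective that scores each response independently; in both, beta scales the implicit reward, i.e., how strongly the policy is pushed away from the frozen reference.

import torch
import torch.nn.functional as F

def dpo_pairwise_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Pairwise objective (DPO-style): the implicit reward is the
    # log-likelihood ratio between the policy and the frozen reference,
    # and the loss is taken on the chosen-vs-rejected margin.
    reward_w = beta * (pol_logp_w - ref_logp_w)
    reward_l = beta * (pol_logp_l - ref_logp_l)
    return -F.logsigmoid(reward_w - reward_l).mean()

def pointwise_beta_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Pointwise objective (illustrative stand-in, not the exact ASFT/ORPO
    # loss): each response is scored on its own, with the chosen response
    # pushed up and the rejected response pushed down independently.
    reward_w = beta * (pol_logp_w - ref_logp_w)
    reward_l = beta * (pol_logp_l - ref_logp_l)
    return -(F.logsigmoid(reward_w) + F.logsigmoid(-reward_l)).mean()

# Toy per-sequence log-probabilities for a batch of two preference pairs
# (hypothetical numbers, only to show the call signature).
pol_w = torch.tensor([-12.3, -9.8]);  pol_l = torch.tensor([-14.1, -11.0])
ref_w = torch.tensor([-12.9, -10.2]); ref_l = torch.tensor([-13.5, -10.7])
print(dpo_pairwise_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1))
print(pointwise_beta_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1))

In a two-stage pipeline, losses of this kind are applied after an explicit SFT phase; the paper's modification to ORPO and ASFT amounts to adding such a phase and exposing beta as a tunable knob, with the pairwise-vs-pointwise choice, rather than the specific implicit reward, accounting for the remaining differences.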