The Differences Between Direct Alignment Algorithms are a Blur

Abstract

Direct Alignment Algorithms (DAAs) simplify language model alignment by replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement Learning from Human Feedback (RLHF) with direct policy optimization. DAAs can be classified by their ranking losses (pairwise vs. pointwise), by the rewards used in those losses (e.g., likelihood ratios of policy and reference policy, or odds ratios), or by whether a Supervised Fine-Tuning (SFT) phase is required (two-stage vs. one-stage). We first show that one-stage methods underperform two-stage methods. To address this, we incorporate an explicit SFT phase and introduce the beta parameter, which controls the strength of preference optimization, into single-stage ORPO and ASFT. These modifications improve their performance in Alpaca Eval 2 by +3.46 (ORPO) and +8.27 (ASFT), matching two-stage methods like DPO. Further analysis reveals that the key factor is whether the approach uses pairwise or pointwise objectives, rather than the specific implicit reward or loss function. These results highlight the importance of careful evaluation to avoid premature claims of performance gains or overall superiority in alignment algorithms.
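The two classification axes named in the abstract (which implicit reward is used, and whether the ranking loss is pairwise or pointwise) can be illustrated with a small sketch. This is a minimal illustration under assumptions, not the authors' implementation: the function names are hypothetical, the inputs are scalar sequence log-probabilities (assumed length-normalized for the odds-ratio case), and the exact loss forms used by DPO, ORPO, and ASFT should be taken from the respective papers.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def likelihood_ratio_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """Assumed DPO-style implicit reward: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def odds_ratio_reward(logp_policy: float, beta: float) -> float:
    """Assumed odds-based implicit reward: beta * log(p / (1 - p)).

    Requires logp_policy < 0 so that p = exp(logp_policy) < 1.
    """
    p = math.exp(logp_policy)
    return beta * (logp_policy - math.log1p(-p))


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise ranking loss: depends only on the reward margin."""
    return -log_sigmoid(r_chosen - r_rejected)


def pointwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Pointwise ranking loss: each reward is scored against zero separately."""
    return -log_sigmoid(r_chosen) - log_sigmoid(-r_rejected)


if __name__ == "__main__":
    beta = 0.5  # hypothetical value, chosen only for illustration
    r_w = likelihood_ratio_reward(logp_policy=-1.0, logp_ref=-1.5, beta=beta)
    r_l = likelihood_ratio_reward(logp_policy=-2.0, logp_ref=-1.5, beta=beta)
    print("pairwise :", pairwise_loss(r_w, r_l))
    print("pointwise:", pointwise_loss(r_w, r_l))
```

In this sketch either reward function can be plugged into either loss, which mirrors the abstract's point that the pairwise vs. pointwise choice is a separate axis from the choice of implicit reward. Note that the pairwise loss is unchanged when both rewards are shifted by the same constant, while the pointwise loss is not.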
