Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment
August 11, 2025
Authors: Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu
cs.AI
Abstract
Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by offline policy trajectories. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency and a stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) a multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) a novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO's convergence guarantees and sample-efficiency advantages over conventional approaches. Comprehensive evaluations across complex human-alignment tasks demonstrate GRAO's superior performance, achieving relative improvements of 57.70%, 17.65%, 7.95%, and 5.18% over SFT, DPO, PPO, and GRPO baselines, respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models.
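
To make the intra-group weighting idea concrete, the sketch below illustrates group-relative advantage weighting in the style the abstract describes (and that GRPO popularized): several responses are sampled per prompt, scored by a reward model, normalized within their group, and used to weight a token-level alignment objective. This is a minimal illustrative sketch; the function names, tensor shapes, and the exact loss form are assumptions, not the paper's definition of the Group Direct Alignment Loss.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize reward-model scores within each group of sampled responses.

    rewards: (num_prompts, group_size) scalar rewards for each sampled completion.
    Returns advantages of the same shape (zero mean, unit variance per group).
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def group_weighted_alignment_loss(token_logps: torch.Tensor,
                                  advantages: torch.Tensor,
                                  response_mask: torch.Tensor) -> torch.Tensor:
    """Advantage-weighted token-level log-likelihood objective (illustrative).

    token_logps:   (num_prompts, group_size, seq_len) log-probs of the sampled
                   tokens under the current policy.
    advantages:    (num_prompts, group_size) group-relative advantages.
    response_mask: (num_prompts, group_size, seq_len) 1 for response tokens, 0 for padding.
    """
    # Length-normalized log-likelihood of each sampled response.
    per_sample_logp = (token_logps * response_mask).sum(-1) / response_mask.sum(-1).clamp(min=1)
    # Up-weight samples that beat their group average, down-weight those that lag behind.
    return -(advantages * per_sample_logp).mean()
```

The reference-aware update described in innovation 3) would additionally constrain such an objective against a frozen reference policy (for example via a log-ratio or KL term), which is omitted from this sketch.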