Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment
August 11, 2025
Authors: Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu
cs.AI
Abstract
Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by offline policy trajectories. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency and a stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) a multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) a novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO's convergence guarantees and sample-efficiency advantages over conventional approaches. Comprehensive evaluations across complex human-alignment tasks demonstrate GRAO's superior performance, achieving relative improvements of 57.70%, 17.65%, 7.95%, and 5.18% over SFT, DPO, PPO, and GRPO baselines, respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models.
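
To make the intra-group weighting idea concrete, the sketch below illustrates group-relative advantage weighting in the style the abstract describes (and that GRPO popularized): several responses are sampled per prompt, scored by a reward model, normalized within their group, and used to weight a token-level alignment objective. This is a minimal illustrative sketch; the function names, tensor shapes, and the exact loss form are assumptions, not the paper's definition of the Group Direct Alignment Loss.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize reward-model scores within each group of sampled responses.

    rewards: (num_prompts, group_size) scalar rewards for each sampled completion.
    Returns advantages of the same shape (zero mean, unit variance per group).
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def group_weighted_alignment_loss(token_logps: torch.Tensor,
                                  advantages: torch.Tensor,
                                  response_mask: torch.Tensor) -> torch.Tensor:
    """Advantage-weighted token-level log-likelihood objective (illustrative).

    token_logps:   (num_prompts, group_size, seq_len) log-probs of the sampled
                   tokens under the current policy.
    advantages:    (num_prompts, group_size) group-relative advantages.
    response_mask: (num_prompts, group_size, seq_len) 1 for response tokens, 0 for padding.
    """
    # Length-normalized log-likelihood of each sampled response.
    per_sample_logp = (token_logps * response_mask).sum(-1) / response_mask.sum(-1).clamp(min=1)
    # Up-weight samples that beat their group average, down-weight those that lag behind.
    return -(advantages * per_sample_logp).mean()
```

The reference-aware update described in innovation 3) would additionally constrain such an objective against a frozen reference policy (for example via a log-ratio or KL term), which is omitted from this sketch.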