Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment
August 11, 2025
Authors: Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu
cs.AI
Abstract
Alignment methodologies have emerged as a critical pathway for improving the
alignment of language models. While supervised fine-tuning (SFT)
accelerates convergence through direct token-level loss intervention, its
efficacy is constrained by its reliance on offline policy trajectories. In contrast,
reinforcement learning (RL) facilitates exploratory policy optimization, but
suffers from low sample efficiency and stringent dependency on high-quality
base models. To address these dual challenges, we propose GRAO (Group Relative
Alignment Optimization), a unified framework that synergizes the respective
strengths of SFT and RL through three key innovations: 1) a multi-sample
generation strategy enabling comparative quality assessment via reward
feedback; 2) a novel Group Direct Alignment Loss formulation leveraging
intra-group relative advantage weighting; and 3) reference-aware parameter updates
guided by pairwise preference dynamics. Our theoretical analysis establishes
GRAO's convergence guarantees and sample efficiency advantages over
conventional approaches. Comprehensive evaluations across complex human
alignment tasks demonstrate GRAO's superior performance, achieving
57.70%, 17.65%, 7.95%, and 5.18% relative improvements over SFT, DPO, PPO, and
GRPO baselines, respectively. This work provides both a theoretically grounded
alignment framework and empirical evidence for efficient capability evolution
in language models.
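
The abstract names multi-sample generation and intra-group relative advantage weighting but does not give formulas. As a rough illustration only, the sketch below scores a group of sampled responses and normalizes the rewards within the group (subtract the group mean, divide by the group standard deviation). This normalization is an assumption borrowed from common group-relative methods, not GRAO's actual loss; the function name `group_relative_advantages` and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of sampled responses.

    Assumption: mean/std normalization, as used by common group-relative
    methods. The abstract does not specify GRAO's exact weighting.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four responses sampled from one prompt
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
# Responses above the group mean get positive weight, those below get negative
```

Under this assumption, a response is weighted by how it compares with the other samples for the same prompt rather than by its absolute reward, which is what makes the assessment comparative.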