f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
February 5, 2026
Authors: Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song
cs.AI
Abstract
Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning algorithms, and f-Hybrid Alignment Loss (f-HAL), a class of hybrid on-/off-policy objectives, for general LLM alignment based on the variational representation of f-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (math reasoning) and PA (safety alignment) tasks, demonstrating superior performance and flexibility compared to current methods.
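For context, the variational (Fenchel-dual) representation of an f-divergence on which such estimators are typically built, due to Nguyen, Wainwright, and Jordan, is the standard lower bound below. The abstract does not state the authors' exact objective, so this is background rather than the paper's formulation; here f^* denotes the convex conjugate of the generator f, and T is a variational (critic) function, presumably parameterized through the language-model policy in this setting.

D_f(P \,\|\, Q) \;=\; \sup_{T}\; \Big( \mathbb{E}_{x \sim P}\big[T(x)\big] \;-\; \mathbb{E}_{x \sim Q}\big[f^{*}(T(x))\big] \Big)

Different choices of the convex generator f (e.g., those corresponding to KL, Jensen-Shannon, or chi-squared divergence) instantiate different objectives from the same template, which is the sense in which the proposed family offers flexibility across alignment settings.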