
f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

February 5, 2026
Authors: Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song
cs.AI

Abstract

Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning algorithms, and f-Hybrid Alignment Loss (f-HAL), a class of hybrid on-/off-policy objectives, for general LLM alignment based on the variational representation of f-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (math reasoning) and PA (safety alignment) tasks, demonstrating superior performance and flexibility compared to current methods.
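
For background, the variational representation the abstract refers to is the standard Fenchel-dual form of an f-divergence (as used, e.g., in f-divergence estimation and f-GAN-style objectives); the sketch below states it in general terms and is context only, not the paper's specific loss. Here $f^{*}$ denotes the convex conjugate of $f$ and $T$ ranges over test (critic) functions:

$$
D_f(P \,\|\, Q) \;=\; \sup_{T} \Big( \mathbb{E}_{x \sim P}\big[T(x)\big] \;-\; \mathbb{E}_{x \sim Q}\big[f^{*}\big(T(x)\big)\big] \Big)
$$

Choosing $f(t) = t \log t$ recovers the KL divergence, so different choices of $f$ give different members of the objective family the abstract describes.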