Hölder Policy Optimisation

May 12, 2026
Authors: Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong, Chenyang Le, Zhaokai Wang, Jiachen Zhu, Lingyu Yang, Jianghao Lin, Weinan Zhang, Jun Wang
cs.AI

Abstract

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework that unifies token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p over the course of training. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, a 7.2% relative gain over standard GRPO, and reaches a 93.8% success rate on ALFWorld.
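To make the role of p concrete, the following is a minimal sketch of the standard Hölder (power) mean applied to a list of token-level probabilities, together with a simple schedule for p. It is illustrative only: the abstract does not state exactly which quantities HölderPO aggregates or how its annealing algorithm moves p, so the function names `holder_mean` and `linear_p_schedule`, the example probabilities, and the linear form of the schedule are all assumptions rather than the authors' method.

```python
import math


def holder_mean(values, p):
    """Hölder (power) mean M_p of strictly positive values.

    M_p = (mean(v ** p)) ** (1 / p) for p != 0; the p -> 0 limit is the
    geometric mean. p = 1 gives the arithmetic mean, p = -1 the harmonic mean.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    n = len(values)
    if p == 0:
        # Geometric mean, computed in log space.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def linear_p_schedule(step, total_steps, p_start, p_end):
    """Hypothetical schedule: interpolate p linearly from p_start to p_end.

    The direction and shape of the paper's annealing are not given in the
    abstract; this only shows what scheduling p over training could look like.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


# Assumed example: per-token probabilities for one sampled sequence.
token_probs = [0.9, 0.8, 0.1, 0.7]
for p in (-1.0, 0.0, 1.0, 4.0):
    print(f"p={p:+.1f}  mean={holder_mean(token_probs, p):.4f}")
```

The power mean is non-decreasing in p, so a small p lets the lowest-probability token dominate the aggregate, while a large p shifts weight toward the highest-probability tokens. A single exponent therefore interpolates between familiar fixed aggregations such as the geometric and arithmetic means.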