Hölder Policy Optimisation

Abstract

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages within a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence, and relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: some fixed aggregations frequently suffer training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework that unifies token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, the framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p over the course of training. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, a 7.2% relative gain over standard GRPO, and a 93.8% success rate on ALFWorld.
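To make the role of p concrete, the following is a minimal sketch of the standard Hölder (power) mean applied to a sequence of token probabilities. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact objective, so the function names `holder_mean` and `linear_p_schedule`, the sample probabilities, and the linear form and direction of the schedule are all hypothetical.

```python
import math


def holder_mean(probs, p):
    """Standard Hölder (power) mean of strictly positive values.

    p = 1 gives the arithmetic mean, p -> 0 the geometric mean,
    and p = -1 the harmonic mean.
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    if any(x <= 0.0 for x in probs):
        raise ValueError("all values must be strictly positive")
    n = len(probs)
    if abs(p) < 1e-12:
        # Limit p -> 0: geometric mean, computed in log space.
        return math.exp(sum(math.log(x) for x in probs) / n)
    return (sum(x ** p for x in probs) / n) ** (1.0 / p)


def linear_p_schedule(step, total_steps, p_start, p_end):
    """Hypothetical schedule: linearly interpolate p over training.

    The abstract only says p is scheduled progressively; the linear
    form and the choice of endpoints are assumptions for illustration.
    """
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


# Made-up per-token probabilities for one sampled sequence.
token_probs = [0.9, 0.8, 0.05, 0.7]
for p in (-1.0, 0.0, 1.0):
    print(p, holder_mean(token_probs, p))
```

The power mean is non-decreasing in p, so a single low-probability token (0.05 here) pulls the aggregate down most strongly at small p, while larger p weights the high-probability tokens more heavily. How this aggregate enters the policy-gradient objective is specified in the paper itself, not in this sketch.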