Hölder Policy Optimisation
May 12, 2026
Authors: Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong, Chenyang Le, Zhaokai Wang, Jiachen Zhu, Lingyu Yang, Jianghao Lin, Weinan Zhang, Jun Wang
cs.AI
Abstract
Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework that unifies token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, a substantial 7.2% relative gain over standard GRPO, and secures a 93.8% success rate on ALFWorld.
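To make the aggregation step concrete, the following is a minimal sketch of a Hölder (power) mean applied to a sequence of token-level probabilities. It assumes the standard unweighted definition M_p(x) = ((1/n) Σ x_i^p)^(1/p), with the geometric mean as the p → 0 limit. The names `holder_mean` and `linear_p_schedule` are hypothetical, and the linear schedule is only a placeholder: the abstract does not specify the shape or direction of the annealing, nor the exact objective into which the aggregate is inserted.

```python
import math


def holder_mean(probs, p):
    """Unweighted Hölder (power) mean of strictly positive values.

    p = 1 gives the arithmetic mean, p = -1 the harmonic mean,
    and p = 0 is treated as the geometric-mean limit.
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    if any(x <= 0.0 for x in probs):
        raise ValueError("probs must be strictly positive")
    n = len(probs)
    if p == 0:
        # Limit of the power mean as p -> 0: exp of the mean log-probability.
        return math.exp(sum(math.log(x) for x in probs) / n)
    return (sum(x ** p for x in probs) / n) ** (1.0 / p)


def linear_p_schedule(step, total_steps, p_start, p_end):
    """Hypothetical schedule: linearly interpolate p over training.

    This is an illustrative assumption, not the schedule used by HölderPO.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


if __name__ == "__main__":
    # Toy token probabilities for one sampled sequence (made-up values).
    token_probs = [0.9, 0.8, 0.05, 0.7]
    for p in (-1.0, 0.0, 1.0, 4.0):
        print(f"p={p:+.1f}  aggregate={holder_mean(token_probs, p):.4f}")
```

By the power mean inequality the aggregate is non-decreasing in p, so a larger p lets high-probability tokens dominate the result while a smaller p pulls it towards the lowest-probability tokens. This is the knob the abstract describes as trading gradient concentration against variance control.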