MAPO: Mixed Advantage Policy Optimization
September 23, 2025
Authors: Wenke Huang, Quan Zhang, Yiyang Fang, Jian Liang, Xuankun Rong, Huanjin Yao, Guancheng Wan, Ke Liang, Wenwen He, Mingjun Li, Leszek Rutkowski, Mang Ye, Bo Du, Dacheng Tao
cs.AI
Abstract
Recent advances in reinforcement learning for foundation models, such as
Group Relative Policy Optimization (GRPO), have significantly improved
performance on reasoning tasks. Notably, the advantage function serves as the
central mechanism in GRPO for ranking trajectory importance. However, existing
approaches suffer from both advantage reversion and advantage mirror problems,
which hinder reasonable advantage allocation across different query samples. In
this work, we propose a simple yet effective GRPO strategy, Mixed Advantage
Policy Optimization (MAPO). We observe that trajectories exhibit varying
degrees of certainty and propose the advantage percent deviation for samples
with high-certainty trajectories. Furthermore, we dynamically reweight the
advantage function according to trajectory certainty, thereby adaptively
configuring the advantage function to account for sample-specific
characteristics. Comparisons with related state-of-the-art methods, along with
ablation studies on different advantage variants, validate the effectiveness of
our approach.
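
The abstract describes the method only at a high level. As a rough illustration of the general idea, the sketch below pairs the standard group-relative GRPO advantage (reward z-scored within its rollout group) with a hypothetical percent-deviation variant and a certainty-driven mixing weight. The percent-deviation formula, the certainty proxy, and the function names are assumptions for illustration only, not the paper's actual definitions.

```python
# Illustrative sketch only: the group-relative z-score is the standard GRPO
# advantage; the percent-deviation form and the certainty proxy below are
# hypothetical stand-ins for the quantities named in the abstract.
import numpy as np


def grpo_advantage(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard GRPO advantage: z-score of each rollout reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def percent_deviation_advantage(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Hypothetical 'advantage percent deviation': deviation relative to the group mean."""
    return (rewards - rewards.mean()) / (np.abs(rewards.mean()) + eps)


def mixed_advantage(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Blend the two advantage forms with a weight driven by trajectory certainty.

    Certainty is proxied here by how concentrated the group rewards are
    (low within-group std -> high certainty); this proxy is an assumption,
    not the paper's definition.
    """
    certainty = 1.0 / (1.0 + rewards.std())  # in (0, 1], higher = more certain
    a_std = grpo_advantage(rewards, eps)
    a_pct = percent_deviation_advantage(rewards, eps)
    return certainty * a_pct + (1.0 - certainty) * a_std


if __name__ == "__main__":
    # e.g. binary correctness of 4 rollouts sampled for one query
    group_rewards = np.array([1.0, 1.0, 0.0, 1.0])
    print(mixed_advantage(group_rewards))
```

The intent of the mixing weight is that samples whose rollouts are nearly unanimous lean on the percent-deviation form, while uncertain samples fall back to the usual group-normalized advantage; the paper's precise weighting scheme may differ.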