MAPO: Mixed Advantage Policy Optimization
September 23, 2025
Authors: Wenke Huang, Quan Zhang, Yiyang Fang, Jian Liang, Xuankun Rong, Huanjin Yao, Guancheng Wan, Ke Liang, Wenwen He, Mingjun Li, Leszek Rutkowski, Mang Ye, Bo Du, Dacheng Tao
cs.AI
Abstract
Recent advances in reinforcement learning for foundation models, such as
Group Relative Policy Optimization (GRPO), have significantly improved
performance on reasoning tasks. Notably, the advantage function serves as the
central mechanism in GRPO for ranking trajectory importance. However, existing
approaches suffer from both advantage reversion and advantage mirror problems,
which hinder reasonable advantage allocation across different query samples. In
this work, we propose a simple yet effective GRPO strategy, Mixed Advantage
Policy Optimization (MAPO). We observe that trajectories exhibit varying
degrees of certainty and propose the advantage percent deviation for samples
with high-certainty trajectories. Furthermore, we dynamically reweight the
advantage function according to trajectory certainty, thereby adaptively
configuring the advantage function to account for sample-specific
characteristics. Comparisons with related state-of-the-art methods, along with
ablation studies on different advantage variants, validate the effectiveness of
our approach.
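
The abstract describes the method only at a high level. As a rough illustration of the general idea, the sketch below pairs the standard group-relative GRPO advantage (reward z-scored within its rollout group) with a hypothetical percent-deviation variant and a certainty-driven mixing weight. The percent-deviation formula, the certainty proxy, and the function names are assumptions for illustration only, not the paper's actual definitions.

```python
# Illustrative sketch only: the group-relative z-score is the standard GRPO
# advantage; the percent-deviation form and the certainty proxy below are
# hypothetical stand-ins for the quantities named in the abstract.
import numpy as np


def grpo_advantage(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard GRPO advantage: z-score of each rollout reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def percent_deviation_advantage(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Hypothetical 'advantage percent deviation': deviation relative to the group mean."""
    return (rewards - rewards.mean()) / (np.abs(rewards.mean()) + eps)


def mixed_advantage(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Blend the two advantage forms with a weight driven by trajectory certainty.

    Certainty is proxied here by how concentrated the group rewards are
    (low within-group std -> high certainty); this proxy is an assumption,
    not the paper's definition.
    """
    certainty = 1.0 / (1.0 + rewards.std())  # in (0, 1], higher = more certain
    a_std = grpo_advantage(rewards, eps)
    a_pct = percent_deviation_advantage(rewards, eps)
    return certainty * a_pct + (1.0 - certainty) * a_std


if __name__ == "__main__":
    # e.g. binary correctness of 4 rollouts sampled for one query
    group_rewards = np.array([1.0, 1.0, 0.0, 1.0])
    print(mixed_advantage(group_rewards))
```

The intent of the mixing weight is that samples whose rollouts are nearly unanimous lean on the percent-deviation form, while uncertain samples fall back to the usual group-normalized advantage; the paper's precise weighting scheme may differ.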