

MAPO: Mixed Advantage Policy Optimization

September 23, 2025
作者: Wenke Huang, Quan Zhang, Yiyang Fang, Jian Liang, Xuankun Rong, Huanjin Yao, Guancheng Wan, Ke Liang, Wenwen He, Mingjun Li, Leszek Rutkowski, Mang Ye, Bo Du, Dacheng Tao
cs.AI

Abstract

Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as the central mechanism in GRPO for ranking trajectory importance. However, existing explorations encounter both advantage reversion and advantage mirror problems, which hinder reasonable advantage allocation across different query samples. In this work, we propose a simple but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We observe that trajectories exhibit varying degrees of certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparisons with related state-of-the-art methods, along with ablation studies on different advantage variants, validate the effectiveness of our approach.
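
To make the idea concrete, the sketch below illustrates how a mixed advantage could be computed for a group of sampled responses to one query. The exact formulas are not given in the abstract; the two variants (the standard GRPO z-score advantage and a percent-deviation advantage) and the certainty-based mixing weight used here are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def mixed_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Blend two advantage variants for one group of trajectory rewards.

    Assumption-based sketch: the percent-deviation form, the certainty
    proxy, and the mixing rule are placeholders for the method described
    in the abstract.
    """
    mean, std = rewards.mean(), rewards.std()

    # Standard GRPO advantage: z-score of each reward within its group.
    adv_grpo = (rewards - mean) / (std + eps)

    # Assumed "advantage percent deviation": relative deviation from the group mean.
    adv_percent = (rewards - mean) / (abs(mean) + eps)

    # Assumed certainty proxy: a group whose rewards barely vary is "high certainty".
    certainty = 1.0 / (1.0 + std)

    # Dynamically reweight the two variants according to trajectory certainty.
    return certainty * adv_percent + (1.0 - certainty) * adv_grpo

# Example: four sampled responses to the same query.
if __name__ == "__main__":
    group_rewards = np.array([0.9, 1.0, 1.0, 0.95])
    print(mixed_advantage(group_rewards))
```

In this toy setup, a nearly uniform reward group leans on the percent-deviation term, while a high-variance group falls back toward the ordinary group-relative advantage; the actual weighting scheme in MAPO should be taken from the paper itself.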