MAPO: Mixed Advantage Policy Optimization

Abstract

Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as the central mechanism in GRPO for ranking trajectory importance. However, existing approaches suffer from both the advantage reversion and the advantage mirror problems, which hinder reasonable advantage allocation across different query samples. In this work, we propose a simple yet effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We show that trajectories appear with differing certainty, and we propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparisons with related state-of-the-art methods, along with ablation studies on different advantage variants, validate the effectiveness of our approach.
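As a rough illustration of the ideas named in the abstract, the sketch below computes the standard group-relative (z-score) advantage used in GRPO, a percent-deviation variant, and a weighted blend of the two. The abstract gives no formulas, so everything beyond the z-score form is an assumption: the function names, the definition of percent deviation as deviation relative to the group mean, and the linear blend. How the blend weight would be derived from trajectory certainty is not stated in the abstract, so the caller supplies it here.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: z-score each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def percent_deviation_advantages(rewards, eps=1e-8):
    """Assumed percent-deviation form: deviation relative to the group mean.

    This definition is an illustrative guess, not taken from the abstract.
    """
    mu = mean(rewards)
    return [(r - mu) / (mu + eps) for r in rewards]


def mixed_advantages(rewards, weight):
    """Blend both forms; `weight` in [0, 1] is the share given to percent deviation.

    The weight is a hypothetical, caller-supplied stand-in for a
    certainty-dependent reweighting.
    """
    z = grpo_advantages(rewards)
    pd = percent_deviation_advantages(rewards)
    return [(1.0 - weight) * a + weight * b for a, b in zip(z, pd)]


if __name__ == "__main__":
    # One query with four sampled trajectories and binary rewards.
    group = [1, 1, 1, 0]
    print(grpo_advantages(group))
    print(percent_deviation_advantages(group))
    print(mixed_advantages(group, weight=0.5))
```

With a weight of 0 the blend reduces to the plain GRPO advantage, and with a weight of 1 it reduces to the percent-deviation form, so the single parameter shifts between the two behaviours for a given sample.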
