MAPO: Mixed Advantage Policy Optimization

Abstract

Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as the central mechanism in GRPO for ranking trajectory importance. However, existing approaches suffer from both advantage reversion and advantage mirror problems, which hinder reasonable advantage allocation across different query samples. In this work, we propose a simple yet effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We show that trajectories appear with differing certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparisons with related state-of-the-art methods, along with ablation studies on different advantage variants, validate the effectiveness of our approach.
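
The abstract names its ingredients without giving formulas, so the following is only an illustrative sketch, not the authors' implementation. The first function is the usual GRPO group-relative advantage (each reward standardized within its sampled group). The other two rest on assumptions: that "advantage percent deviation" means the deviation from the group mean relative to that mean, that rewards are binary so the group mean can serve as a success rate, and that certainty is turned into a mixing weight by the hypothetical rule `1 - 4p(1 - p)`. All function names are made up for this example.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Standard GRPO advantage: z-score of each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def percent_deviation_advantages(rewards, eps=1e-8):
    """ASSUMED form of 'advantage percent deviation': (r - mean) / mean."""
    mu = mean(rewards)
    return [(r - mu) / (mu + eps) for r in rewards]


def mixed_advantages(rewards, eps=1e-8):
    """HYPOTHETICAL certainty-weighted mix of the two advantages above.

    Assumes binary rewards, so the group mean p is the success rate.
    The weight lam is 0 when p = 0.5 (least certain group) and
    approaches 1 as p approaches 0 or 1 (most certain group).
    """
    p = mean(rewards)
    lam = 1.0 - 4.0 * p * (1.0 - p)
    z = grpo_advantages(rewards, eps)
    apd = percent_deviation_advantages(rewards, eps)
    return [(1.0 - lam) * a + lam * b for a, b in zip(z, apd)]


if __name__ == "__main__":
    group = [1, 0, 0, 0]  # one correct rollout out of four
    print(grpo_advantages(group))
    print(percent_deviation_advantages(group))
    print(mixed_advantages(group))
```

Under these assumptions, a group with a 50% success rate falls back to the plain standardized advantage, while groups that are almost always right or almost always wrong lean on the percent-deviation term instead.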
