Hölder Policy Optimisation

Abstract

Group Relative Policy Optimisation (GRPO) improves large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence, and relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework that unifies token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, a 7.2% relative gain over standard GRPO, and secures a 93.8% success rate on ALFWorld.
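
To make the aggregation step concrete, the following is a minimal sketch of the Hölder (power) mean, M_p(x) = ((1/n) Σ x_i^p)^(1/p), applied to a list of token-level probabilities. The function name holder_mean, the sample probabilities, and the handling of p = 0 as the geometric-mean limit are illustrative assumptions; this is not the HölderPO implementation, and how the mean enters the policy objective is not specified in the abstract.

```python
import math


def holder_mean(probs, p):
    """Hölder (power) mean of strictly positive values.

    Hypothetical helper for illustration only. p = 0 is treated as the
    limiting case, which is the geometric mean.
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    if any(x <= 0 for x in probs):
        raise ValueError("all values must be strictly positive")
    n = len(probs)
    if abs(p) < 1e-12:
        # Limit p -> 0: geometric mean, computed in log space.
        return math.exp(sum(math.log(x) for x in probs) / n)
    return (sum(x ** p for x in probs) / n) ** (1.0 / p)


# Made-up token probabilities for one sampled sequence (assumption).
token_probs = [0.9, 0.8, 0.05, 0.7]

# p = -1 is the harmonic mean, p = 0 the geometric mean, p = 1 the arithmetic mean.
for p in (-1.0, 0.0, 1.0, 2.0):
    print(f"p = {p:+.1f}  ->  {holder_mean(token_probs, p):.4f}")
```

The power mean is non-decreasing in p: larger p pulls the aggregate toward the largest token probability, while smaller p pulls it toward the smallest. A single parameter therefore interpolates between familiar fixed aggregations such as the geometric and arithmetic means.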