ASPO: Asymmetric Importance Sampling Policy Optimization

Abstract

Recent Large Language Model (LLM) post-training methods rely on token-level clipping mechanisms during Reinforcement Learning (RL). However, we identify a fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced weighting of positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy: it flips the IS ratios of positive-advantage tokens, aligning their update direction with the learning dynamics of negative ones. ASPO further incorporates a soft dual-clipping mechanism to stabilize extreme updates while maintaining gradient flow. Comprehensive experiments on coding and mathematical reasoning benchmarks demonstrate that ASPO significantly mitigates premature convergence, improves training stability, and enhances final performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting IS in LLM RL. The code and models of ASPO are available at https://github.com/wizard-III/Archer2.0.
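The weighting idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that "flipping" the IS ratio means using the reciprocal of r = p_new / p_old for positive-advantage tokens, and it uses a plain upper cap as a hypothetical stand-in for the soft dual-clipping mechanism, whose exact form the abstract does not specify. The function name `token_weight` and the parameter `clip_max` are invented for illustration.

```python
def token_weight(p_new, p_old, advantage, flip_positive=True, clip_max=3.0):
    """Per-token weight applied to the policy-gradient term (illustrative only).

    Assumptions, not taken from the paper:
      * "flipping" is modeled as 1 / ratio for positive-advantage tokens;
      * clip_max is a simple hard cap standing in for soft dual-clipping.
    """
    ratio = p_new / p_old  # standard importance sampling ratio
    if advantage > 0 and flip_positive:
        weight = 1.0 / ratio  # flipped ratio for positive-advantage tokens
    else:
        weight = ratio  # negative-advantage tokens keep the usual ratio
    return min(weight, clip_max)  # cap extreme weights


if __name__ == "__main__":
    # Two positive-advantage tokens: one whose probability has already
    # risen since sampling, one whose probability has fallen.
    for name, p_new, p_old in [("risen", 0.30, 0.20), ("fallen", 0.10, 0.20)]:
        standard = token_weight(p_new, p_old, advantage=1.0, flip_positive=False)
        flipped = token_weight(p_new, p_old, advantage=1.0)
        print(f"{name}: standard={standard:.3f} flipped={flipped:.3f}")
```

Under these assumptions, the standard ratio gives the larger weight to the positive token whose probability has already risen, while the flipped ratio gives the larger weight to the token that has fallen behind, which is the rebalancing the abstract describes.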
