From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature

Abstract

Reinforcement learning has emerged as a fundamental technique for enhancing reasoning in LLMs. However, existing algorithms apply uniform optimization to all tokens, ignoring the different roles tokens play in the reasoning process. To address this limitation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a comprehensive token-aware algorithm that dynamically adapts optimization based on token entropy. For rollout sampling, we propose Adaptive Temperature Sampling, which adjusts the sampling temperature in real time, promoting exploration at high-entropy tokens while preserving coherence at low-entropy ones. For advantage calculation, we introduce Token Level Group Average, which normalizes advantages at the token level, accounting for sequence length as the token-mean loss does while keeping the treatment unbiased. We then develop Differential Advantage Redistribution, which leverages entropy and importance ratios to modulate the reward-adjusting updates applied to tokens with clear signals. For the clipping loss, we design Asymmetric Adaptive Clipping, which allows aggressive probability reduction for noisy low-entropy tokens while enabling exploration for high-entropy tokens. Through a systematic investigation of the relationship between entropy and training dynamics, we embed token-level treatment into every stage to achieve fine-grained control. Extensive experiments demonstrate that HAPO consistently outperforms DAPO across multiple model scales. Our code is available at https://github.com/starriver030515/HAPO.
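
As an illustration of the idea behind entropy-dependent sampling temperature, the following is a minimal sketch, not the HAPO implementation. It assumes a simple two-level rule: compute the entropy of the next-token distribution at temperature 1, then sample with a higher temperature when the entropy exceeds a threshold and a lower one otherwise. The function names and the values of `threshold`, `t_low`, and `t_high` are hypothetical and are not taken from the paper or its repository.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_temperature(logits, threshold=0.5, t_low=0.8, t_high=1.2):
    """Choose a temperature from the entropy of the T=1 distribution.

    threshold, t_low and t_high are hypothetical illustration values.
    """
    h = entropy(softmax(logits))
    return t_high if h > threshold else t_low


def sample_token(logits, rng, **kwargs):
    """Sample one token index using the entropy-dependent temperature."""
    t = adaptive_temperature(logits, **kwargs)
    probs = softmax(logits, t)
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, t


rng = random.Random(0)
confident = [10.0, 0.0, 0.0, 0.0]  # peaked distribution, low entropy
uncertain = [1.0, 0.9, 1.1, 1.0]   # nearly flat distribution, high entropy

for name, logits in (("confident", confident), ("uncertain", uncertain)):
    token, t = sample_token(logits, rng)
    print(name, "temperature:", t, "sampled token:", token)
```

In this sketch the peaked distribution is sampled at the lower temperature, which keeps it close to greedy decoding, while the nearly flat distribution is sampled at the higher temperature, which flattens it further and encourages exploration. A real schedule could just as well be continuous in the entropy; the two-level rule is only the simplest choice for illustration.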
