From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature
September 20, 2025
Authors: Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, Wentao Zhang
cs.AI
Abstract
Reinforcement learning has emerged as a fundamental technique for enhancing reasoning in large language models (LLMs). However, existing algorithms apply uniform optimization to all tokens, ignoring their different roles in the reasoning process. To address this limitation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a comprehensive token-aware algorithm that dynamically adapts optimization based on token entropy. For rollout sampling, we propose Adaptive Temperature Sampling, which adjusts the sampling temperature in real time, promoting exploration at high-entropy tokens while preserving coherence at low-entropy ones. For advantage calculation, we introduce Token Level Group Average, which normalizes advantages at the token level, accounting for sequence length as in token-mean losses while preserving unbiased treatment. We then develop Differential Advantage Redistribution, which leverages entropy and importance ratios to modulate rewards, adjusting updates for tokens with clear signals. For the clipping loss, we design Asymmetric Adaptive Clipping, which allows aggressive probability reduction for noisy low-entropy tokens while enabling exploration for high-entropy tokens. Through a systematic investigation of the relationship between entropy and training dynamics, we embed token-level treatment into every stage of the optimization to achieve fine-grained control. Extensive experiments demonstrate that HAPO consistently outperforms DAPO across multiple model scales. Our code is available at https://github.com/starriver030515/HAPO.
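The components above are easiest to picture as small, entropy-driven modifications to a standard clipped policy-gradient token loop. The PyTorch sketch below illustrates two of them, entropy-adaptive temperature scaling for rollout sampling and asymmetric, entropy-aware clip ranges for the surrogate loss. It is a minimal sketch under assumed functional forms: the linear interpolation in normalized entropy and the constants (t_low, t_high, eps_base, eps_extra) are illustrative assumptions, not the paper's exact formulation, which is defined in the linked repository.

```python
# Illustrative sketch only; schedules and constants are assumptions, not HAPO's
# exact definitions. Only the qualitative directions follow the abstract:
# higher entropy -> more exploration, lower entropy -> more aggressive pruning.
import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token predictive entropy of the policy, shape (batch, seq_len)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


def adaptive_temperature_logits(logits: torch.Tensor,
                                t_low: float = 0.7,
                                t_high: float = 1.3) -> torch.Tensor:
    """Scale logits with a per-token temperature that grows with entropy:
    high-entropy tokens are sampled more exploratorily, low-entropy tokens
    stay close to greedy decoding to preserve coherence."""
    ent = token_entropy(logits)
    ent_norm = (ent - ent.min()) / (ent.max() - ent.min() + 1e-8)
    temp = t_low + (t_high - t_low) * ent_norm          # (batch, seq_len)
    return logits / temp.unsqueeze(-1)


def asymmetric_clip_ranges(entropy: torch.Tensor,
                           eps_base: float = 0.2,
                           eps_extra: float = 0.08):
    """Entropy-dependent clip ranges: a wider lower range for low-entropy
    tokens (room for aggressive probability reduction on noisy tokens) and a
    wider upper range for high-entropy tokens (room to explore)."""
    ent_norm = (entropy - entropy.min()) / (entropy.max() - entropy.min() + 1e-8)
    eps_low = eps_base + eps_extra * (1.0 - ent_norm)
    eps_high = eps_base + eps_extra * ent_norm
    return eps_low, eps_high


def clipped_policy_loss(ratio: torch.Tensor,
                        advantage: torch.Tensor,
                        entropy: torch.Tensor) -> torch.Tensor:
    """Token-level clipped surrogate loss with asymmetric, entropy-aware ranges."""
    eps_low, eps_high = asymmetric_clip_ranges(entropy)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```

In both pieces, normalized token entropy is the single control signal, which mirrors the abstract's central idea of tailoring each stage of optimization to a token's nature rather than applying one uniform treatment.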