From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature
September 20, 2025
Authors: Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, Wentao Zhang
cs.AI
Abstract
Reinforcement learning has emerged as a fundamental technique for enhancing reasoning in large language models (LLMs). However, existing algorithms apply uniform optimization to all tokens, ignoring their different roles in the reasoning process. To address this limitation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a comprehensive token-aware algorithm that dynamically adapts optimization based on token entropy. For rollout sampling, we propose Adaptive Temperature Sampling, which adjusts the sampling temperature in real time, promoting exploration at high-entropy tokens while preserving coherence at low-entropy ones. For advantage calculation, we introduce Token Level Group Average, which normalizes advantages at the token level, accounting for sequence length as in token-mean losses while preserving unbiased treatment. We then develop Differential Advantage Redistribution, which leverages entropy and importance ratios to modulate rewards, adjusting updates for tokens with clear signals. For the clipping loss, we design Asymmetric Adaptive Clipping, which allows aggressive probability reduction for noisy low-entropy tokens while enabling exploration for high-entropy tokens. Through a systematic investigation of the relationship between entropy and training dynamics, we embed token-level treatment into every stage of the optimization to achieve fine-grained control. Extensive experiments demonstrate that HAPO consistently outperforms DAPO across multiple model scales. Our code is available at https://github.com/starriver030515/HAPO.
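The components above are easiest to picture as small, entropy-driven modifications to a standard clipped policy-gradient token loop. The PyTorch sketch below illustrates two of them, entropy-adaptive temperature scaling for rollout sampling and asymmetric, entropy-aware clip ranges for the surrogate loss. It is a minimal sketch under assumed functional forms: the linear interpolation in normalized entropy and the constants (t_low, t_high, eps_base, eps_extra) are illustrative assumptions, not the paper's exact formulation, which is defined in the linked repository.

```python
# Illustrative sketch only; schedules and constants are assumptions, not HAPO's
# exact definitions. Only the qualitative directions follow the abstract:
# higher entropy -> more exploration, lower entropy -> more aggressive pruning.
import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token predictive entropy of the policy, shape (batch, seq_len)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


def adaptive_temperature_logits(logits: torch.Tensor,
                                t_low: float = 0.7,
                                t_high: float = 1.3) -> torch.Tensor:
    """Scale logits with a per-token temperature that grows with entropy:
    high-entropy tokens are sampled more exploratorily, low-entropy tokens
    stay close to greedy decoding to preserve coherence."""
    ent = token_entropy(logits)
    ent_norm = (ent - ent.min()) / (ent.max() - ent.min() + 1e-8)
    temp = t_low + (t_high - t_low) * ent_norm          # (batch, seq_len)
    return logits / temp.unsqueeze(-1)


def asymmetric_clip_ranges(entropy: torch.Tensor,
                           eps_base: float = 0.2,
                           eps_extra: float = 0.08):
    """Entropy-dependent clip ranges: a wider lower range for low-entropy
    tokens (room for aggressive probability reduction on noisy tokens) and a
    wider upper range for high-entropy tokens (room to explore)."""
    ent_norm = (entropy - entropy.min()) / (entropy.max() - entropy.min() + 1e-8)
    eps_low = eps_base + eps_extra * (1.0 - ent_norm)
    eps_high = eps_base + eps_extra * ent_norm
    return eps_low, eps_high


def clipped_policy_loss(ratio: torch.Tensor,
                        advantage: torch.Tensor,
                        entropy: torch.Tensor) -> torch.Tensor:
    """Token-level clipped surrogate loss with asymmetric, entropy-aware ranges."""
    eps_low, eps_high = asymmetric_clip_ranges(entropy)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```

In both pieces, normalized token entropy is the single control signal, which mirrors the abstract's central idea of tailoring each stage of optimization to a token's nature rather than applying one uniform treatment.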