

From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature

September 20, 2025
Authors: Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, Wentao Zhang
cs.AI

Abstract

Reinforcement Learning has emerged as the fundamental technique for enhancing reasoning in LLMs. However, existing algorithms apply uniform optimization to all tokens, ignoring their different roles in the reasoning process. To address this limitation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a comprehensive token-aware algorithm that dynamically adapts optimization based on token entropy. For rollout sampling, we propose Adaptive Temperature Sampling, which adjusts the sampling temperature in real time, promoting exploration at high-entropy tokens while preserving coherence at low-entropy ones. For advantage calculation, we introduce Token Level Group Average, which normalizes advantages at the token level, jointly accounting for sequence length (as in token-mean loss) while preserving unbiased treatment. We then develop Differential Advantage Redistribution, which leverages entropy and importance ratios to modulate rewards, adjusting updates for tokens that carry clear signals. For the clipping loss, we design Asymmetric Adaptive Clipping, which allows aggressive probability reduction for noisy low-entropy tokens while still enabling exploration at high-entropy tokens. Through a systematic investigation of entropy and training dynamics, we embed token-level treatment into every stage to achieve fine-grained control. Extensive experiments demonstrate that HAPO consistently outperforms DAPO across multiple model scales. Our code is available at https://github.com/starriver030515/HAPO.
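
To make the token-aware treatment concrete, below is a minimal PyTorch sketch of two of the ideas described in the abstract: an entropy-adaptive sampling temperature and an entropy-modulated, asymmetric clip range. The function names, the hyperparameters (t_low, t_high, eps, eps_extra), and the linear entropy schedules are illustrative assumptions, not the paper's reference implementation (see the linked repository for that).

```python
# Hypothetical sketch of two token-aware components: per-token temperature
# scaling during rollout and an asymmetric, entropy-dependent clip range in
# the policy-gradient loss. Hyperparameters and schedules are assumptions.
import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution, normalized to [0, 1]."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
    return entropy / torch.log(torch.tensor(float(logits.size(-1))))


def adaptive_temperature_sample(logits: torch.Tensor,
                                t_low: float = 0.7,
                                t_high: float = 1.3) -> torch.Tensor:
    """Sample the next token with a temperature that grows with entropy:
    more exploratory at uncertain (high-entropy) tokens, closer to greedy
    at confident (low-entropy) ones. t_low / t_high are assumed bounds."""
    norm_ent = token_entropy(logits).unsqueeze(-1)          # (batch, 1)
    temperature = t_low + (t_high - t_low) * norm_ent       # per-token temperature
    return torch.multinomial(F.softmax(logits / temperature, dim=-1), 1)


def asymmetric_clip_loss(log_probs: torch.Tensor,
                         old_log_probs: torch.Tensor,
                         advantages: torch.Tensor,
                         norm_ent: torch.Tensor,
                         eps: float = 0.2,
                         eps_extra: float = 0.1) -> torch.Tensor:
    """PPO-style clipped surrogate with an asymmetric, entropy-modulated
    clip range: the upper bound widens at high-entropy tokens (room to
    explore), while the lower bound widens at low-entropy tokens
    (aggressive probability reduction of noisy tokens)."""
    ratio = torch.exp(log_probs - old_log_probs)             # importance ratio
    eps_high = eps + eps_extra * norm_ent                    # grows with entropy
    eps_low = eps + eps_extra * (1.0 - norm_ent)             # grows as entropy falls
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

In a full pipeline, the first function would be applied per generated token during rollout and the second per token in the policy-gradient loss; the advantage normalization and redistribution steps described in the abstract would sit between the two.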