PEAR: Phase Entropy Aware Reward for Efficient Reasoning
October 9, 2025
Authors: Chen Huang, Wei Lu, Wenxuan Zhang
cs.AI
Abstract
Large Reasoning Models (LRMs) have achieved impressive performance on complex
reasoning tasks by generating detailed chain-of-thought (CoT) explanations.
However, these responses are often excessively long, containing redundant
reasoning steps that inflate inference cost and reduce usability. Controlling
the length of generated reasoning without sacrificing accuracy remains an open
challenge. Through a systematic empirical analysis, we reveal a consistent
positive correlation between model entropy and response length at different
reasoning stages across diverse LRMs: the thinking phase exhibits higher
entropy, reflecting the exploratory behavior associated with longer responses, while the final
answer phase shows lower entropy, indicating a more deterministic solution.
This observation suggests that entropy at different reasoning stages can serve
as a control knob for balancing conciseness and performance. Based on this
insight, this paper introduces Phase Entropy Aware Reward (PEAR), a reward
mechanism that incorporates phase-dependent entropy into the reward design.
Instead of treating all tokens uniformly, PEAR penalizes excessive entropy
during the thinking phase and allows moderate exploration in the final answer
phase, encouraging models to generate concise reasoning traces that retain
sufficient flexibility to solve the task correctly. This enables adaptive
control of response length without relying on explicit length targets or rigid
truncation rules. Extensive experiments across four benchmarks demonstrate that
PEAR consistently reduces response length while sustaining competitive accuracy
across model scales. In addition, PEAR exhibits strong robustness on
out-of-distribution (OOD) data beyond the training distribution. Our code is available at:
https://github.com/iNLP-Lab/PEAR.
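
The phase-dependent reward described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: it assumes per-token entropies are computed from the model's logits and that the boundary between the thinking phase and the final answer phase is known (e.g., the position of an end-of-thinking delimiter). The function name `pear_reward` and the penalty weights `alpha` and `beta` are hypothetical placeholders for the paper's actual reward formulation.

```python
import torch


def pear_reward(
    token_logits: torch.Tensor,  # (seq_len, vocab_size) logits of the generated response
    think_end_idx: int,          # index where the thinking phase ends and the answer begins
    correct: bool,               # task-level correctness signal for the final answer
    alpha: float = 0.1,          # penalty weight on thinking-phase entropy (hypothetical)
    beta: float = 0.02,          # lighter penalty weight on answer-phase entropy (hypothetical)
) -> float:
    """Combine a correctness reward with phase-dependent entropy penalties.

    High entropy during the thinking phase is penalized to discourage overly long,
    exploratory traces; the answer phase receives a much lighter penalty so the
    model retains enough flexibility to state the solution.
    """
    probs = torch.softmax(token_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)  # per-token entropy, shape (seq_len,)

    think_entropy = entropy[:think_end_idx].mean() if think_end_idx > 0 else torch.tensor(0.0)
    answer_entropy = entropy[think_end_idx:].mean() if think_end_idx < len(entropy) else torch.tensor(0.0)

    base = 1.0 if correct else 0.0
    return base - alpha * think_entropy.item() - beta * answer_entropy.item()


# Toy usage: a 120-token response whose thinking phase ends at token 90.
logits = torch.randn(120, 32000)
reward = pear_reward(logits, think_end_idx=90, correct=True)
print(reward)
```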