Mitigating Overthinking through Reasoning Shaping
October 10, 2025
Authors: Feifan Song, Shaohang Wei, Bofei Gao, Yejie Wang, Wen Luo, Wei Li, Linli Yao, Weimin Xiong, Liang Chen, Tianyu Liu, Houfeng Wang
cs.AI
Abstract
Large reasoning models (LRMs) boosted by Reinforcement Learning from Verifier Reward (RLVR) have shown great power in problem solving, yet they often exhibit overthinking: excessive, meandering reasoning that inflates computational cost. Prior penalization designs in RLVR reduce token consumption but often harm model performance, a shortcoming that stems from overly simplistic token-level supervision. In this paper, we argue that the granularity of supervision plays a crucial role in balancing efficiency and accuracy, and propose Group Relative Segment Penalization (GRSP), a step-level method to regularize reasoning. Since preliminary analyses show that reasoning segments are strongly correlated with token consumption and model performance, we design a length-aware weighting mechanism across segment clusters. Extensive experiments demonstrate that GRSP achieves superior token efficiency without heavily compromising accuracy, with especially pronounced advantages on harder problems. Moreover, GRSP stabilizes RL training and scales effectively across model sizes.
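
To make the segment-level, group-relative idea concrete, below is a minimal sketch of how a length-aware segment penalty could be combined with a verifier reward and normalized within a group of sampled responses. The segmentation heuristic, the length bins, the penalty strength alpha, and all function names are assumptions made for illustration; the abstract does not specify GRSP's exact formulation, so this should be read as a sketch of the general technique rather than the authors' method.

```python
# Illustrative sketch: segment-level penalty with length-aware cluster weighting,
# folded into a group-relative (GRPO-style) reward normalization.
# All constants and helper names below are assumptions, not the paper's definitions.
from typing import List
import numpy as np

LENGTH_BINS = [0, 64, 256, 1024, float("inf")]  # assumed segment-length clusters
ALPHA = 0.01                                    # assumed penalty strength

def split_into_segments(response: str) -> List[str]:
    """Assumed segmentation heuristic: split the reasoning trace at blank lines."""
    return [s for s in response.split("\n\n") if s.strip()]

def length_aware_penalty(response: str) -> float:
    """Penalize each segment in proportion to its length, weighted by the
    length cluster it falls into (longer clusters receive larger weights)."""
    penalty = 0.0
    for seg in split_into_segments(response):
        n_tokens = len(seg.split())                    # crude token count for illustration
        cluster = int(np.digitize(n_tokens, LENGTH_BINS))  # index of the length bin
        penalty += cluster * n_tokens
    return ALPHA * penalty

def group_relative_rewards(responses: List[str],
                           verifier_rewards: List[float]) -> np.ndarray:
    """Subtract the segment penalty from each verifier reward, then normalize
    the shaped rewards within the sampled group (group-relative advantage)."""
    shaped = np.array([r - length_aware_penalty(resp)
                       for r, resp in zip(verifier_rewards, responses)])
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)
```

Under this sketch, a response that reaches a correct answer with a few short segments keeps most of its verifier reward, while an equally correct but long, meandering response is pushed below the group mean, which is the kind of pressure a segment-level regularizer is meant to apply.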