Mitigating Overthinking through Reasoning Shaping
October 10, 2025
Authors: Feifan Song, Shaohang Wei, Bofei Gao, Yejie Wang, Wen Luo, Wei Li, Linli Yao, Weimin Xiong, Liang Chen, Tianyu Liu, Houfeng Wang
cs.AI
Abstract
Large reasoning models (LRMs) boosted by Reinforcement Learning from Verifier
Reward (RLVR) have shown great power in problem solving, yet they are prone to
overthinking: excessive, meandering reasoning that inflates computational cost.
Prior penalization designs in RLVR reduce token consumption but often harm
model performance, a consequence of overly simple token-level supervision.
In this paper, we argue that the granularity of
supervision plays a crucial role in balancing efficiency and accuracy, and
propose Group Relative Segment Penalization (GRSP), a step-level method to
regularize reasoning. Since preliminary analyses show that reasoning segments
are strongly correlated with token consumption and model performance, we design
a length-aware weighting mechanism across segment clusters. Extensive
experiments demonstrate that GRSP achieves superior token efficiency without
heavily compromising accuracy, with the advantage most pronounced on harder problems.
Moreover, GRSP stabilizes RL training and scales effectively across model
sizes.