Does Your Reasoning Model Implicitly Know When to Stop Thinking?
February 9, 2026
Authors: Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuanda Wang, Zhixia Zhang, Hongyan Xie, Songshi Liang, Zehao Chen, Xuefeng Xiao, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang
cs.AI
Abstract
Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a deeper analysis of this phenomenon, we uncover, surprisingly, and empirically verify that LRMs implicitly know the appropriate time to stop thinking, though this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as a mixed sampling strategy into group-based reinforcement learning (SAGE-RL) enables SAGE-RL to effectively incorporate SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.