

Does Your Reasoning Model Implicitly Know When to Stop Thinking?

February 9, 2026
作者: Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuanda Wang, Zhixia Zhang, Hongyan Xie, Songshi Liang, Zehao Chen, Xuefeng Xiao, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang
cs.AI

Abstract

Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often introduces substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a further in-depth analysis of this phenomenon, we uncover a surprising fact and verify it empirically: LRMs implicitly know the appropriate time to stop thinking, although this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) effectively incorporates SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.
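The abstract does not specify how the mixed-sampling update in SAGE-RL is computed. As an illustration only, assuming a GRPO-style group-normalized reward (a common choice in group-based RL for LRMs, not confirmed by this abstract), pooling standard and SAGE rollouts into one scored group might be sketched as follows; all function and field names here are hypothetical:

```python
import statistics

def group_advantages(rewards):
    """GRPO-style group-normalized advantages: each rollout's reward
    minus the group mean, scaled by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

def mixed_group(standard_rollouts, sage_rollouts):
    """Hypothetical mixed-sampling group: pool rollouts drawn with the
    standard sampler and with SAGE, then score them jointly so that
    efficient SAGE-discovered traces can earn positive advantage and
    steer the policy used at standard pass@1 inference."""
    group = standard_rollouts + sage_rollouts
    rewards = [r["reward"] for r in group]
    return list(zip(group, group_advantages(rewards)))
```

Because both rollout types share one normalization group, a shorter SAGE trace that still reaches the correct answer receives the same advantage signal as a longer standard trace, which is one plausible way efficiency could transfer into the base sampling policy.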