Does Your Reasoning Model Implicitly Know When to Stop Thinking?

Abstract

Recent advances in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through long chains of thought (CoTs). However, this approach often produces substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a further in-depth analysis of this phenomenon, we surprisingly uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, but that this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) effectively incorporates SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.
