Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

Abstract

Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but they often overthink: they continue to reason after a solution has already stabilized, wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence. They may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically incomplete. We identify reasoning-level semantic redundancy as a complementary signal for semantic-preserving early exit: when successive steps no longer add novel progress and instead revisit established conclusions, the reasoning trajectory has likely converged. Building on this insight, we propose PUMA, a plug-and-play framework that combines a lightweight Redundancy Detector with answer-level verification. The detector flags semantically redundant candidate exits, while verification confirms whether stopping is safe, allowing PUMA to remove redundant continuation while preserving both answer accuracy and a coherent reasoning prefix. Across five LRMs and five challenging reasoning benchmarks, PUMA achieves 26.2% average token reduction while preserving accuracy and retained CoT quality. Additional experiments on code generation, zero-shot vision-language reasoning, and learned stopping-policy internalization further demonstrate that reasoning-level redundancy is a robust, transferable, and learnable signal for efficient reasoning. Our code is available at https://github.com/giovanni-vaccarino/PUMA.
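To make the two-stage idea concrete, the following is a minimal Python sketch of a "flag, then verify" exit rule. It is not PUMA's implementation. The abstract does not specify how the Redundancy Detector or the verification step works, so everything here is an assumption made for illustration: the token-overlap (Jaccard) score stands in for a learned detector, and the names `redundancy`, `find_exit`, `toy_trial_answer`, `threshold`, and `patience` are hypothetical. Verification is approximated by checking that a trial answer read from the kept prefix matches the one read from the longer chain.

```python
import re
from typing import Callable, List, Optional, Set


def tokens(step: str) -> Set[str]:
    """Lowercased alphanumeric tokens of one reasoning step."""
    return set(re.findall(r"[a-z0-9]+", step.lower()))


def redundancy(step: str, history: List[str]) -> float:
    """Highest Jaccard overlap between `step` and any earlier step.

    Assumption: a crude lexical stand-in for a semantic redundancy detector.
    """
    cur = tokens(step)
    best = 0.0
    for prev in history:
        union = cur | tokens(prev)
        if union:
            best = max(best, len(cur & tokens(prev)) / len(union))
    return best


def find_exit(
    steps: List[str],
    trial_answer: Callable[[List[str]], Optional[str]],
    threshold: float = 0.6,  # hypothetical redundancy cutoff
    patience: int = 2,       # hypothetical number of consecutive redundant steps
) -> Optional[int]:
    """Return how many leading steps to keep, or None if no safe exit is found."""
    streak = 0
    for i, step in enumerate(steps):
        # Stage 1: flag a candidate exit after `patience` redundant steps in a row.
        if i > 0 and redundancy(step, steps[:i]) >= threshold:
            streak += 1
        else:
            streak = 0
        if streak >= patience:
            cut = i - streak + 1  # drop the whole redundant run
            # Stage 2: accept the exit only if the trial answer is already stable.
            before = trial_answer(steps[:cut])
            after = trial_answer(steps[: i + 1])
            if before is not None and before == after:
                return cut
    return None


def toy_trial_answer(prefix: List[str]) -> Optional[str]:
    """Toy answer reader: the last 'x = <int>' in the prefix (assumption)."""
    found = re.findall(r"x\s*=\s*(-?\d+)", " ".join(prefix))
    return found[-1] if found else None


if __name__ == "__main__":
    demo = [
        "Let x + 3 = 7.",
        "Subtract 3 from both sides so x = 4.",
        "Check: 4 + 3 = 7, correct.",
        "Wait, subtract 3 from both sides so x = 4.",
        "So again subtract 3 from both sides, x = 4.",
    ]
    print(find_exit(demo, toy_trial_answer))  # 3: keep the first three steps
```

In a real system, `trial_answer` would presumably prompt the model for an answer given the truncated chain, and the detector would operate on meaning rather than surface tokens. The sketch only shows why the two signals are complementary: redundancy proposes where to stop, and the answer check can veto a stop that would change the result.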