First Try Matters: Revisiting the Role of Reflection in Reasoning Models

Abstract

Large language models have recently demonstrated significant gains in reasoning ability, often attributed to their capacity to generate longer chains of thought and engage in reflective reasoning. However, how much reflection contributes to performance improvement remains unclear. In this paper, we systematically analyze the rollouts of eight reasoning models on five mathematical datasets. We focus on reflective behaviours in which the model has already produced an answer but continues reflecting before finalizing its output. Our analysis reveals that reflections are predominantly confirmatory and rarely alter the model's initial answer, a pattern that is consistent across models and datasets. To understand the role of reflections in training, we construct supervised fine-tuning (SFT) datasets with varying numbers of reflection steps. We observe that training models on rollouts with more reflection steps primarily improves first-answer correctness rather than the ability to correct initially wrong answers through reflection. This motivates a question-aware early-stopping method that improves inference-time token efficiency by stopping the reasoning process once a few plausible candidate answers have been generated, thereby reducing unnecessary reflection steps. We further propose dynamically truncating the reflections after a candidate answer has appeared during generation, which reduces reasoning tokens by 24.5% across five mathematical datasets while keeping the drop in accuracy within 2.9%.
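The truncation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how candidate answers are detected or how the question-aware stopping decision is made. The sketch assumes, purely for illustration, that a rollout is already split into steps and that a candidate answer is marked with `\boxed{...}`. The names `extract_candidate`, `truncate_rollout`, and `token_reduction` are hypothetical.

```python
import re
from typing import List, Optional

# Assumption (not from the paper): a candidate answer is written as \boxed{...}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_candidate(step: str) -> Optional[str]:
    """Return the last boxed answer found in a reasoning step, or None."""
    matches = BOXED.findall(step)
    return matches[-1].strip() if matches else None


def truncate_rollout(steps: List[str], max_candidates: int = 1) -> List[str]:
    """Keep steps up to and including the one holding the max_candidates-th candidate.

    If fewer candidates appear, the rollout is returned unchanged.
    """
    if max_candidates < 1:
        raise ValueError("max_candidates must be at least 1")
    seen = 0
    kept: List[str] = []
    for step in steps:
        kept.append(step)
        if extract_candidate(step) is not None:
            seen += 1
            if seen >= max_candidates:
                break  # drop the remaining (reflection) steps
    return kept


def token_reduction(original: List[str], truncated: List[str]) -> float:
    """Fraction of whitespace-separated tokens removed by truncation."""
    total = sum(len(s.split()) for s in original)
    kept = sum(len(s.split()) for s in truncated)
    return 0.0 if total == 0 else 1.0 - kept / total
```

With `max_candidates=1` the sketch cuts everything after the first candidate answer. A larger value loosely mirrors stopping "once a few plausible candidate answers are generated", although a real system would need a more robust candidate detector and a per-question stopping rule.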
