What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT

Abstract

Large reasoning models (LRMs) spend substantial test-time compute on long chain-of-thought (CoT) traces, but what *characterizes* an effective CoT remains unclear. While prior work reports gains from lengthening CoTs and increasing review (revisiting earlier steps) via appended *wait* tokens, recent studies suggest that shorter thinking can outperform longer traces. We therefore conduct a systematic evaluation across ten LRMs on math and scientific reasoning. Contrary to the "longer-is-better" narrative, we find that both naive CoT lengthening and increased review are associated with *lower* accuracy. Because a CoT unfolds step by step, token-level metrics can conflate verbosity with process quality. We introduce a graph view of CoT to extract structure and identify a single statistic, the *Failed-Step Fraction (FSF)*, defined as the fraction of steps in abandoned branches, that consistently outpredicts length and review ratio for correctness across models. To probe causality, we design two interventions. First, we rank candidate CoTs by each metric at test time, where FSF yields the largest pass@1 gains. Second, we edit CoTs to remove failed branches, which significantly improves accuracy, indicating that failed branches bias subsequent reasoning. Taken together, these results characterize effective CoTs as those that *fail less* and support *structure-aware* test-time scaling over indiscriminately generating long CoTs.
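
As a rough illustration of the FSF definition and of the test-time ranking intervention, the following minimal Python sketch computes the fraction of steps that fall inside abandoned branches and picks the candidate with the lowest value. The abstract does not describe how the CoT graph is extracted, so the `Step` representation, the `abandoned` flag, and the function names here are hypothetical. Choosing the lowest FSF is likewise an assumption, inferred from the claim that effective CoTs fail less.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One reasoning step in a CoT graph (hypothetical representation)."""
    step_id: int
    parent_id: Optional[int]   # None for the root step
    abandoned: bool = False    # True if the model gave up on the branch starting here


def failed_step_fraction(steps: list[Step]) -> float:
    """Return the fraction of steps that lie inside abandoned branches."""
    if not steps:
        return 0.0
    by_id = {s.step_id: s for s in steps}

    def in_abandoned_branch(step: Step) -> bool:
        # Walk up toward the root; a step counts as failed if it or any
        # ancestor is the start of an abandoned branch.
        current: Optional[Step] = step
        while current is not None:
            if current.abandoned:
                return True
            if current.parent_id is None:
                return False
            current = by_id.get(current.parent_id)
        return False

    failed = sum(in_abandoned_branch(s) for s in steps)
    return failed / len(steps)


def select_by_fsf(candidates: list[list[Step]]) -> int:
    """Return the index of the candidate CoT with the lowest FSF (ties: first)."""
    return min(range(len(candidates)),
               key=lambda i: failed_step_fraction(candidates[i]))


# Toy example: candidate A explores a dead end (steps 2 and 3); candidate B does not.
cot_a = [
    Step(0, None),
    Step(1, 0),
    Step(2, 1, abandoned=True),
    Step(3, 2),
    Step(4, 1),
]
cot_b = [Step(0, None), Step(1, 0), Step(2, 1)]

fsf_a = failed_step_fraction(cot_a)   # 2 of 5 steps are in the abandoned branch
fsf_b = failed_step_fraction(cot_b)   # no abandoned branch
best = select_by_fsf([cot_a, cot_b])  # index of the preferred candidate
```

In this toy case the second candidate is preferred because none of its steps belong to an abandoned branch, even though it is also the shorter trace. A length-based ranking would need a separate rule to decide whether shorter or longer is better.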
