What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT

September 23, 2025

Authors: Yunzhen Feng, Julia Kempe, Cheng Zhang, Parag Jain, Anthony Hartshorn

cs.AI

Abstract

Large reasoning models (LRMs) spend substantial test-time compute on long chain-of-thought (CoT) traces, but what *characterizes* an effective CoT remains unclear. While prior work reports gains from lengthening CoTs and increasing review (revisiting earlier steps) via appended *wait* tokens, recent studies suggest that shorter thinking can outperform longer traces. We therefore conduct a systematic evaluation across ten LRMs on math and scientific reasoning. Contrary to the "longer-is-better" narrative, we find that both naive CoT lengthening and increased review are associated with *lower* accuracy. As a CoT unfolds step by step, token-level metrics can conflate verbosity with process quality. We introduce a graph view of CoT to extract structure and identify a single statistic, the *Failed-Step Fraction (FSF)*, defined as the fraction of steps in abandoned branches, that consistently outpredicts length and review ratio for correctness across models. To probe causality, we design two interventions. First, we rank candidate CoTs by each metric at test time, where FSF yields the largest pass@1 gains; second, we edit CoTs to remove failed branches, which significantly improves accuracy, indicating that failed branches bias subsequent reasoning. Taken together, these results characterize effective CoTs as those that *fail less* and support *structure-aware* test-time scaling over indiscriminately generating long CoTs.
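
To make the FSF definition and the ranking intervention concrete, the following is a minimal sketch, not the authors' implementation. It assumes each CoT has already been split into steps and that each step carries a boolean label saying whether it lies in an abandoned branch; the abstract does not specify how the graph is extracted, so the `Step` class, its `abandoned` field, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One reasoning step of a CoT (hypothetical representation)."""
    text: str
    abandoned: bool  # assumed label: True if the step lies in an abandoned branch


def failed_step_fraction(steps: List[Step]) -> float:
    """Fraction of steps that belong to abandoned branches."""
    if not steps:
        return 0.0  # assumption: an empty trace has no failed steps
    return sum(step.abandoned for step in steps) / len(steps)


def pick_lowest_fsf(candidates: List[List[Step]]) -> int:
    """Index of the candidate CoT with the smallest FSF (first one on ties)."""
    return min(range(len(candidates)),
               key=lambda i: failed_step_fraction(candidates[i]))


# Two toy candidate traces for the same problem.
trace_a = [
    Step("try substitution", abandoned=True),
    Step("substitution fails", abandoned=True),
    Step("factor instead", abandoned=False),
    Step("solve", abandoned=False),
]
trace_b = [
    Step("factor", abandoned=False),
    Step("check a sign, dead end", abandoned=True),
    Step("simplify", abandoned=False),
    Step("solve", abandoned=False),
]

fsf_a = failed_step_fraction(trace_a)  # 2 of 4 steps abandoned
fsf_b = failed_step_fraction(trace_b)  # 1 of 4 steps abandoned
best = pick_lowest_fsf([trace_a, trace_b])
```

Here `trace_b` is preferred because a smaller share of its steps sits in abandoned branches. Both traces have the same number of steps, so a length-based ranking could not separate them.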