First Try Matters: Revisiting the Role of Reflection in Reasoning Models
October 9, 2025
Authors: Liwei Kang, Yue Deng, Yao Xiao, Zhanfeng Mo, Wee Sun Lee, Lidong Bing
cs.AI
Abstract
Large language models have recently demonstrated significant gains in
reasoning ability, often attributed to their capacity to generate longer chains
of thought and engage in reflective reasoning. However, the contribution of
reflections to performance improvement remains unclear. In this paper, we
systematically analyze the rollouts of eight reasoning models on five
mathematical datasets. We focus on reflective behaviours where the model has
already produced an answer but continues reflecting before finalizing its
output. Our analysis reveals that reflections are predominantly confirmatory
and rarely alter the model's initial answer, a pattern consistent across models
and datasets. To understand the role of reflections in training, we construct
supervised fine-tuning (SFT) datasets with varying amounts of reflection steps.
We observe that training models on rollouts with more reflection steps
primarily enhances first-answer correctness rather than the ability to correct
initially wrong answers through reflections. This motivates us to propose a
question-aware early-stopping method that enhances inference-time token
efficiency by stopping the reasoning process once a few plausible candidate
answers are generated, thereby reducing unnecessary reflection steps. Building
on this, we further propose to dynamically truncate the reflections once a
candidate answer has appeared during generation, which reduces reasoning tokens
by 24.5% across five mathematical datasets with only a 2.9% drop in accuracy.
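The early-stopping idea above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes candidate answers appear as `\boxed{...}` spans in the generated text and uses a fixed repeat threshold `k` in place of the paper's question-aware stopping rule; the function name `early_stop_reasoning` and the streaming-chunk interface are illustrative assumptions.

```python
import re

# Assumption for this sketch: candidate answers appear as \boxed{...} spans
# in the generated text; the paper's extraction and question-aware rule may differ.
BOXED = re.compile(r"\\boxed\{([^}]*)\}")

def early_stop_reasoning(chunks, k=2):
    """Consume streamed text chunks; stop once the same candidate answer
    has been produced k times, truncating further (mostly confirmatory)
    reflection steps. Returns (answer, text kept so far)."""
    text, scanned, counts = "", 0, {}
    for chunk in chunks:
        text += chunk
        # Re-scan only the unscanned tail so each candidate is counted once.
        for m in BOXED.finditer(text, scanned):
            ans = m.group(1)
            counts[ans] = counts.get(ans, 0) + 1
            scanned = m.end()
            if counts[ans] >= k:
                # Early stop: the remaining reflection tokens are never decoded.
                return ans, text[:m.end()]
    # No candidate repeated k times: fall back to the most frequent one.
    return (max(counts, key=counts.get) if counts else None), text
```

In a real decoding loop the check would run as tokens stream from the model, so generation halts as soon as the stopping condition is met rather than after the full rollout.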