First Try Matters: Revisiting the Role of Reflection in Reasoning Models

October 9, 2025
Authors: Liwei Kang, Yue Deng, Yao Xiao, Zhanfeng Mo, Wee Sun Lee, Lidong Bing
cs.AI

Abstract

Large language models have recently demonstrated significant gains in reasoning ability, often attributed to their capacity to generate longer chains of thought and engage in reflective reasoning. However, the contribution of reflections to performance improvement remains unclear. In this paper, we systematically analyze the rollouts of eight reasoning models on five mathematical datasets. We focus on reflective behaviours where the model has already produced an answer but continues reflecting before finalizing its output. Our analysis reveals that reflections are predominantly confirmatory and rarely alter the model's initial answer, a pattern consistent across models and datasets. To understand the role of reflections in training, we construct supervised fine-tuning (SFT) datasets with varying amounts of reflection steps. We observe that training models on rollouts with more reflection steps primarily enhances first-answer correctness rather than the ability to correct initially wrong answers through reflections. This motivates us to propose a question-aware early-stopping method that improves inference-time token efficiency by stopping the reasoning process once a few plausible candidate answers have been generated, thereby cutting unnecessary reflection steps. Building on this, we further propose dynamically truncating reflections once a candidate answer has appeared during generation, which reduces reasoning tokens by 24.5% across five mathematical datasets with only a 2.9% drop in accuracy.