

TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

March 13, 2026
作者: Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva, Hyeji Kim
cs.AI

Abstract

Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute time even after the answer is generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial as they are fully task and model-dependent. In this paper, we precisely address this and design TERMINATOR, an early-exit strategy for LRMs at inference to mitigate overthinking. The central idea underpinning TERMINATOR is that the first arrival of an LRM's final answer is often predictable, and we leverage these first answer positions to create a novel dataset of optimal reasoning lengths to train TERMINATOR. Powered by this approach, TERMINATOR achieves significant reductions in CoT lengths of 14%-55% on average across four challenging practical datasets: MATH-500, AIME 2025, HumanEval, and GPQA, whilst outperforming current state-of-the-art methods.
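The abstract's core mechanism, an inference-time early exit that halts CoT generation once a learned predictor judges the final answer has already appeared, can be illustrated with a minimal sketch. Everything here is hypothetical: the function names (`generate_with_early_exit`, `step_fn`, `exit_score_fn`), the threshold, and the toy predictor are illustrative stand-ins, not the paper's actual TERMINATOR architecture or features.

```python
# Hypothetical sketch of an early-exit generation loop in the spirit of
# TERMINATOR. The real system trains a predictor on first-answer positions;
# here a toy predictor fires once the token "ANSWER" has been emitted.

def generate_with_early_exit(step_fn, exit_score_fn, max_steps, threshold=0.9):
    """Generate CoT tokens one at a time, stopping as soon as the exit
    predictor's score for the trace so far crosses the threshold."""
    trace = []
    for t in range(max_steps):
        trace.append(step_fn(t))                 # next thinking token (stub)
        if exit_score_fn(trace) >= threshold:    # predicted answer arrival
            break                                # truncate remaining CoT
    return trace

# Toy stand-ins for the model and the learned predictor.
tokens = ["think", "think", "ANSWER", "think", "think"]
trace = generate_with_early_exit(
    step_fn=lambda t: tokens[t],
    exit_score_fn=lambda tr: 1.0 if "ANSWER" in tr else 0.0,
    max_steps=len(tokens),
)
print(trace)  # the two post-answer "think" tokens are never generated
```

Under these toy assumptions the loop emits three of the five tokens, which mirrors the abstract's claim: truncating at the first answer arrival shortens the CoT without changing the answer itself.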