Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
April 29, 2025
Authors: Hasan Abed Al Kader Hammoud, Hani Itani, Bernard Ghanem
cs.AI
Abstract
Large Language Models (LLMs) leverage step-by-step reasoning to solve complex
problems. Standard evaluation practice involves generating a complete reasoning
trace and assessing the correctness of the final answer presented at its
conclusion. In this paper, we challenge the reliance on the final answer by
posing the following two questions: Does the final answer reliably represent
the model's optimal conclusion? Can alternative reasoning paths yield different
results? To answer these questions, we analyze intermediate reasoning steps,
termed subthoughts, and propose a method based on our findings. Our approach
involves segmenting a reasoning trace into sequential subthoughts based on
linguistic cues. We start by prompting the model to generate continuations from
the end-point of each intermediate subthought. We then extract a potential
answer from each of the completed continuations originating from the
different subthoughts. We
find that aggregating these answers by selecting the most frequent one (the
mode) often yields significantly higher accuracy compared to relying solely on
the answer derived from the original complete trace. Analyzing the consistency
among the answers derived from different subthoughts reveals characteristics
that correlate with the model's confidence and correctness, suggesting
potential for identifying less reliable answers. Our experiments across various
LLMs and challenging mathematical reasoning datasets (AIME2024 and AIME2025)
show consistent accuracy improvements, with gains reaching up to 13% and 10%,
respectively. The implementation is available at:
https://github.com/hammoudhasan/SubthoughtReasoner.
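
A minimal sketch of the pipeline the abstract describes: split the trace at linguistic cues, continue generation from each subthought end-point, extract an answer from each completion, and take the mode. This is not the authors' code; the cue list, the generate() stand-in, and the \boxed{} answer format are illustrative assumptions, and the actual implementation is in the SubthoughtReasoner repository linked above.

from collections import Counter
import re

# Illustrative cue words that often open a new reasoning step; the actual
# cues used for segmentation are defined in the SubthoughtReasoner repo.
CUES = ("Wait", "Alternatively", "So", "Hmm", "But", "First", "Next")

def subthought_prefixes(trace: str) -> list[str]:
    """Split a reasoning trace at cue words and return the cumulative
    prefixes, i.e. the trace up to each subthought's end-point."""
    pattern = r"(?=\b(?:" + "|".join(CUES) + r")\b)"
    segments = [s for s in re.split(pattern, trace) if s.strip()]
    return ["".join(segments[: i + 1]) for i in range(len(segments))]

def extract_answer(text: str) -> str | None:
    """Pull the last \\boxed{...} answer, assuming that output format."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1] if matches else None

def mode_answer(question: str, trace: str, generate) -> str | None:
    """Continue the trace from every subthought end-point, extract an
    answer from each completion, and return the most frequent one.
    `generate(prompt)` is a stand-in for any LLM completion call."""
    answers = []
    for prefix in subthought_prefixes(trace):
        completion = generate(question + "\n" + prefix)
        answer = extract_answer(prefix + completion)
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0] if answers else None

Note that the last prefix is the full trace itself, so the original trace's final answer is always among the candidates; the mode acts as a self-consistency vote across the trace's own intermediate states rather than across independently sampled traces, which is what distinguishes this aggregation from standard majority voting.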