

Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think

April 29, 2025
Authors: Hasan Abed Al Kader Hammoud, Hani Itani, Bernard Ghanem
cs.AI

Abstract

Large Language Models (LLMs) leverage step-by-step reasoning to solve complex problems. Standard evaluation practice involves generating a complete reasoning trace and assessing the correctness of the final answer presented at its conclusion. In this paper, we challenge the reliance on the final answer by posing the following two questions: Does the final answer reliably represent the model's optimal conclusion? Can alternative reasoning paths yield different results? To answer these questions, we analyze intermediate reasoning steps, termed subthoughts, and propose a method based on our findings. Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues. We start by prompting the model to generate continuations from the end-point of each intermediate subthought. We extract a potential answer from every completed continuation originating from different subthoughts. We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace. Analyzing the consistency among the answers derived from different subthoughts reveals characteristics that correlate with the model's confidence and correctness, suggesting potential for identifying less reliable answers. Our experiments across various LLMs and challenging mathematical reasoning datasets (AIME2024 and AIME2025) show consistent accuracy improvements, with gains reaching up to 13% and 10% respectively. Implementation is available at: https://github.com/hammoudhasan/SubthoughtReasoner.
PDF · May 4, 2025