

From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

November 14, 2025
Authors: Farima Fatahi Bayat, Pouya Pezeshkpour, Estevam Hruschka
cs.AI

Abstract

Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use: the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Code and data are available at: https://github.com/megagonlabs/TIM.
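The abstract names a preference-optimization-based realignment framework but does not spell out the objective here. A common instantiation of preference optimization is a DPO-style pairwise loss over a preferred trace (e.g., one that reasons through the tool output) and a dispreferred trace (e.g., one exhibiting tool-induced myopia). The sketch below is illustrative only and assumes per-sequence log-probabilities are already available; the function name, argument names, and the beta value are assumptions, not the paper's method.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO-style loss for one preference pair.

    Each argument is the total log-probability the policy (or frozen
    reference) model assigns to the chosen / rejected solution trace.
    The loss pushes the policy to widen its margin over the reference
    in favor of the chosen trace.
    """
    # Implicit rewards: how much the policy has moved away from the
    # reference model on each trace, scaled by beta.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): small when the chosen trace is favored.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy already prefers the chosen (reasoning-coherent) trace relative to the reference, the margin is positive and the loss is below log 2; beta controls how strongly the policy is allowed to drift from the reference model.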
December 1, 2025