From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models
November 14, 2025
Authors: Farima Fatahi Bayat, Pouya Pezeshkpour, Estevam Hruschka
cs.AI
Abstract
Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool-use frequency: the more often a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Code and data are available at: https://github.com/megagonlabs/TIM.