From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

Abstract

Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM) and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use: the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Code and data are available at https://github.com/megagonlabs/TIM.