Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis
August 21, 2025
Authors: Yufeng Zhao, Junnan Liu, Hongwei Liu, Dongsheng Zhu, Yuan Shen, Songyang Zhang, Kai Chen
cs.AI
Abstract
Large Language Models (LLMs) have made significant strides in reasoning tasks
through methods like chain-of-thought (CoT) reasoning. However, they often fall
short in tasks requiring precise computations. Tool-Integrated Reasoning (TIR)
has emerged as a solution by incorporating external tools into the reasoning
process. Nevertheless, how well TIR generalizes in improving the reasoning
ability of LLMs is still unclear. Additionally, whether TIR improves the
model's reasoning behavior and helps the model think better remains to be studied. We
introduce ReasonZoo, a comprehensive benchmark encompassing nine diverse
reasoning categories, to evaluate the effectiveness of TIR across various
domains. Additionally, we propose two novel metrics, Performance-Aware Cost
(PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning
efficiency. Our empirical evaluation demonstrates that TIR-enabled models
consistently outperform their non-TIR counterparts in both mathematical and
non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as
evidenced by improved PAC and AUC-PCC, indicating reduced overthinking and more
streamlined reasoning. These findings underscore the domain-general benefits of
TIR and its potential to advance LLM capabilities in complex reasoning tasks.
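The abstract names but does not define the efficiency metrics. As a rough illustration only, the sketch below shows how an area-under-the-performance-cost-curve style score could be computed by sweeping a token-cost budget and integrating accuracy over it; the function `auc_pcc` and all variable names are hypothetical and are not the paper's actual formulation.

```python
import numpy as np

def auc_pcc(costs, correct, max_cost=None):
    """Hypothetical sketch of an AUC-PCC-style efficiency score.

    costs   : per-example token costs of the model's reasoning traces
    correct : per-example 0/1 correctness flags
    Returns the normalized area under the accuracy-vs-cost-budget curve,
    so correct answers reached with fewer tokens contribute more area
    (higher is better). This is an assumption, not the paper's definition.
    """
    costs = np.asarray(costs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    if max_cost is None:
        max_cost = costs.max()

    # Sweep cost budgets; at each budget, count answers that are both
    # correct and produced within that budget.
    budgets = np.linspace(0.0, max_cost, num=200)
    accuracy_at_budget = [np.mean(correct * (costs <= b)) for b in budgets]

    # Trapezoidal integration, normalized by the budget range so the
    # score lies in [0, 1].
    return np.trapz(accuracy_at_budget, budgets) / max_cost


if __name__ == "__main__":
    # Toy example: model A reaches its correct answers with fewer tokens
    # than model B, so it should score higher.
    costs_a = [200, 350, 500, 800]
    costs_b = [900, 1200, 1500, 2000]
    correct = [1, 1, 0, 1]
    print("model A AUC-PCC:", auc_pcc(costs_a, correct, max_cost=2000))
    print("model B AUC-PCC:", auc_pcc(costs_b, correct, max_cost=2000))
```

Under this toy setup, a model that reduces overthinking (fewer tokens per correct answer) yields a larger area, which matches the direction of improvement the abstract reports for TIR-enabled models.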