Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis
August 21, 2025
Authors: Yufeng Zhao, Junnan Liu, Hongwei Liu, Dongsheng Zhu, Yuan Shen, Songyang Zhang, Kai Chen
cs.AI
Abstract
Large Language Models (LLMs) have made significant strides in reasoning tasks
through methods like chain-of-thought (CoT) reasoning. However, they often fall
short in tasks requiring precise computations. Tool-Integrated Reasoning (TIR)
has emerged as a solution by incorporating external tools into the reasoning
process. Nevertheless, how well TIR generalizes in improving the reasoning
ability of LLMs is still unclear. Additionally, whether TIR improves the
model's reasoning behavior and helps the model think better remains to be studied. We
introduce ReasonZoo, a comprehensive benchmark encompassing nine diverse
reasoning categories, to evaluate the effectiveness of TIR across various
domains. Additionally, we propose two novel metrics, Performance-Aware Cost
(PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning
efficiency. Our empirical evaluation demonstrates that TIR-enabled models
consistently outperform their non-TIR counterparts in both mathematical and
non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as
evidenced by improved PAC and AUC-PCC, indicating reduced overthinking and more
streamlined reasoning. These findings underscore the domain-general benefits of
TIR and its potential to advance LLM capabilities in complex reasoning tasks.
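The abstract names but does not define the efficiency metrics. As a rough illustration only, the sketch below shows how an area-under-the-performance-cost-curve style score could be computed by sweeping a token-cost budget and integrating accuracy over it; the function `auc_pcc` and all variable names are hypothetical and are not the paper's actual formulation.

```python
import numpy as np

def auc_pcc(costs, correct, max_cost=None):
    """Hypothetical sketch of an AUC-PCC-style efficiency score.

    costs   : per-example token costs of the model's reasoning traces
    correct : per-example 0/1 correctness flags
    Returns the normalized area under the accuracy-vs-cost-budget curve,
    so correct answers reached with fewer tokens contribute more area
    (higher is better). This is an assumption, not the paper's definition.
    """
    costs = np.asarray(costs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    if max_cost is None:
        max_cost = costs.max()

    # Sweep cost budgets; at each budget, count answers that are both
    # correct and produced within that budget.
    budgets = np.linspace(0.0, max_cost, num=200)
    accuracy_at_budget = [np.mean(correct * (costs <= b)) for b in budgets]

    # Trapezoidal integration, normalized by the budget range so the
    # score lies in [0, 1].
    return np.trapz(accuracy_at_budget, budgets) / max_cost


if __name__ == "__main__":
    # Toy example: model A reaches its correct answers with fewer tokens
    # than model B, so it should score higher.
    costs_a = [200, 350, 500, 800]
    costs_b = [900, 1200, 1500, 2000]
    correct = [1, 1, 0, 1]
    print("model A AUC-PCC:", auc_pcc(costs_a, correct, max_cost=2000))
    print("model B AUC-PCC:", auc_pcc(costs_b, correct, max_cost=2000))
```

Under this toy setup, a model that reduces overthinking (fewer tokens per correct answer) yields a larger area, which matches the direction of improvement the abstract reports for TIR-enabled models.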