Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis

Abstract

Large Language Models (LLMs) have made significant strides in reasoning tasks through methods such as chain-of-thought (CoT) reasoning. However, they often fall short on tasks that require precise computation. Tool-Integrated Reasoning (TIR) has emerged as a solution that incorporates external tools into the reasoning process. Nevertheless, how well TIR generalizes in improving the reasoning ability of LLMs remains unclear, and whether TIR improves a model's reasoning behavior and helps it think has yet to be studied. We introduce ReasonZoo, a comprehensive benchmark covering nine diverse reasoning categories, to evaluate the effectiveness of TIR across domains. We also propose two novel metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Our empirical evaluation shows that TIR-enabled models consistently outperform their non-TIR counterparts on both mathematical and non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as evidenced by improved PAC and AUC-PCC, which indicates reduced overthinking and more streamlined reasoning. These findings underscore the domain-general benefits of TIR and its potential to advance LLM capabilities on complex reasoning tasks.

