CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios
June 11, 2025
Authors: Shiting Huang, Zhen Fang, Zehui Chen, Siyu Yuan, Junjie Ye, Yu Zeng, Lin Chen, Qi Mao, Feng Zhao
cs.AI
Abstract
The ability of large language models (LLMs) to utilize external tools has enabled them to tackle an increasingly diverse range of tasks. However, as tasks become more complex and long-horizon, the intricate tool-use process may trigger various unexpected errors. How to handle such errors effectively, including identifying, diagnosing, and recovering from them, has therefore emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during function calling on several competitive tool-evaluation benchmarks. Based on this analysis, we introduce CRITICTOOL, a comprehensive critique-evaluation benchmark specialized for tool learning. Built on a novel evolutionary strategy for dataset construction, CRITICTOOL covers diverse tool-use errors of varying complexity, better reflecting real-world scenarios. We conduct extensive experiments on CRITICTOOL and validate the generalization and effectiveness of our benchmark-construction strategy. We also provide an in-depth analysis of the tool-reflection abilities of various LLMs, offering a new perspective on tool learning in LLMs. The code is available at https://github.com/Shellorley0513/CriticTool.
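
To make the evaluation setting concrete, the minimal sketch below shows one hypothetical tool-calling error episode and the kind of self-critique record a model could be scored on along the three axes the abstract names (identify, diagnose, recover). All field names, the `get_weather` tool, and the toy `score_critique` function are illustrative assumptions, not CRITICTOOL's actual data schema or scoring code.

```python
# Hypothetical illustration (not CRITICTOOL's actual schema): a single
# tool-calling episode whose call fails, plus the critique record an
# evaluated LLM would produce in response.

episode = {
    "user_query": "What is the weather in Paris tomorrow?",
    "tool_call": {
        "name": "get_weather",                      # assumed tool name
        "arguments": {"city": "Pariss", "date": "tomorrow"},
    },
    "tool_response": {"error": "UnknownCity: 'Pariss' not found"},
}

model_critique = {
    "identified_error": True,                       # did the model notice the failure?
    "diagnosis": "misspelled city argument",        # root cause it names
    "recovery_call": {                              # corrected retry it proposes
        "name": "get_weather",
        "arguments": {"city": "Paris", "date": "tomorrow"},
    },
}

def score_critique(episode: dict, critique: dict) -> dict:
    """Toy scoring: reward error detection, a non-empty diagnosis, and a
    retry that differs from the failed call. A real benchmark would use
    far richer matching against gold annotations."""
    failed = "error" in episode["tool_response"]
    return {
        "detect": critique["identified_error"] == failed,
        "diagnose": bool(critique["diagnosis"]),
        "recover": critique["recovery_call"] != episode["tool_call"],
    }

print(score_critique(episode, model_critique))
# -> {'detect': True, 'diagnose': True, 'recover': True}
```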