CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios

June 11, 2025
Authors: Shiting Huang, Zhen Fang, Zehui Chen, Siyu Yuan, Junjie Ye, Yu Zeng, Lin Chen, Qi Mao, Feng Zhao
cs.AI

Abstract

The ability of large language models (LLMs) to use external tools has enabled them to tackle an increasingly diverse range of tasks. However, as tasks become more complex and long-horizon, the intricate tool-use process may trigger various unexpected errors. Handling such errors effectively, including identifying, diagnosing, and recovering from them, has therefore emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during function calling on several competitive tool evaluation benchmarks. Based on this analysis, we introduce CRITICTOOL, a comprehensive critique evaluation benchmark specialized for tool learning. Built with a novel evolutionary strategy for dataset construction, CRITICTOOL contains diverse tool-use errors of varying complexity, which better reflects real-world scenarios. We conduct extensive experiments on CRITICTOOL and validate the generalization and effectiveness of our benchmark construction strategy. We also provide an in-depth analysis of the tool reflection ability of various LLMs, offering a new perspective on tool learning in LLMs. The code is available at https://github.com/Shellorley0513/CriticTool.