CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios

Abstract

The ability of large language models (LLMs) to use external tools has enabled them to tackle an increasingly diverse range of tasks. However, as tasks become more complex and long-horizon, the intricate tool-use process may trigger various unexpected errors. How to handle such errors effectively, including identifying, diagnosing, and recovering from them, has therefore emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during the function-calling process on several competitive tool evaluation benchmarks. Based on this analysis, we introduce CRITICTOOL, a comprehensive critique evaluation benchmark specialized for tool learning. Building on a novel evolutionary strategy for dataset construction, CRITICTOOL contains diverse tool-use errors of varying complexity, which better reflects real-world scenarios. We conduct extensive experiments on CRITICTOOL and validate the generalization and effectiveness of our benchmark construction strategy. We also provide an in-depth analysis of the tool reflection ability of various LLMs, offering a new perspective on tool learning in LLMs. The code is available at https://github.com/Shellorley0513/CriticTool.
