CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios

Abstract

The ability of large language models (LLMs) to use external tools has enabled them to tackle an increasingly diverse range of tasks. However, as tasks become more complex and long-horizon, the intricate tool-use process may trigger various unexpected errors. How to handle such errors effectively, including identifying, diagnosing, and recovering from them, has therefore emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during the function-calling process on several competitive tool evaluation benchmarks. Based on this analysis, we introduce CRITICTOOL, a comprehensive critique evaluation benchmark specialized for tool learning. Built on a novel evolutionary strategy for dataset construction, CRITICTOOL contains diverse tool-use errors of varying complexity, which better reflects real-world scenarios. We conduct extensive experiments on CRITICTOOL and validate the generalization and effectiveness of our benchmark construction strategy. We also provide an in-depth analysis of the tool reflection ability of various LLMs, offering a new perspective on tool learning in LLMs. The code is available at https://github.com/Shellorley0513/CriticTool.
