Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models

March 3, 2025
Authors: Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, Zhaochun Ren
cs.AI

Abstract
Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from the real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks, and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even the models with strong performance in conventional IR benchmarks, exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially optimizes the tool retrieval ability of IR models.
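The abstract's first-stage step, using an IR model to select candidate tools from a large toolset before handing them to the LLM, can be sketched generically. This is a toy illustration under stated assumptions, not ToolRet's actual pipeline: the tool corpus, tool names, and bag-of-words scoring below are hypothetical stand-ins for the dense or sparse retrievers the paper benchmarks.

```python
# Toy sketch of first-stage tool retrieval: score each tool description
# against the task query and keep the top-k candidates for the LLM context.
# Real systems use learned embedders; bag-of-words cosine is for illustration.
from collections import Counter
from math import sqrt


def bow_vector(text):
    """Lowercased bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_tools(query, tools, k=2):
    """Return the names of the k tools whose descriptions best match the query."""
    q = bow_vector(query)
    ranked = sorted(tools.items(),
                    key=lambda kv: cosine(q, bow_vector(kv[1])),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical mini tool corpus (not drawn from the ToolRet benchmark).
tools = {
    "weather_api": "get current weather forecast for a city",
    "calculator": "evaluate arithmetic math expressions",
    "translator": "translate text between languages",
}
print(retrieve_tools("what is the weather forecast in Paris", tools, k=1))
```

A run on the toy corpus ranks `weather_api` first, since its description shares the terms "weather" and "forecast" with the query. The paper's point is that this ranking step is much harder on realistic, instruction-heavy tool-retrieval tasks than on conventional IR benchmarks.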
