Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models

Abstract

Today, large language models (LLMs) are taught to use new tools by providing a few demonstrations of the tool's usage. Unfortunately, demonstrations are hard to acquire, and can result in undesirable biased usage if the wrong demonstration is chosen. Even in the rare scenario that demonstrations are readily available, there is no principled selection protocol to determine how many and which ones to provide. As tasks grow more complex, the selection search grows combinatorially and invariably becomes intractable. Our work provides an alternative to demonstrations: tool documentation. We advocate the use of tool documentation, that is, descriptions of individual tool usage, over demonstrations. We substantiate our claim through three main empirical findings on six tasks across both vision and language modalities. First, on existing benchmarks, zero-shot prompts with only tool documentation are sufficient for eliciting proper tool usage, achieving performance on par with few-shot prompts. Second, on a newly collected realistic tool-use dataset with hundreds of available tool APIs, we show that tool documentation is significantly more valuable than demonstrations, with zero-shot documentation significantly outperforming few-shot prompting without documentation. Third, we highlight the benefits of tool documentation by tackling image generation and video tracking using just-released, unseen state-of-the-art models as tools. Finally, we highlight the possibility of using tool documentation to automatically enable new applications: by using nothing more than the documentation of GroundingDino, Stable Diffusion, XMem, and SAM, LLMs can re-invent the functionalities of the just-released Grounded-SAM and Track Anything models.
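To make the contrast between the two prompting styles concrete, the following is a minimal sketch, not the authors' implementation. The `Tool` class, the prompt templates, the tool names `detect` and `segment`, and their one-line descriptions are all hypothetical and chosen only for illustration. The helper `demo_subsets` counts the unordered ways to pick k demonstrations from a pool, which gives a rough sense of why demonstration selection grows combinatorially.

```python
from dataclasses import dataclass
from math import comb


@dataclass(frozen=True)
class Tool:
    """A hypothetical tool entry: a name plus its plain-text documentation."""
    name: str
    doc: str


def zero_shot_prompt(tools, task):
    """Build a prompt from tool documentation only, with no demonstrations."""
    doc_lines = [f"- {t.name}: {t.doc}" for t in tools]
    return "\n".join(
        ["You can call the following tools:", *doc_lines, "", f"Task: {task}", "Plan:"]
    )


def few_shot_prompt(demos, task):
    """Build a prompt from (task, plan) demonstrations only, with no documentation."""
    blocks = [f"Task: {q}\nPlan: {a}" for q, a in demos]
    return "\n\n".join([*blocks, f"Task: {task}\nPlan:"])


def demo_subsets(pool_size, k):
    """Number of unordered ways to choose k demonstrations from a pool."""
    return comb(pool_size, k)


# Hypothetical tools; the names and descriptions are illustrative assumptions.
tools = [
    Tool("detect", "detect(image, text) -> boxes. Find regions matching a text query."),
    Tool("segment", "segment(image, boxes) -> masks. Produce one mask per box."),
]
print(zero_shot_prompt(tools, "Segment every dog in the photo."))
print(demo_subsets(100, 5))  # candidate 5-demonstration sets from a pool of 100
```

The documentation-only prompt grows linearly with the number of tools, whereas the number of candidate demonstration sets grows combinatorially with the size of the demonstration pool.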
