Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models

Abstract

Today, large language models (LLMs) are taught to use new tools by providing a few demonstrations of each tool's usage. Unfortunately, demonstrations are hard to acquire, and choosing the wrong one can bias the model toward undesirable usage. Even in the rare scenario where demonstrations are readily available, there is no principled selection protocol for deciding how many to provide and which ones. As tasks grow more complex, the selection search grows combinatorially and becomes intractable. Our work provides an alternative to demonstrations: tool documentation. We advocate the use of tool documentation, that is, descriptions of how each individual tool is used, over demonstrations. We substantiate this claim through three main empirical findings on six tasks across both vision and language modalities. First, on existing benchmarks, zero-shot prompts with only tool documentation are sufficient to elicit proper tool usage, achieving performance on par with few-shot prompts. Second, on a newly collected realistic tool-use dataset with hundreds of available tool APIs, we show that tool documentation is more valuable than demonstrations: zero-shot prompts with documentation significantly outperform few-shot prompts without documentation. Third, we highlight the benefits of tool documentation by tackling image generation and video tracking using just-released, previously unseen state-of-the-art models as tools. Finally, we highlight the possibility of using tool documentation to automatically enable new applications: using nothing more than the documentation of GroundingDino, Stable Diffusion, XMem, and SAM, LLMs can re-invent the functionalities of the just-released Grounded-SAM and Track Anything models.
