Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models

Abstract

Today, large language models (LLMs) are taught to use new tools by providing a few demonstrations of each tool's usage. Unfortunately, demonstrations are hard to acquire, and choosing the wrong one can result in undesirable, biased usage. Even in the rare scenario where demonstrations are readily available, there is no principled selection protocol for deciding how many to provide and which ones. As tasks grow more complex, the selection search grows combinatorially and invariably becomes intractable. Our work provides an alternative to demonstrations: tool documentation. We advocate using tool documentation, that is, descriptions of how each individual tool is used, over demonstrations. We substantiate this claim through three main empirical findings on six tasks across both vision and language modalities. First, on existing benchmarks, zero-shot prompts with only tool documentation are sufficient for eliciting proper tool usage, achieving performance on par with few-shot prompts. Second, on a newly collected realistic tool-use dataset with hundreds of available tool APIs, we show that tool documentation is far more valuable than demonstrations, with zero-shot prompting with documentation significantly outperforming few-shot prompting without documentation. Third, we highlight the benefits of tool documentation by tackling image generation and video tracking using just-released, unseen state-of-the-art models as tools. Finally, we highlight the possibility of using tool documentation to automatically enable new applications: using nothing more than the documentation of GroundingDino, Stable Diffusion, XMem, and SAM, LLMs can re-invent the functionalities of the just-released Grounded-SAM and Track Anything models.
