LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error

Abstract

Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a critical aspect that has surprisingly been understudied is simply how accurately an LLM uses tools for which it has been trained. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate in the range of 30% to 60%, far from reliable use in practice. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory. Specifically, STE leverages an LLM's 'imagination' to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. We also show effective continual learning of tools via a simple experience replay strategy.
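The exploration loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `explore`, `Trial`, `demo_imagine`, `demo_act`, and `demo_tool` are hypothetical, the LLM and the tool are replaced by stub functions, and the rule "feedback starting with `error` means the trial failed" is an assumption made only so the example runs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trial:
    query: str     # imagined user query
    call: str      # tool call the model attempted
    feedback: str  # execution feedback returned by the tool


def explore(
    tool_desc: str,
    imagine: Callable[[str, List[str]], str],  # hypothetical: (tool description, past queries) -> new query
    act: Callable[[str, List[Trial]], str],    # hypothetical: (query, trials so far) -> tool call
    tool: Callable[[str], str],                # hypothetical: tool call -> execution feedback
    episodes: int = 3,
    max_trials: int = 2,
) -> List[Trial]:
    long_term: List[str] = []     # queries from earlier episodes, used to broaden exploration
    experience: List[Trial] = []  # all trials collected across episodes
    for _ in range(episodes):
        # "Imagination": propose a new scenario, conditioned on what was already explored.
        query = imagine(tool_desc, long_term)
        short_term: List[Trial] = []  # trials within this episode, used to deepen exploration
        for _ in range(max_trials):
            # Trial and error: act, observe feedback, and retry with the episode history.
            call = act(query, short_term)
            feedback = tool(call)
            short_term.append(Trial(query, call, feedback))
            if not feedback.startswith("error"):  # assumed success criterion
                break
        long_term.append(query)
        experience.extend(short_term)
    return experience


# Stub components standing in for the LLM and a real tool (assumptions for the demo).
def demo_imagine(tool_desc: str, past: List[str]) -> str:
    return f"query {len(past) + 1} for {tool_desc}"


def demo_act(query: str, trials: List[Trial]) -> str:
    # The first attempt is malformed; the retry succeeds after seeing error feedback.
    return "bad_call" if not trials else "good_call"


def demo_tool(call: str) -> str:
    return "ok" if call == "good_call" else "error: unknown call"


if __name__ == "__main__":
    for t in explore("weather", demo_imagine, demo_act, demo_tool, episodes=2, max_trials=3):
        print(t)
```

In this sketch the short-term list is reset for every episode while the long-term list persists across episodes, mirroring the depth and breadth roles the abstract assigns to the two kinds of memory. The collected trials could then serve as in-context examples or fine-tuning data, which are the two settings the abstract evaluates.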
