LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error
March 7, 2024
Authors: Boshi Wang, Hao Fang, Jason Eisner, Benjamin Van Durme, Yu Su
cs.AI
Abstract
Tools are essential for large language models (LLMs) to acquire up-to-date
information and take consequential actions in external environments. Existing
work on tool-augmented LLMs primarily focuses on the broad coverage of tools
and the flexibility of adding new tools. However, a critical aspect that has
surprisingly been understudied is simply how accurately an LLM uses tools for
which it has been trained. We find that existing LLMs, including GPT-4 and
open-source LLMs specifically fine-tuned for tool use, only reach a correctness
rate in the range of 30% to 60%, far from reliable use in practice. We propose
a biologically inspired method for tool-augmented LLMs, simulated trial and
error (STE), that orchestrates three key mechanisms for successful tool use
behaviors in the biological system: trial and error, imagination, and memory.
Specifically, STE leverages an LLM's 'imagination' to simulate plausible
scenarios for using a tool, after which the LLM interacts with the tool to
learn from its execution feedback. Both short-term and long-term memory are
employed to improve the depth and breadth of the exploration, respectively.
Comprehensive experiments on ToolBench show that STE substantially improves
tool learning for LLMs under both in-context learning and fine-tuning settings,
bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform
GPT-4. We also show effective continual learning of tools via a simple
experience replay strategy.
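
To make the described mechanism concrete, the following is a minimal Python sketch of an STE-style exploration loop: imagine a usage scenario, try the tool, and learn from execution feedback, with short-term memory within an episode and long-term memory across episodes. The llm() and execute_tool() stubs, the episode/trial structure, and all names and parameters are illustrative assumptions, not the authors' implementation.

# Minimal sketch of a simulated-trial-and-error (STE) exploration loop.
# All functions and parameters here are hypothetical placeholders.

def llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real model endpoint."""
    return "(model output for: " + prompt[:40] + "...)"

def execute_tool(api_call: str) -> str:
    """Hypothetical tool/API executor; returns execution feedback."""
    return "(execution result of: " + api_call + ")"

def simulated_trial_and_error(tool_spec: str, episodes: int = 3, trials: int = 2):
    long_term_memory = []                      # distilled experiences across episodes (breadth)
    for _ in range(episodes):
        # "Imagination": propose a plausible usage scenario, conditioned on
        # past episodes so new scenarios cover different tool behaviors.
        scenario = llm(
            f"Tool spec:\n{tool_spec}\n"
            f"Previously explored scenarios:\n{long_term_memory}\n"
            "Imagine a new, plausible user query this tool could answer."
        )
        short_term_memory = []                 # trial history within this episode (depth)
        for _ in range(trials):
            # Trial and error: produce a tool call grounded in earlier attempts.
            api_call = llm(
                f"Scenario: {scenario}\nEarlier trials: {short_term_memory}\n"
                "Write the tool call to try next."
            )
            feedback = execute_tool(api_call)  # execution feedback, including errors
            short_term_memory.append({"call": api_call, "feedback": feedback})
        # Distill the episode into long-term memory for later exploration.
        long_term_memory.append({"scenario": scenario, "trials": short_term_memory})
    return long_term_memory

if __name__ == "__main__":
    memory = simulated_trial_and_error("weather_api(city: str) -> current weather")
    print(f"collected {len(memory)} exploration episodes")

In the setting the abstract describes, the collected episodes would then serve as in-context exemplars or fine-tuning data for the tool-using LLM.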