

GTA: A Benchmark for General Tool Agents

July 11, 2024
作者: Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, Xinyi Le
cs.AI

Abstract

Significant focus has been placed on integrating large language models (LLMs) with various tools in developing general-purpose agents, which poses a challenge to LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only interactions, failing to effectively reveal agents' real-world problem-solving abilities. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool use, requiring the LLM to reason about the suitable tools and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across the perception, operation, logic, and creativity categories, to evaluate the agents' actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as query contexts to align closely with real-world scenarios. We design 229 real-world tasks with executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs: GPT-4 completes less than 50% of the tasks, and most LLMs achieve below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios and provides directions for advancing general-purpose tool agents. The code and dataset are available at https://github.com/open-compass/GTA.
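To make the evaluation setup concrete, the sketch below shows what a GTA-style task record and the headline completion-rate metric might look like. This is a minimal illustration only: the field names (`query`, `tool_chain`, `completed`) and the tool names in the toy records are assumptions for exposition, not the benchmark's actual schema or API.

```python
# Hypothetical sketch of a GTA-style evaluation record and the task
# completion-rate metric. Field and tool names are illustrative
# assumptions, not the benchmark's real schema.
from dataclasses import dataclass


@dataclass
class TaskResult:
    query: str             # human-written query with implicit tool use
    tool_chain: list[str]  # executable tool chain the agent should follow
    completed: bool        # did the agent finish the task end to end?


def completion_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks the agent completed end to end."""
    if not results:
        return 0.0
    return sum(r.completed for r in results) / len(results)


# Toy run over three tasks in the style of the 229 benchmark tasks.
results = [
    TaskResult("How many people appear in this scene photo?",
               ["ImageDescription", "CountGivenObject"], True),
    TaskResult("Total the prices shown in this receipt image.",
               ["OCR", "Calculator"], False),
    TaskResult("Generate a logo matching this hand-drawn sketch.",
               ["ImageDescription", "TextToImage"], True),
]
print(f"completion rate: {completion_rate(results):.2f}")
```

Under this framing, the paper's headline numbers (GPT-4 below 50%, most LLMs below 25%) are values of this per-task completion rate aggregated over the full task set.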

