
GTA: A Benchmark for General Tool Agents

July 11, 2024
Authors: Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, Xinyi Le
cs.AI

Abstract

Significant focus has been placed on integrating large language models (LLMs) with various tools in developing general-purpose agents. This poses a challenge to LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only interactions, failing to reveal the agents' real-world problem-solving abilities effectively. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool use, requiring the LLM to reason about the suitable tools and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as the query contexts to align closely with real-world scenarios. We design 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios, which points to future directions for advancing general-purpose tool agents. The code and dataset are available at https://github.com/open-compass/GTA.
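The headline numbers (GPT-4 below 50%, most LLMs below 25%) are task-completion rates over the 229 tasks. A minimal sketch of how such a rate is computed, assuming per-task pass/fail outcomes; the function name and the example outcome lists here are hypothetical, and GTA's actual evaluation harness is in the linked repository:

```python
# Hypothetical sketch of the task-completion metric: the fraction of
# benchmark tasks an agent completed. The per-task booleans below are
# made up for illustration, not real GTA results.

def completion_rate(results):
    """Fraction of tasks completed, given a list of True/False outcomes."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# Illustrative (invented) outcomes over 229 tasks for two models:
strong_model = [True] * 110 + [False] * 119   # ~48%, i.e. below 50%
small_model  = [True] * 50 + [False] * 179    # ~22%, i.e. below 25%

print(f"Strong model: {completion_rate(strong_model):.1%}")
print(f"Small model:  {completion_rate(small_model):.1%}")
```

Because each query maps to an executable tool chain, completion can be judged by whether the agent's final answer (after real tool calls) matches the reference, rather than by text similarity alone.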

