GTA: A Benchmark for General Tool Agents

Abstract

Significant attention has been devoted to integrating large language models (LLMs) with various tools in the development of general-purpose agents. This poses a challenge to LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often rely on AI-generated queries, single-step tasks, dummy tools, and text-only interactions, and therefore fail to reveal agents' real-world problem-solving abilities. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool use, requiring the LLM to reason about which tools are suitable and to plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across the perception, operation, logic, and creativity categories, used to evaluate agents' actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as query contexts to closely align with real-world scenarios. We design 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios and points to future directions for advancing general-purpose tool agents. The code and dataset are available at https://github.com/open-compass/GTA.
