GTA: A Benchmark for General Tool Agents

Abstract

Much recent work focuses on integrating large language models (LLMs) with various tools to build general-purpose agents, which places heavy demands on LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often rely on AI-generated queries, single-step tasks, dummy tools, and text-only interactions, and so fail to reveal agents' real-world problem-solving abilities effectively. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool use, requiring the LLM to reason about which tools are suitable and to plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across the perception, operation, logic, and creativity categories, used to evaluate agents' actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as query contexts to align closely with real-world scenarios. We design 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs: GPT-4 completes less than 50% of the tasks, and most LLMs achieve below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios and provides future directions for advancing general-purpose tool agents. The code and dataset are available at https://github.com/open-compass/GTA.
