GTA-2: Benchmarking General Tool Agents from Atomic Tool Use to Open-Ended Workflows

Abstract

The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements: they rely on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose GTA-2, a hierarchical benchmark for General Tool Agents (GTA) that spans atomic tool use and open-ended workflows. It is grounded in real-world scenarios, using real user queries, deployed tools, and multimodal contexts. (i) GTA-Atomic, inherited from our prior GTA benchmark, evaluates short-horizon, closed-ended tool-use precision. (ii) GTA-Workflow introduces long-horizon, open-ended tasks that test realistic end-to-end completion. To evaluate open-ended deliverables, we propose a recursive checkpoint-based evaluation mechanism that decomposes objectives into verifiable sub-goals, enabling unified evaluation of both model capabilities and agent execution frameworks (i.e., execution harnesses). Experiments reveal a pronounced capability cliff: frontier models already struggle on atomic tasks (below 50%) and largely fail on workflows, where top models achieve only 14.39% success. Further analysis shows that checkpoint-guided feedback improves performance, and that advanced frameworks such as Manus and OpenClaw substantially improve workflow completion, highlighting that execution harness design matters in addition to underlying model capability. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at https://github.com/open-compass/GTA.
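
As a rough illustration of how a recursive checkpoint-based evaluation could decompose an objective into verifiable sub-goals, the following Python sketch scores a deliverable against a tree of checkpoints. It is not the GTA-2 implementation: the `Checkpoint` class, the `score` function, the dictionary-shaped deliverable, and the rule that a parent's score is the unweighted mean of its children are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Checkpoint:
    """Hypothetical checkpoint node: a leaf carries a verifier, a parent carries sub-goals."""
    name: str
    check: Optional[Callable[[dict], bool]] = None   # leaf-level verifier (assumed interface)
    children: List["Checkpoint"] = field(default_factory=list)


def score(cp: Checkpoint, deliverable: dict) -> float:
    """Recursively score a deliverable in [0, 1].

    Assumption: a leaf scores 1.0 if its verifier passes and 0.0 otherwise;
    a parent scores the unweighted mean of its children.
    """
    if not cp.children:
        passed = cp.check is not None and cp.check(deliverable)
        return 1.0 if passed else 0.0
    return sum(score(child, deliverable) for child in cp.children) / len(cp.children)


# Illustrative objective: "produce a report" split into verifiable sub-goals.
report = Checkpoint(
    name="report",
    children=[
        Checkpoint("has_title", check=lambda d: bool(d.get("title"))),
        Checkpoint(
            name="content",
            children=[
                Checkpoint("has_chart", check=lambda d: len(d.get("charts", [])) > 0),
                Checkpoint("has_summary", check=lambda d: bool(d.get("summary"))),
            ],
        ),
    ],
)

# Made-up deliverable: title and chart present, summary missing.
deliverable = {"title": "Q3 review", "charts": ["sales.png"], "summary": ""}
result = score(report, deliverable)
print(result)
```

In this sketch, the missing summary halves the score of the `content` sub-goal, which in turn lowers the root score, so partial progress on an open-ended task remains visible rather than collapsing to a single pass or fail.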