OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents

October 28, 2025
Authors: Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, Fei Huang
cs.AI

Abstract

With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents' tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection of existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, and from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates (only 36.3%), indicating room for improvement and highlighting how challenging the benchmark is. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. Our code, environment, and data are publicly available at https://osworld-mcp.github.io.
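
To put the reported success-rate changes on a common footing, the short sketch below computes the absolute gain (in percentage points) and the relative gain for the two examples quoted in the abstract. The function name `gains` and the dictionary layout are illustrative choices, not part of the OSWorld-MCP code; only the four percentages come from the abstract.

```python
# Success rates (%) quoted in the abstract: (without MCP tools, with MCP tools).
# The labels and data layout are illustrative, not taken from the benchmark code.
reported = {
    "OpenAI o3 @ 15 steps": (8.3, 20.4),
    "Claude 4 Sonnet @ 50 steps": (40.1, 43.3),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


for name, (before, after) in reported.items():
    abs_gain, rel_gain = gains(before, after)
    print(f"{name}: +{abs_gain} points ({rel_gain}% relative)")
```

Under these numbers the o3 example gains 12.1 points (roughly 146% relative), while the Claude 4 Sonnet example gains 3.2 points (about 8% relative), so the benefit of MCP tools varies considerably between the two quoted configurations.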