
OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents

October 28, 2025
Authors: Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, Fei Huang
cs.AI

Abstract

With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents' tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection of existing tools. Rigorous manual validation yields 158 high-quality tools covering 7 common applications, each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, and from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates (only 36.3%), indicating room for improvement and highlighting how challenging the benchmark is. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. Our code, environment, and data are publicly available at https://osworld-mcp.github.io.
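To put the two quoted success-rate pairs on a common footing, the short sketch below computes the absolute gain (in percentage points) and the relative gain for each. It uses only the four numbers stated in the abstract; the helper name `gain` and the choice to report relative gain are illustrative assumptions, not metrics defined by the benchmark.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent).

    Hypothetical helper for illustration; both inputs are success rates in percent.
    """
    absolute = after - before
    relative = absolute / before * 100.0
    # Round to one decimal place to match the precision quoted in the abstract.
    return round(absolute, 1), round(relative, 1)


# Success rates without and with MCP tools, as quoted in the abstract.
reported = {
    "OpenAI o3 (15 steps)": (8.3, 20.4),
    "Claude 4 Sonnet (50 steps)": (40.1, 43.3),
}

for agent, (without_tools, with_tools) in reported.items():
    abs_pp, rel_pct = gain(without_tools, with_tools)
    print(f"{agent}: +{abs_pp} points ({rel_pct}% relative)")
```

Read this way, the two examples differ considerably: the o3 result more than doubles from a low base, while the Claude 4 Sonnet result is a modest gain on an already higher success rate.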