MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

June 1, 2026
Authors: Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang, Jingxing Wang, Haoting Shi, Yaxin Du, Jingyi Chai, Xianghe Pang, Shuo Tang, Yanfeng Wang, Siheng Chen
cs.AI

Abstract

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks focus predominantly on generic information-seeking tools and fail to capture the practical challenges posed by personal social applications, where tools interact with individual accounts or local databases. To bridge this gap, we introduce MCP-Persona, the first benchmark specifically designed to evaluate agent performance on real-world, personalized MCP tools. MCP-Persona covers a diverse set of widely used applications, ranging from social media platforms such as Reddit and Xiaohongshu (Rednote) to enterprise collaboration suites such as Lark (Feishu) and Slack. Our extensive experiments on various state-of-the-art (SOTA) agents show that they struggle significantly with personalized tool use, highlighting the benchmark's role in identifying and addressing these limitations. MCP-Persona is publicly available at https://github.com/wwh0411/MCP-Persona.