MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

June 1, 2026
Authors: Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang, Jingxing Wang, Haoting Shi, Yaxin Du, Jingyi Chai, Xianghe Pang, Shuo Tang, Yanfeng Wang, Siheng Chen
cs.AI

Abstract

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic information-seeking tools and fail to capture the practical challenges posed by personal social applications, where tools interact with individual accounts or local databases. To bridge this gap, we introduce MCP-Persona, the first benchmark specifically designed for evaluating agent performance on real-world, personalized MCP tools. MCP-Persona covers a diverse set of widely used applications, ranging from social media platforms such as Reddit and Xiaohongshu (Rednote) to enterprise collaboration suites such as Lark (Feishu) and Slack. Our extensive experiments on various state-of-the-art (SOTA) agents show that they struggle significantly with personalized tool use, highlighting the benchmark's role in identifying and addressing these limitations. MCP-Persona is publicly available at https://github.com/wwh0411/MCP-Persona.