

MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

December 22, 2025
作者: Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, Yue Wang
cs.AI

Abstract

Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications, while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) compared to AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide a snapshot-based container environment and precise functional verification, including backend database inspection and task callback APIs. We further develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.
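To make the "extended action space" concrete: a minimal sketch of how a planner-executor agent might represent GUI actions alongside user-interaction and MCP-call actions. All type names, fields, and the stubbed `execute` dispatcher below are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical action types for an extended mobile-agent action space:
# standard GUI manipulation plus the two novel categories the benchmark adds.

@dataclass
class Tap:            # standard GUI action
    x: int
    y: int

@dataclass
class AskUser:        # agent-user interaction: request clarification
    question: str

@dataclass
class McpCall:        # MCP-augmented action: invoke an external tool
    tool: str
    arguments: dict

Action = Union[Tap, AskUser, McpCall]

def execute(action: Action) -> str:
    """Dispatch one action; environment responses are stubbed for illustration."""
    if isinstance(action, Tap):
        return f"tapped ({action.x}, {action.y})"
    if isinstance(action, AskUser):
        return f"asked user: {action.question!r}"
    if isinstance(action, McpCall):
        return f"called {action.tool} with {action.arguments}"
    raise ValueError(f"unknown action: {action!r}")

# The planner emits a sequence of actions; the executor dispatches each one.
plan = [
    Tap(120, 640),
    AskUser("Which contact should receive the file?"),
    McpCall("calendar.create_event", {"title": "Sync"}),
]
results = [execute(a) for a in plan]
```

A benchmark harness built this way can verify outcomes functionally (e.g. by inspecting a backend database after the MCP call) rather than only matching on-screen state.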