
MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

December 22, 2025
作者: Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, Yue Wang
cs.AI

Abstract
Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant standard due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) than AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide a snapshot-based container environment and precise functional verification, including backend database inspection and task callback APIs. We further develop a planner-executor agentic framework with an extended action space to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.
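The abstract describes an action space that extends standard GUI operations with two new action types: asking the user a clarifying question and invoking an MCP tool. The minimal sketch below illustrates what such a planner-executor dispatch loop could look like; it is not the paper's implementation, and all names (`Action`, `run_episode`, the callback parameters) are hypothetical.

```python
# Illustrative sketch of an extended action space for a planner-executor
# agent: GUI actions plus "ask_user" and "mcp_call", as described in the
# MobileWorld abstract. All identifiers here are hypothetical.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str      # "gui", "ask_user", or "mcp_call"
    payload: str   # a UI command, a clarifying question, or an MCP tool name


def run_episode(plan: List[Action],
                gui_exec: Callable[[str], str],
                user_reply: Callable[[str], str],
                mcp_invoke: Callable[[str], str],
                max_steps: int = 50) -> List[str]:
    """Execute a planner-produced list of actions, dispatching each one
    to the GUI executor, the user, or an MCP tool by its kind."""
    trace = []
    for step, action in enumerate(plan):
        if step >= max_steps:          # long-horizon tasks still get a step budget
            break
        if action.kind == "gui":
            trace.append(gui_exec(action.payload))
        elif action.kind == "ask_user":
            trace.append(user_reply(action.payload))   # agent-user interaction
        elif action.kind == "mcp_call":
            trace.append(mcp_invoke(action.payload))   # MCP-augmented step
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
    return trace


if __name__ == "__main__":
    # Stub callbacks stand in for a real device, user, and MCP client.
    plan = [Action("gui", "tap:compose"),
            Action("ask_user", "Which recipient?"),
            Action("mcp_call", "calendar.create_event")]
    trace = run_episode(plan,
                        gui_exec=lambda cmd: f"done:{cmd}",
                        user_reply=lambda q: "Alice",
                        mcp_invoke=lambda tool: f"ok:{tool}")
    print(trace)
```

The design point the sketch tries to capture is that user queries and MCP invocations are first-class actions in the same plan as GUI steps, so the executor can interleave them freely within a single long-horizon task.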