MobileWorldBench: Towards Semantic World Modeling For Mobile Agents
December 16, 2025
作者: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Aditya Grover
cs.AI
Abstract
World models have shown great utility in improving the task performance of embodied agents. While prior work largely focuses on pixel-space world models, these approaches face practical limitations in GUI settings, where predicting complex visual elements in future states is often difficult. In this work, we explore an alternative formulation of world modeling for GUI agents, in which state transitions are described in natural language rather than predicted as raw pixels. First, we introduce MobileWorldBench, a benchmark that evaluates the ability of vision-language models (VLMs) to function as world models for mobile GUI agents. Second, we release MobileWorld, a large-scale dataset of 1.4M samples that significantly improves the world modeling capabilities of VLMs. Finally, we propose a novel framework that integrates VLM world models into the planning pipeline of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates. The code and dataset are available at https://github.com/jacklishufan/MobileWorld
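The core idea can be illustrated with a minimal sketch, not taken from the paper's released code: a semantic world model maps a (state description, candidate action) pair to a natural-language prediction of the next state, and a planner ranks candidate actions by how well their predicted next states match the task goal. Here the VLM is replaced by a hypothetical lookup stub, and the scorer is a toy keyword matcher; a real system would query a vision-language model with the screenshot and action.

```python
# Minimal sketch of semantic world modeling for GUI planning.
# The "world model" below is a stub standing in for a VLM that,
# given a screen description and a candidate action, returns a
# natural-language description of the predicted next state.

def world_model(state: str, action: str) -> str:
    # Hypothetical stand-in for a VLM call (illustrative only).
    transitions = {
        ("home screen", "tap Settings icon"): "the Settings menu is open",
        ("home screen", "tap Camera icon"): "the camera viewfinder is shown",
    }
    return transitions.get((state, action), "the screen is unchanged")

def score(prediction: str, goal: str) -> int:
    # Toy scorer: count goal keywords mentioned in the prediction.
    text = prediction.lower()
    return sum(word in text for word in goal.lower().split())

def plan_step(state: str, goal: str, candidate_actions: list[str]) -> str:
    # Pick the action whose predicted next state best matches the goal.
    return max(candidate_actions,
               key=lambda a: score(world_model(state, a), goal))

best = plan_step(
    state="home screen",
    goal="open the settings menu",
    candidate_actions=["tap Settings icon", "tap Camera icon"],
)
print(best)  # -> "tap Settings icon"
```

Because predictions are short text rather than rendered frames, the planner sidesteps the need to synthesize complex GUI visuals, which is the practical limitation of pixel-space world models that the abstract highlights.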