DigiData: Training and Evaluating General-Purpose Mobile Control Agents
November 10, 2025
Authors: Yuxuan Sun, Manchen Wang, Shengyi Qian, William R. Wong, Eric Gan, Pierluca D'Oro, Alejandro Castillejo Munoz, Sneha Silwal, Pedro Matias, Nitin Kamra, Satwik Kottur, Nick Raines, Xuanyi Zhao, Joy Chen, Joseph Greer, Andrea Madotto, Allen Bolourchi, James Valori, Kevin Carlberg, Karl Ridgeway, Joseph Tighe
cs.AI
Abstract
AI agents capable of controlling user interfaces have the potential to
transform human interaction with digital devices. To accelerate this
transformation, two fundamental building blocks are essential: high-quality
datasets that enable agents to achieve complex and human-relevant goals, and
robust evaluation methods that allow researchers and practitioners to rapidly
enhance agent performance. In this paper, we introduce DigiData, a large-scale,
high-quality, diverse, multi-modal dataset designed for training mobile control
agents. Unlike existing datasets, which derive goals from unstructured
interactions, DigiData is meticulously constructed through comprehensive
exploration of app features, resulting in greater diversity and higher goal
complexity. Additionally, we present DigiData-Bench, a benchmark for evaluating
mobile control agents on real-world complex tasks. We demonstrate that the
commonly used step-accuracy metric falls short in reliably assessing mobile
control agents and, to address this, we propose dynamic evaluation protocols
and AI-powered evaluations as rigorous alternatives for agent assessment. Our
contributions aim to significantly advance the development of mobile control
agents, paving the way for more intuitive and effective human-device
interactions.