VenusBench-Mobile: A Challenging, User-Oriented Benchmark Framework and Capability Diagnosis System for Mobile GUI Agents

Abstract

Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, so they fail to reflect the diversity and instability of real-world mobile usage. To address this, we introduce VenusBench-Mobile, a challenging online benchmark for evaluating general-purpose mobile GUI agents under realistic, user-centric conditions. VenusBench-Mobile rests on two evaluation pillars: what to evaluate, defined through user-intent-driven task design that reflects real mobile usage, and how to evaluate, defined through a capability-oriented annotation scheme that supports fine-grained analysis of agent behavior. Extensive evaluation of state-of-the-art mobile GUI agents reveals large performance gaps relative to prior benchmarks, indicating that VenusBench-Mobile poses substantially more challenging and realistic tasks and that current agents remain far from reliable real-world deployment. Diagnostic analysis further shows that failures are dominated by deficiencies in perception and memory, which coarse-grained evaluations largely obscure. Moreover, even the strongest agents exhibit near-zero success under environment variations, highlighting their brittleness in realistic settings. Based on these findings, we believe VenusBench-Mobile provides an important stepping stone toward robust real-world deployment of mobile GUI agents. Code and data are available at https://github.com/inclusionAI/UI-Venus/tree/VenusBench-Mobile.