Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
February 24, 2026
Authors: Jiachen Zhu, Lingyu Yang, Rong Shan, Congmin Zheng, Zeyu Zheng, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin
cs.AI
Abstract
The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization capabilities. We introduce the "Turing Test on Screen," formally modeling the interaction as a MinMax optimization problem between a detector and an agent that aims to minimize behavioral divergence. We then collect a new high-fidelity dataset of mobile touch dynamics, and our analysis shows that vanilla LMM-based agents are easily detectable due to unnatural kinematics. Consequently, we establish the Agent Humanization Benchmark (AHB) and detection metrics to quantify the trade-off between imitability and utility. Finally, we propose methods ranging from heuristic noise injection to data-driven behavioral matching, demonstrating both theoretically and empirically that agents can achieve high imitability without sacrificing performance. This work shifts the paradigm from whether an agent can perform a task to how it performs that task within a human-centric ecosystem, laying the groundwork for seamless coexistence in adversarial digital environments.
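To make the "heuristic noise injection" baseline concrete, here is a minimal sketch (not the paper's implementation; the function name, noise scale, and 350 ms duration are illustrative assumptions) that perturbs an idealized straight-line swipe with correlated spatial jitter and an ease-in/ease-out timing profile, countering the two detectable artifacts the abstract points to: perfectly straight paths and uniform inter-event timing.

```python
import numpy as np

def humanize_trajectory(points, rng=None):
    """Add human-like jitter and timing variance to an ideal touch path.

    `points` is an (N, 2) array of (x, y) screen coordinates along an
    agent-generated swipe; returns an (N, 3) array of (x, y, t) samples.
    Illustrative heuristic only, not the method from the paper.
    """
    if rng is None:
        rng = np.random.default_rng()
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Spatial jitter: cumulative (random-walk) Gaussian noise, so
    # deviations are correlated across samples like natural finger
    # tremor, rather than i.i.d. per-pixel noise.
    jitter = np.cumsum(rng.normal(0.0, 0.4, size=pts.shape), axis=0)
    noisy = pts + jitter
    # Temporal profile: smoothstep easing over ~350 ms instead of a
    # uniform time step, approximating the bell-shaped speed profile
    # of human swipes (slow start, fast middle, slow end).
    u = np.linspace(0.0, 1.0, n)
    t = 0.35 * (3 * u**2 - 2 * u**3)
    return np.column_stack([noisy, t])
```

Correlated (random-walk) jitter is used deliberately: independent per-sample noise produces high-frequency tremor that is itself a strong detection signature, so a naive i.i.d. perturbation can make an agent easier, not harder, to spot.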