

Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

February 24, 2026
Authors: Jiachen Zhu, Lingyu Yang, Rong Shan, Congmin Zheng, Zeyu Zheng, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin
cs.AI

Abstract

The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization capabilities. We introduce the "Turing Test on Screen," formally modeling the interaction as a MinMax optimization problem between a detector and an agent that aims to minimize behavioral divergence. We then collect a new high-fidelity dataset of mobile touch dynamics; our analysis shows that vanilla LMM-based agents are easily detectable due to unnatural kinematics. Consequently, we establish the Agent Humanization Benchmark (AHB) and detection metrics to quantify the trade-off between imitability and utility. Finally, we propose methods ranging from heuristic noise injection to data-driven behavioral matching, demonstrating both theoretically and empirically that agents can achieve high imitability without sacrificing performance. This work shifts the paradigm from whether an agent can perform a task to how it performs it within a human-centric ecosystem, laying the groundwork for seamless coexistence in adversarial digital environments.
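To make the "heuristic noise injection" idea concrete, here is a minimal, illustrative sketch (not the paper's actual method; the function name, parameters, and noise distributions are assumptions): a perfectly straight, evenly timed synthetic swipe is perturbed with Gaussian spatial jitter and log-normal inter-event timing, two simple proxies for human touch kinematics.

```python
import math
import random

def humanize_swipe(start, end, n_points=20, jitter_px=2.0, seed=None):
    """Illustrative heuristic noise injection (hypothetical helper, not
    the paper's method): convert an ideal straight-line swipe into a
    trajectory with human-like spatial jitter and variable timing.

    Returns a list of (x, y, t_ms) touch samples.
    """
    rng = random.Random(seed)
    (x0, y0), (x1, y1) = start, end
    points, t = [], 0.0
    for i in range(n_points):
        frac = i / (n_points - 1)
        # Linear interpolation plus small Gaussian jitter off the ideal path.
        x = x0 + frac * (x1 - x0) + rng.gauss(0.0, jitter_px)
        y = y0 + frac * (y1 - y0) + rng.gauss(0.0, jitter_px)
        # Log-normal inter-sample intervals mimic irregular human touch
        # sampling (assumed ~8 ms median interval).
        t += rng.lognormvariate(math.log(8.0), 0.35)
        points.append((x, y, t))
    return points

# Example: a vertical swipe from (100, 800) up to (100, 300).
trace = humanize_swipe((100, 800), (100, 300), seed=42)
```

A detector trained on raw touch dynamics would flag zero-jitter, constant-interval traces immediately; even this crude injection removes those two trivial giveaways, though the paper's data-driven behavioral matching goes further by fitting real human distributions.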