Mobile GUI Agents under Real-world Threats: Are We There Yet?

Abstract

Recent years have witnessed rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse device-control tasks based on natural language instructions. The increasing accuracy of these agents on standard benchmarks has raised expectations for large-scale real-world deployment, and several commercial agents have already been released and are in use by early adopters. However, are we really ready for GUI agents integrated into our daily devices as system building blocks? We argue that an important pre-deployment validation is missing: one that examines whether the agents can maintain their performance under real-world threats. Specifically, existing common benchmarks rely on simple static app content, which they must do to keep the environment consistent across tests. Real-world apps, by contrast, are filled with content from untrustworthy third parties, such as advertisement emails and user-generated posts and media. ... To this end, we introduce a scalable app content instrumentation framework that enables flexible and targeted content modifications within existing applications. Leveraging this framework, we create a test suite comprising both a dynamic task execution environment and a static dataset of challenging GUI states. The dynamic environment encompasses 122 reproducible tasks, and the static dataset consists of over 3,000 scenarios constructed from commercial apps. We perform experiments on both open-source and commercial GUI agents. Our findings reveal that all examined agents can be significantly degraded by third-party content, with average misleading rates of 42.0% and 36.1% in the dynamic and static environments, respectively. The framework and benchmark have been released at https://agenthazard.github.io.
