Mobile GUI Agents under Real-world Threats: Are We There Yet?

Abstract

Recent years have witnessed the rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse device-control tasks from natural language instructions. The increasing accuracy of these agents on standard benchmarks has raised expectations for large-scale real-world deployment, and several commercial agents have already been released and are in use by early adopters. However, are we really ready for GUI agents to be integrated into our daily devices as system building blocks? We argue that an important pre-deployment validation is missing: examining whether the agents can maintain their performance under real-world threats. Specifically, unlike existing common benchmarks, which are based on simple static app content (a constraint they accept to keep the environment consistent across tests), real-world apps are filled with content from untrustworthy third parties, such as advertisement emails and user-generated posts and media. ... To this end, we introduce a scalable app content instrumentation framework that enables flexible and targeted content modifications within existing applications. Leveraging this framework, we create a test suite comprising both a dynamic task execution environment and a static dataset of challenging GUI states. The dynamic environment encompasses 122 reproducible tasks, and the static dataset consists of over 3,000 scenarios constructed from commercial apps. We perform experiments on both open-source and commercial GUI agents. Our findings reveal that the performance of all examined agents can be significantly degraded by third-party content, with average misleading rates of 42.0% and 36.1% in the dynamic and static environments, respectively. The framework and benchmark have been released at https://agenthazard.github.io.
