AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

Abstract

Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or on static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Providing 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region- and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with the complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.
