AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

Abstract

Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy requires more than reactive element matching; it requires a predictive mental model of interface dynamics and the ability to foresee the "digital world state" that results from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated, focusing either on black-box task completion or on static, shallow grounding, and therefore fail to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Providing 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region- and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with the complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.
