AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

Abstract

Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or on static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Providing 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region- and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with the complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.
