OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents

Abstract

In this paper, we introduce OSUniverse: a benchmark of complex, multimodal, desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on ease of use, extensibility, comprehensive coverage of test cases, and automated validation. We divide the tasks into levels of increasing complexity, from basic precision clicking to multi-step, multi-application tests that require dexterity, precision, and clear thinking from the agent. In version one of the benchmark, presented here, we have calibrated the complexity of the test cases so that SOTA (State of the Art) agents at the time of publication do not achieve results higher than 50%, while the average white-collar worker can perform all of these tasks with perfect accuracy. The benchmark can be scored manually, but we also introduce an automated validation mechanism with an average error rate of less than 2%. This benchmark therefore provides a solid foundation for fully automated measurement of the progress, capabilities, and effectiveness of GUI-navigation AI agents over the short- and medium-term horizon. The source code of the benchmark is available at https://github.com/agentsea/osuniverse.
