OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents
May 6, 2025
Authors: Mariya Davydova, Daniel Jeffries, Patrick Barker, Arturo Márquez Flores, Sinéad Ryan
cs.AI
Abstract
In this paper, we introduce OSUniverse: a benchmark of complex, multimodal
desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on
ease of use, extensibility, comprehensive coverage of test cases, and automated
validation. We divide the tasks into increasing levels of complexity, from
basic precision clicking to multi-step, multi-application tests requiring
dexterity, precision, and clear thinking from the agent. In version one of the
benchmark, presented here, we have calibrated the complexity of the test cases
to ensure that state-of-the-art (SOTA) agents, at the time of publication, do
not achieve results higher than 50%, while the average white-collar worker can
perform all of these tasks with perfect accuracy. The benchmark can be scored
manually, but we also introduce an automated validation mechanism with an
average error rate below 2%. The benchmark therefore provides solid ground for
fully automated measurement of the progress, capabilities, and effectiveness of
GUI-navigation AI agents over the short- and medium-term horizon. The source
code of the benchmark is available at
https://github.com/agentsea/osuniverse.
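
The abstract describes leveled test cases checked by an automated validator. As an illustration only, the following Python sketch shows one way such a harness could be organized; the names (TestCase, run_agent, validate, score) are hypothetical and do not reflect the actual osuniverse API — see the repository for the real interface.

    # Illustrative sketch only: these names are hypothetical and are not
    # the actual osuniverse API; see the repository for the real interface.
    from dataclasses import dataclass

    @dataclass
    class TestCase:
        name: str
        level: int        # increasing complexity: 1 = precision click ... N = multi-application
        instruction: str  # natural-language task given to the agent

    def run_agent(case: TestCase) -> dict:
        """Drive the agent under evaluation on a desktop task and return
        its final state (e.g., screenshots and action logs)."""
        raise NotImplementedError

    def validate(case: TestCase, result: dict) -> bool:
        """Automatically check the final state against the task's success
        criteria (the paper reports an average validation error rate
        below 2%)."""
        raise NotImplementedError

    def score(cases: list[TestCase]) -> float:
        """Fraction of tasks passed; SOTA agents score below 50% on version one."""
        return sum(validate(c, run_agent(c)) for c in cases) / len(cases)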