
OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents

May 6, 2025
作者: Mariya Davydova, Daniel Jeffries, Patrick Barker, Arturo Márquez Flores, Sinéad Ryan
cs.AI

Abstract
In this paper, we introduce OSUniverse: a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on ease of use, extensibility, comprehensive coverage of test cases, and automated validation. We divide the tasks into increasing levels of complexity, from basic precision clicking to multistep, multiapplication tests that require dexterity, precision, and clear thinking from the agent. In version one of the benchmark, presented here, we have calibrated the complexity of the test cases to ensure that state-of-the-art (SOTA) agents (at the time of publication) do not achieve results higher than 50%, while the average white-collar worker can perform all of these tasks with perfect accuracy. The benchmark can be scored manually, but we also introduce an automated validation mechanism with an average error rate of less than 2%. Therefore, this benchmark provides solid ground for fully automated measurement of the progress, capabilities, and effectiveness of GUI-navigation AI agents over the short- and medium-term horizon. The source code of the benchmark is available at https://github.com/agentsea/osuniverse.
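The abstract describes tasks organized by complexity level and scored by an automated validator. As a rough illustration of that structure, here is a minimal sketch in Python; the names (`TestCase`, `score_run`) and the state-based checks are hypothetical and are not taken from the OSUniverse codebase.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: a test case pairs a complexity level with an
# automated check over the agent's observable final state.

@dataclass
class TestCase:
    name: str
    level: int                      # increasing complexity, e.g. 1 = precision clicking
    check: Callable[[dict], bool]   # automated validator over the final state

def score_run(cases: list[TestCase], final_states: dict[str, dict]) -> float:
    """Fraction of test cases whose automated check passes."""
    if not cases:
        return 0.0
    passed = sum(1 for c in cases if c.check(final_states.get(c.name, {})))
    return passed / len(cases)

# Toy example: one simple and one multi-application case.
cases = [
    TestCase("open_editor", 1, lambda s: s.get("active_app") == "editor"),
    TestCase("copy_between_apps", 3, lambda s: s.get("clipboard") == "hello"),
]
states = {
    "open_editor": {"active_app": "editor"},
    "copy_between_apps": {"clipboard": "world"},  # fails the check
}
print(score_run(cases, states))  # 0.5
```

State-based checks of this kind are one way an automated validator can avoid the cost of manual scoring; the paper reports an average error rate below 2% for its mechanism.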

