OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

February 27, 2024
Authors: Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov
cs.AI

Abstract

For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a screen image paired with a visually-grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs best on our benchmark. However, its performance level still reaches only 15% of human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks and motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens.
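To make the task format concrete, below is a minimal sketch of the kind of input/output pair the abstract describes: a natural language instruction, grounded in a provided screenshot, mapped to an executable automation script. PyAutoGUI as the scripting layer is an assumption here, and the coordinates and field values are hypothetical; the benchmark's actual script schema and coordinates come from its annotated screen images.

```python
# Hypothetical examples of the screenshot + instruction -> script format.
# PyAutoGUI is assumed as the automation layer; coordinates are placeholders
# that, in the benchmark, would be grounded in the given screen image.
import pyautogui

# Task: "Play the next song"
pyautogui.click(x=1503, y=49)  # click the "next track" button

# Task: "Send an email to John Doe mentioning the time and place to meet"
pyautogui.click(x=120, y=310)           # click the "Compose" button
pyautogui.write("johndoe@example.com")  # fill the recipient field
pyautogui.press("tab")                  # move to the message body
pyautogui.write("Let's meet at 3pm at the cafe on Main St.")
pyautogui.hotkey("ctrl", "enter")       # send the email
```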
