OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
February 27, 2024
Authors: Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov
cs.AI
Abstract
For decades, human-computer interaction has fundamentally been manual. Even
today, almost all productive work done on the computer necessitates human input
at every step. Autonomous virtual agents represent an exciting step in
automating many of these menial tasks. Virtual agents would empower users with
limited technical proficiency to harness the full possibilities of computer
systems. They could also enable the efficient streamlining of numerous computer
tasks, ranging from calendar management to complex travel bookings, with
minimal human intervention. In this paper, we introduce OmniACT, the
first-of-its-kind dataset and benchmark for assessing an agent's capability to
generate executable programs to accomplish computer tasks. Our scope extends
beyond traditional web automation, covering a diverse range of desktop
applications. The dataset consists of fundamental tasks such as "Play the next
song", as well as longer horizon tasks such as "Send an email to John Doe
mentioning the time and place to meet". Specifically, given a screen image
paired with a visually-grounded natural language task, the goal is to generate a
script capable of fully executing the task. We run several strong baseline
language model agents on our benchmark. The strongest baseline, GPT-4, performs
the best on our benchmark. However, its performance level still reaches only 15%
of the human proficiency in generating executable scripts capable of completing
the task, demonstrating the challenge of our task for conventional web agents.
Our benchmark provides a platform to measure and evaluate the progress of
language model agents in automating computer tasks and motivates future work
towards building multimodal models that bridge large language models and the
visual grounding of computer screens.
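
To make the task format concrete, below is a minimal sketch of the kind of executable script an agent is expected to emit, written as PyAutoGUI-style Python. Everything in it (coordinates, field order, the send hotkey) is a hypothetical illustration, not an annotation from the dataset; real scripts would be grounded in the UI elements visible in the paired screen image.

    import pyautogui

    # Task: "Play the next song" -- click the player's 'next track' button.
    # The coordinates are hypothetical placeholders.
    pyautogui.click(x=842, y=615)

    # Task: "Send an email to John Doe mentioning the time and place to meet".
    pyautogui.click(x=120, y=88)          # 'Compose' button (hypothetical position)
    pyautogui.write("John Doe")           # fill the recipient field
    pyautogui.press("tab")                # move to the message body
    pyautogui.write("Let's meet at 3 pm at the cafe.")
    pyautogui.hotkey("ctrl", "enter")     # send the email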