OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
February 27, 2024
作者: Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov
cs.AI
Abstract
For decades, human-computer interaction has been fundamentally manual. Even today, almost all productive work done on a computer requires human input at every step. Autonomous virtual agents represent an exciting step toward automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, the first-of-its-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer-horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a screen image paired with a visually grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs best on our benchmark. However, it still reaches only 15% of human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge our task poses for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks, and it motivates future work toward building multimodal models that bridge large language models and the visual grounding of computer screens.
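To make the task format concrete, the sketch below shows what an input-output pair might look like. It assumes the benchmark's output scripts resemble PyAutoGUI-style desktop automation calls; the screen coordinates, email address, and keyboard shortcut are hypothetical placeholders, not values taken from the dataset.

```python
import pyautogui

# Hypothetical OmniACT-style example (illustrative only).
# Input: a screenshot of a music player plus the natural language task
#   "Play the next song".
# Expected output: a script, grounded in the screenshot, that executes the task.
pyautogui.click(x=1043, y=874)  # assumed on-screen location of the "next song" button

# A longer-horizon task such as "Send an email to John Doe mentioning the
# time and place to meet" would compose several grounded actions:
pyautogui.click(x=120, y=210)            # assumed location of the "Compose" button
pyautogui.write("john.doe@example.com")  # hypothetical recipient address
pyautogui.press("tab")                   # move focus to the message body
pyautogui.write("Let's meet at 3 pm at the library.")
pyautogui.hotkey("ctrl", "enter")        # assumed "send" shortcut of the mail client
```

Scoring such a script requires both choosing the right sequence of actions and visually grounding each action in the correct on-screen coordinates, which is where the abstract reports current models falling well short of human performance.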