OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

Abstract

For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on a computer requires human input at every step. Autonomous virtual agents represent an exciting step toward automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also streamline numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, a first-of-its-kind dataset and benchmark for assessing an agent's capability to generate executable programs that accomplish computer tasks. Our scope extends beyond traditional web automation and covers a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer-horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a screen image paired with a visually grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark. However, its performance still reaches only 15% of human proficiency in generating executable scripts capable of completing the task, which demonstrates how challenging our task is for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks, and it motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens.
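
As a rough illustration of the task formulation described in the abstract, the sketch below models one benchmark record (a screen image, a natural language task, and a reference script) and expresses an agent's score as a fraction of human proficiency. The names `TaskExample` and `relative_to_human`, the file path, the placeholder script, and the scores in the usage lines are all hypothetical and chosen for illustration only; they are not taken from the OmniACT dataset or its evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskExample:
    """One hypothetical benchmark record (field names are assumptions)."""

    screen_image_path: str  # screenshot the task is grounded in
    task: str               # natural language instruction
    reference_script: str   # executable script that completes the task


def relative_to_human(agent_score: float, human_score: float) -> float:
    """Return the agent's score as a fraction of the human score."""
    if human_score <= 0:
        raise ValueError("human_score must be positive")
    return agent_score / human_score


# Hypothetical record: the path and script are placeholders, not real data.
example = TaskExample(
    screen_image_path="screens/music_player.png",
    task="Play the next song",
    reference_script="# placeholder for the script that performs the task",
)

# Hypothetical scores, chosen only so that the ratio matches the 15% figure.
ratio = relative_to_human(agent_score=12.0, human_score=80.0)
print(f"{example.task!r}: agent reaches {ratio:.0%} of human proficiency")
```

Reporting the agent's score relative to a human baseline, instead of as a raw number, makes the gap described in the abstract directly readable regardless of the scale of the underlying metric.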

