VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Abstract

Videos are often used to learn or extract the information needed to complete tasks in ways that text and static imagery alone cannot provide. However, many existing agent benchmarks neglect long-context video understanding, instead focusing on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. Skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, while factual retention tasks evaluate whether an agent can retrieve instruction-relevant information from a video to complete a task. We find that the best model achieves 13.3% success on factual retention tasks and 45.8% on factual retention QA pairs, far below human performance at 73.9% and 79.3%, respectively. On skill retention tasks, long-context models perform worse with tutorials than without, exhibiting a 5% performance decrease on WebArena tasks and a 10.3% decrease on VisualWebArena tasks. Our work highlights the need to improve the agentic abilities of long-context multimodal models and provides a testbed for future development with long-context video agents.
