WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Abstract

Advances in large language models (LLMs) have opened a new era of autonomous applications in the real world and are driving innovation in advanced web-based agents. Existing web agents typically handle only one input modality and are evaluated only in simplified web simulators or on static web snapshots, which greatly limits their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, a web agent powered by a Large Multimodal Model (LMM) that completes user instructions end-to-end by interacting with real-world websites. We also propose a new evaluation protocol that leverages the robust multimodal comprehension capabilities of GPT-4V to address the challenges of automatically evaluating open-ended web agent tasks. We create a new benchmark by gathering real-world tasks from 15 widely used websites to evaluate our agents. WebVoyager achieves a 55.7% task success rate, significantly surpassing both the GPT-4 (All Tools) and WebVoyager (text-only) setups, which underscores its capability in practical applications. Our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in real-world settings.
