WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

January 25, 2024
Authors: Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, Dong Yu
cs.AI

Abstract

The advancement of large language models (LLMs) leads to a new era marked by the development of autonomous applications in the real world, which drives innovation in the creation of advanced web-based agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we propose a new evaluation protocol for web agents to address the challenges of automatic evaluation of open-ended web agent tasks, leveraging the robust multimodal comprehension capabilities of GPT-4V. We create a new benchmark by gathering real-world tasks from 15 widely used websites to evaluate our agents. We show that WebVoyager achieves a 55.7% task success rate, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager in practical applications. We found that our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in a real-world setting.