WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
January 25, 2024
Authors: Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, Dong Yu
cs.AI
Abstract
The advancement of large language models (LLMs) leads to a new era marked by
the development of autonomous applications in the real world, which drives
innovation in the creation of advanced web-based agents. Existing web agents
typically only handle one input modality and are evaluated only in simplified
web simulators or static web snapshots, greatly limiting their applicability in
real-world scenarios. To bridge this gap, we introduce WebVoyager, an
innovative Large Multimodal Model (LMM) powered web agent that can complete
user instructions end-to-end by interacting with real-world websites. Moreover,
we propose a new evaluation protocol for web agents to address the challenges
of automatic evaluation of open-ended web agent tasks, leveraging the robust
multimodal comprehension capabilities of GPT-4V. We create a new benchmark by
gathering real-world tasks from 15 widely used websites to evaluate our agents.
We show that WebVoyager achieves a 55.7% task success rate, significantly
surpassing the performance of both GPT-4 (All Tools) and the WebVoyager
(text-only) setups, underscoring the exceptional capability of WebVoyager in
practical applications. We also find that our proposed automatic evaluation
achieves 85.3% agreement with human judgment, paving the way for further
development of web agents in a real-world setting.