WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Abstract

The advancement of large language models (LLMs) has opened a new era marked by the development of autonomous applications in the real world, driving innovation in advanced web-based agents. Existing web agents typically handle only one input modality and are evaluated only in simplified web simulators or on static web snapshots, which greatly limits their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that completes user instructions end-to-end by interacting with real-world websites. Moreover, we propose a new evaluation protocol for web agents that addresses the challenges of automatically evaluating open-ended web agent tasks by leveraging the robust multimodal comprehension capabilities of GPT-4V. We create a new benchmark by gathering real-world tasks from 15 widely used websites and use it to evaluate our agent. WebVoyager achieves a 55.7% task success rate, significantly surpassing both the GPT-4 (All Tools) and WebVoyager (text-only) setups, which underscores its capability in practical applications. Our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in real-world settings.
