Multimodal Web Navigation with Instruction-Finetuned Foundation Models

Abstract

Progress in autonomous web navigation has been hindered by two factors: a dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose WebGUM, an instruction-following multimodal agent that observes both webpage screenshots and HTML pages and outputs web navigation actions such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision transformer on a large corpus of demonstrations. We empirically demonstrate that this recipe improves the agent's abilities in grounded visual perception, HTML comprehension, and multi-step reasoning, outperforming prior work by a significant margin. On the MiniWoB benchmark, we improve over the previous best offline methods by more than 31.9%, approaching the online-finetuned SoTA. On the WebShop benchmark, our 3-billion-parameter model outperforms the existing SoTA, PaLM-540B. We also collect 347K high-quality demonstrations using our trained models, 38 times more than prior work, and make them available to promote future research in this direction.
