WebLINX: 다중 턴 대화를 통한 실세계 웹사이트 탐색

초록

우리는 디지털 에이전트가 웹 브라우저를 제어하고 사용자 지시에 따라 다중 턴 대화 방식으로 실제 작업을 해결하는 대화형 웹 탐색 문제를 제안한다. 이 문제를 지원하기 위해, 우리는 2300개의 전문가 시연을 통해 100,000건의 상호작용을 포함한 대규모 벤치마크인 WEBLINX를 소개한다. 우리의 벤치마크는 150개 이상의 실제 웹사이트에서 다양한 패턴을 다루며, 다양한 시나리오에서 에이전트를 훈련하고 평가하는 데 사용할 수 있다. 대량의 정보로 인해 대형 언어 모델(LLM)은 실시간으로 전체 웹 페이지를 처리할 수 없다. 이러한 병목 현상을 해결하기 위해, 우리는 관련 요소를 순위별로 정리하여 HTML 페이지를 효율적으로 정제하는 검색 기반 모델을 설계했다. 선택된 요소와 스크린샷, 행동 이력을 사용하여 웹 탐색 시 인간의 행동을 모방하는 다양한 모델의 능력을 평가한다. 우리의 실험은 소규모 텍스트 전용 모델부터 독점적인 다중 모드 LLM까지 광범위하게 걸쳐 있다. 우리는 소규모로 미세 조정된 디코더가 최고의 제로샷 LLM(예: GPT-4V)을 능가할 뿐만 아니라, 스크린샷에 대해 명시적으로 사전 훈련된 더 큰 다중 모드 모델도 능가한다는 것을 발견했다. 그러나 모든 미세 조정된 모델은 보지 못한 웹사이트에 일반화하는 데 어려움을 겪는다. 우리의 연구 결과는 새로운 환경에 일반화할 수 있는 대형 다중 모드 모델의 필요성을 강조한다. 우리의 코드, 데이터 및 모델은 연구 목적으로 이용 가능하다: https://mcgill-nlp.github.io/weblinx

English

We propose the problem of conversational web navigation, where a digital agent controls a web browser and follows user instructions to solve real-world tasks in a multi-turn dialogue fashion. To support this problem, we introduce WEBLINX - a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. Our benchmark covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios. Due to the magnitude of information present, Large Language Models (LLMs) cannot process entire web pages in real-time. To solve this bottleneck, we design a retrieval-inspired model that efficiently prunes HTML pages by ranking relevant elements. We use the selected elements, along with screenshots and action history, to assess a variety of models for their ability to replicate human behavior when navigating the web. Our experiments span from small text-only to proprietary multimodal LLMs. We find that smaller finetuned decoders surpass the best zero-shot LLMs (including GPT-4V), but also larger finetuned multimodal models which were explicitly pretrained on screenshots. However, all finetuned models struggle to generalize to unseen websites. Our findings highlight the need for large multimodal models that can generalize to novel settings. Our code, data and models are available for research: https://mcgill-nlp.github.io/weblinx

WebLINX: 다중 턴 대화를 통한 실세계 웹사이트 탐색

WebLINX: Real-World Website Navigation with Multi-Turn Dialogue

초록

Support