WebLINX: Real-World Website Navigation with Multi-Turn Dialogue

Abstract

We propose the problem of conversational web navigation, where a digital agent controls a web browser and follows user instructions to solve real-world tasks in a multi-turn dialogue fashion. To support this problem, we introduce WEBLINX, a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. Our benchmark covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios. Due to the magnitude of information present, Large Language Models (LLMs) cannot process entire web pages in real time. To solve this bottleneck, we design a retrieval-inspired model that efficiently prunes HTML pages by ranking relevant elements. We use the selected elements, along with screenshots and action history, to assess a variety of models for their ability to replicate human behavior when navigating the web. Our experiments span from small text-only models to proprietary multimodal LLMs. We find that smaller finetuned decoders surpass not only the best zero-shot LLMs (including GPT-4V) but also larger finetuned multimodal models that were explicitly pretrained on screenshots. However, all finetuned models struggle to generalize to unseen websites. Our findings highlight the need for large multimodal models that can generalize to novel settings. Our code, data, and models are available for research: https://mcgill-nlp.github.io/weblinx
