A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
July 24, 2023
Authors: Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust
cs.AI
Abstract
Pre-trained large language models (LLMs) have recently achieved better
generalization and sample efficiency in autonomous web navigation. However, the
performance on real-world websites has still suffered from (1) open domainness,
(2) limited context length, and (3) lack of inductive bias on HTML. We
introduce WebAgent, an LLM-driven agent that can complete tasks on real
websites following natural language instructions. WebAgent plans ahead by
decomposing instructions into canonical sub-instructions, summarizes long HTML
documents into task-relevant snippets, and acts on websites via Python
programs generated from those snippets. We design WebAgent with Flan-U-PaLM
for grounded code generation, and with HTML-T5, a new pre-trained LLM for
long HTML documents that uses local and global attention mechanisms and a
mixture of long-span denoising objectives, for planning and summarization.
We empirically
demonstrate that our recipe improves the success rate on a real website by
over 50%, and that HTML-T5 is the best model for solving HTML-based tasks,
achieving a 14.9% higher success rate than the prior SoTA on the MiniWoB web
navigation benchmark and better accuracy on offline task planning evaluation.
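The pipeline the abstract describes can be sketched as a simple loop: a planner picks the next canonical sub-instruction, a summarizer reduces raw HTML to task-relevant snippets, and a code generator emits a program grounded on those snippets. The sketch below is purely illustrative; the function names, the stubbed model behavior, and the keyword-matching "summarization" are assumptions standing in for HTML-T5 and Flan-U-PaLM, not the paper's actual interfaces.

```python
# Illustrative sketch of the three-stage WebAgent loop: plan -> summarize ->
# program synthesis. All names and stub logic are hypothetical placeholders
# for the paper's HTML-T5 (planning/summarization) and Flan-U-PaLM (codegen).

def plan_sub_instruction(instruction: str, history: list[str]) -> str:
    """Stand-in planner: return the next canonical sub-instruction, or ''."""
    steps = [s.strip() for s in instruction.split(",")]
    return steps[len(history)] if len(history) < len(steps) else ""

def summarize_html(raw_html: str, sub_instruction: str) -> list[str]:
    """Stand-in summarizer: keep only lines relevant to the sub-instruction."""
    tokens = sub_instruction.lower().split()
    return [line for line in raw_html.splitlines()
            if any(tok in line.lower() for tok in tokens)]

def synthesize_program(sub_instruction: str, snippets: list[str]) -> str:
    """Stand-in code generator: emit a program grounded on the snippets."""
    return f"# act: {sub_instruction} (grounded on {len(snippets)} snippet(s))"

def web_agent(instruction: str, raw_html: str) -> list[str]:
    """Run the plan/summarize/synthesize loop until no sub-instructions remain."""
    history: list[str] = []
    programs: list[str] = []
    while (sub := plan_sub_instruction(instruction, history)):
        snippets = summarize_html(raw_html, sub)
        programs.append(synthesize_program(sub, snippets))
        history.append(sub)
    return programs

programs = web_agent(
    "search apartments, sort by price",
    "<input id='search'>search box</input>\n<button>sort by price</button>",
)
```

In the real system each stand-in is a learned model and the generated programs are executed against the live website, but the control flow (decompose, condense the long HTML context, then act) follows this shape.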