A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
July 24, 2023
Authors: Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust
cs.AI
Abstract
Pre-trained large language models (LLMs) have recently achieved better
generalization and sample efficiency in autonomous web navigation. However, the
performance on real-world websites has still suffered from (1) open domainness,
(2) limited context length, and (3) lack of inductive bias on HTML. We
introduce WebAgent, an LLM-driven agent that can complete tasks on real
websites following natural language instructions. WebAgent plans ahead by
decomposing instructions into canonical sub-instructions, summarizes long HTML
documents into task-relevant snippets, and acts on websites via Python
programs generated from those snippets. We design WebAgent with Flan-U-PaLM
for grounded code generation, and with HTML-T5, a new pre-trained LLM for
long HTML documents that uses local and global attention mechanisms and a
mixture of long-span denoising objectives, for planning and summarization.
We empirically
demonstrate that our recipe improves the success rate on a real website by
over 50%, and that HTML-T5 is the best model for solving HTML-based tasks,
achieving a 14.9% higher success rate than the prior SoTA on the MiniWoB web
navigation benchmark and better accuracy on offline task planning evaluation.
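The pipeline the abstract describes can be sketched as a simple loop: a planner picks the next canonical sub-instruction, a summarizer reduces raw HTML to task-relevant snippets, and a code generator emits a program grounded on those snippets. The sketch below is purely illustrative; the function names, the stubbed model behavior, and the keyword-matching "summarization" are assumptions standing in for HTML-T5 and Flan-U-PaLM, not the paper's actual interfaces.

```python
# Illustrative sketch of the three-stage WebAgent loop: plan -> summarize ->
# program synthesis. All names and stub logic are hypothetical placeholders
# for the paper's HTML-T5 (planning/summarization) and Flan-U-PaLM (codegen).

def plan_sub_instruction(instruction: str, history: list[str]) -> str:
    """Stand-in planner: return the next canonical sub-instruction, or ''."""
    steps = [s.strip() for s in instruction.split(",")]
    return steps[len(history)] if len(history) < len(steps) else ""

def summarize_html(raw_html: str, sub_instruction: str) -> list[str]:
    """Stand-in summarizer: keep only lines relevant to the sub-instruction."""
    tokens = sub_instruction.lower().split()
    return [line for line in raw_html.splitlines()
            if any(tok in line.lower() for tok in tokens)]

def synthesize_program(sub_instruction: str, snippets: list[str]) -> str:
    """Stand-in code generator: emit a program grounded on the snippets."""
    return f"# act: {sub_instruction} (grounded on {len(snippets)} snippet(s))"

def web_agent(instruction: str, raw_html: str) -> list[str]:
    """Run the plan/summarize/synthesize loop until no sub-instructions remain."""
    history: list[str] = []
    programs: list[str] = []
    while (sub := plan_sub_instruction(instruction, history)):
        snippets = summarize_html(raw_html, sub)
        programs.append(synthesize_program(sub, snippets))
        history.append(sub)
    return programs

programs = web_agent(
    "search apartments, sort by price",
    "<input id='search'>search box</input>\n<button>sort by price</button>",
)
```

In the real system each stand-in is a learned model and the generated programs are executed against the live website, but the control flow (decompose, condense the long HTML context, then act) follows this shape.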