A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
July 24, 2023
Authors: Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust
cs.AI
Abstract
Pre-trained large language models (LLMs) have recently achieved better
generalization and sample efficiency in autonomous web navigation. However, the
performance on real-world websites has still suffered from (1) open domainness,
(2) limited context length, and (3) lack of inductive bias on HTML. We
introduce WebAgent, an LLM-driven agent that can complete the tasks on real
websites following natural language instructions. WebAgent plans ahead by
decomposing instructions into canonical sub-instructions, summarizes long HTML
documents into task-relevant snippets, and acts on websites via Python
programs generated from those plans and snippets. We design WebAgent with
Flan-U-PaLM, for grounded
code generation, and HTML-T5, new pre-trained LLMs for long HTML documents
using local and global attention mechanisms and a mixture of long-span
denoising objectives, for planning and summarization. We empirically
demonstrate that our recipe improves the success rate on a real website by over 50%,
and that HTML-T5 is the best model to solve HTML-based tasks; achieving 14.9%
higher success rate than prior SoTA on the MiniWoB web navigation benchmark and
better accuracy on offline task planning evaluation.
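The plan → summarize → act loop the abstract describes can be sketched in a few lines. Everything below is a hypothetical stand-in: the function names, the naive keyword summarizer (in place of HTML-T5), and the template code generator (in place of Flan-U-PaLM) are illustrative, not the paper's actual API.

```python
# Toy sketch of the WebAgent loop: plan a sub-instruction, summarize the page
# HTML into a task-relevant snippet, then generate an action program.
# All helpers here are hypothetical stand-ins for the paper's LLM components.

def plan_step(instruction, history):
    """Return the next canonical sub-instruction (HTML-T5's planning role).
    Stand-in: treat the instruction as a comma-separated list of steps."""
    steps = [s.strip() for s in instruction.split(",")]
    return steps[len(history)] if len(history) < len(steps) else None

def summarize_html(html, sub_instruction, max_len=200):
    """Extract task-relevant snippets from a long HTML document
    (HTML-T5's summarization role). Stand-in: a naive keyword filter."""
    keywords = sub_instruction.lower().split()
    lines = [ln for ln in html.splitlines()
             if any(k in ln.lower() for k in keywords)]
    return "\n".join(lines)[:max_len]

def generate_program(sub_instruction, snippet):
    """Emit an action program from the plan and snippet
    (Flan-U-PaLM's grounded code-generation role). Stand-in: a template."""
    return f"# act: {sub_instruction}\n# context: {snippet!r}"

def web_agent(instruction, get_html):
    """Run the loop until the planner has no sub-instruction left."""
    history = []
    while True:
        sub = plan_step(instruction, history)
        if sub is None:
            break
        snippet = summarize_html(get_html(), sub)
        program = generate_program(sub, snippet)
        history.append((sub, program))  # a real agent would execute it here
    return history
```

A real deployment would execute each generated program against a live browser and re-read the page between steps; the `get_html` callback stands in for that observation step.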
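The "mixture of long-span denoising objectives" used to pre-train HTML-T5 is, at its core, T5-style span corruption with longer masked spans. The toy version below is an assumption-laden illustration: the span length, noise density, and sentinel format are made up, not the paper's settings.

```python
import random

def corrupt_long_spans(tokens, mean_span=8, noise_density=0.15, seed=0):
    """Toy long-span denoising: replace contiguous spans of roughly
    `mean_span` tokens with sentinel markers. Returns (inputs, targets);
    the model is trained to reconstruct the targets from the inputs."""
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * noise_density))  # tokens left to mask
    inputs, targets = [], []
    i = sentinel = 0
    while i < len(tokens):
        if budget > 0 and rng.random() < noise_density:
            # Draw a span length around mean_span, clipped to what remains.
            span = min(max(1, int(rng.gauss(mean_span, 2))),
                       budget, len(tokens) - i)
            mark = f"<X{sentinel}>"
            inputs.append(mark)            # corrupted input sees one sentinel
            targets.append(mark)           # target repeats the sentinel...
            targets.extend(tokens[i:i + span])  # ...followed by the span
            sentinel += 1
            budget -= span
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets
```

Longer spans force the model to reconstruct multi-token structures (e.g. whole HTML elements) rather than single tokens, which is the intuition behind using a mixture of long-span objectives for HTML.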