A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
July 24, 2023
Authors: Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust
cs.AI
Abstract
Pre-trained large language models (LLMs) have recently achieved better
generalization and sample efficiency in autonomous web navigation. However, the
performance on real-world websites has still suffered from (1) open domainness,
(2) limited context length, and (3) lack of inductive bias on HTML. We
introduce WebAgent, an LLM-driven agent that can complete the tasks on real
websites following natural language instructions. WebAgent plans ahead by
decomposing instructions into canonical sub-instructions, summarizes long HTML
documents into task-relevant snippets, and acts on websites via Python
programs generated from those plans and snippets. We design WebAgent with
Flan-U-PaLM, for grounded
code generation, and HTML-T5, new pre-trained LLMs for long HTML documents
using local and global attention mechanisms and a mixture of long-span
denoising objectives, for planning and summarization. We empirically
demonstrate that our recipe improves the success rate on a real website by over 50%,
and that HTML-T5 is the best model to solve HTML-based tasks; achieving 14.9%
higher success rate than prior SoTA on the MiniWoB web navigation benchmark and
better accuracy on offline task planning evaluation.
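The plan → summarize → act loop the abstract describes can be sketched in a few lines. Everything below is a hypothetical stand-in: the function names, the naive keyword summarizer (in place of HTML-T5), and the template code generator (in place of Flan-U-PaLM) are illustrative, not the paper's actual API.

```python
# Toy sketch of the WebAgent loop: plan a sub-instruction, summarize the page
# HTML into a task-relevant snippet, then generate an action program.
# All helpers here are hypothetical stand-ins for the paper's LLM components.

def plan_step(instruction, history):
    """Return the next canonical sub-instruction (HTML-T5's planning role).
    Stand-in: treat the instruction as a comma-separated list of steps."""
    steps = [s.strip() for s in instruction.split(",")]
    return steps[len(history)] if len(history) < len(steps) else None

def summarize_html(html, sub_instruction, max_len=200):
    """Extract task-relevant snippets from a long HTML document
    (HTML-T5's summarization role). Stand-in: a naive keyword filter."""
    keywords = sub_instruction.lower().split()
    lines = [ln for ln in html.splitlines()
             if any(k in ln.lower() for k in keywords)]
    return "\n".join(lines)[:max_len]

def generate_program(sub_instruction, snippet):
    """Emit an action program from the plan and snippet
    (Flan-U-PaLM's grounded code-generation role). Stand-in: a template."""
    return f"# act: {sub_instruction}\n# context: {snippet!r}"

def web_agent(instruction, get_html):
    """Run the loop until the planner has no sub-instruction left."""
    history = []
    while True:
        sub = plan_step(instruction, history)
        if sub is None:
            break
        snippet = summarize_html(get_html(), sub)
        program = generate_program(sub, snippet)
        history.append((sub, program))  # a real agent would execute it here
    return history
```

A real deployment would execute each generated program against a live browser and re-read the page between steps; the `get_html` callback stands in for that observation step.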
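The "mixture of long-span denoising objectives" used to pre-train HTML-T5 is, at its core, T5-style span corruption with longer masked spans. The toy version below is an assumption-laden illustration: the span length, noise density, and sentinel format are made up, not the paper's settings.

```python
import random

def corrupt_long_spans(tokens, mean_span=8, noise_density=0.15, seed=0):
    """Toy long-span denoising: replace contiguous spans of roughly
    `mean_span` tokens with sentinel markers. Returns (inputs, targets);
    the model is trained to reconstruct the targets from the inputs."""
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * noise_density))  # tokens left to mask
    inputs, targets = [], []
    i = sentinel = 0
    while i < len(tokens):
        if budget > 0 and rng.random() < noise_density:
            # Draw a span length around mean_span, clipped to what remains.
            span = min(max(1, int(rng.gauss(mean_span, 2))),
                       budget, len(tokens) - i)
            mark = f"<X{sentinel}>"
            inputs.append(mark)            # corrupted input sees one sentinel
            targets.append(mark)           # target repeats the sentinel...
            targets.extend(tokens[i:i + span])  # ...followed by the span
            sentinel += 1
            budget -= span
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets
```

Longer spans force the model to reconstruct multi-token structures (e.g. whole HTML elements) rather than single tokens, which is the intuition behind using a mixture of long-span objectives for HTML.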