A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

Abstract

Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web navigation. However, performance on real-world websites still suffers from (1) open domainness, (2) limited context length, and (3) a lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that completes tasks on real websites by following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites through Python programs generated from those sub-instructions and snippets. We design WebAgent with two models: Flan-U-PaLM for grounded code generation, and HTML-T5 for planning and summarization. HTML-T5 denotes new pre-trained LLMs for long HTML documents that use local and global attention mechanisms and a mixture of long-span denoising objectives. We empirically demonstrate that our recipe improves the success rate on a real website by over 50%, and that HTML-T5 is the best model for solving HTML-based tasks: it achieves a 14.9% higher success rate than the prior SoTA on the MiniWoB web navigation benchmark and better accuracy on offline task planning evaluation.
