Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Abstract

Language agents have demonstrated promising capabilities in automating web-based tasks, but their current reactive approaches still fall far short of human performance. Incorporating advanced planning algorithms, particularly tree search methods, could enhance these agents' performance, but running tree search directly on live websites poses significant safety risks and practical constraints because of irreversible actions such as confirming a purchase. In this paper, we introduce a novel paradigm that augments language agents with model-based planning, pioneering the use of large language models (LLMs) as world models in complex web environments. Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities. Specifically, WebDreamer uses LLMs to simulate the outcome of each candidate action (e.g., "what would happen if I click this button?") as a natural language description, and then evaluates these imagined outcomes to determine the optimal action at each step. Empirical results on two representative web agent benchmarks with online interaction, VisualWebArena and Mind2Web-live, demonstrate that WebDreamer achieves substantial improvements over reactive baselines. By establishing the viability of LLMs as world models in web environments, this work lays the groundwork for a paradigm shift in automated web interaction. More broadly, our findings open exciting new avenues for future research into 1) optimizing LLMs specifically for world modeling in complex, dynamic environments, and 2) model-based speculative planning for language agents.
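The simulate-then-evaluate step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `plan_step`, `toy_simulate`, and `toy_score` are hypothetical names, and the two toy functions are simple stand-ins for the LLM calls that would imagine and evaluate outcomes in a real system.

```python
from typing import Callable, List, Optional, Tuple


def plan_step(
    state: str,
    candidates: List[str],
    simulate: Callable[[str, str], str],
    score: Callable[[str, str], float],
    goal: str,
) -> Tuple[str, float]:
    """Return the candidate action whose imagined outcome scores highest."""
    best_action: Optional[str] = None
    best_score = float("-inf")
    for action in candidates:
        # World-model call: a natural language description of what would
        # happen, produced without touching the live website.
        imagined = simulate(state, action)
        # Evaluation call: how much does the imagined outcome help the goal?
        value = score(goal, imagined)
        if value > best_score:
            best_action, best_score = action, value
    if best_action is None:
        raise ValueError("no candidate actions were provided")
    return best_action, best_score


def toy_simulate(state: str, action: str) -> str:
    """Assumed stand-in for an LLM world model (fixed lookup table)."""
    outcomes = {
        "click 'Search'": "A results page listing laptops appears.",
        "click 'Confirm purchase'": "The order is placed and cannot be undone.",
    }
    return outcomes.get(action, "Nothing noticeable changes.")


def toy_score(goal: str, imagined: str) -> float:
    """Assumed stand-in for an LLM evaluator (fraction of goal words matched)."""
    goal_words = set(goal.lower().split())
    outcome_words = set(imagined.lower().strip(".").split())
    return len(goal_words & outcome_words) / len(goal_words)


if __name__ == "__main__":
    action, value = plan_step(
        state="shop home page",
        candidates=["click 'Confirm purchase'", "click 'Search'", "scroll down"],
        simulate=toy_simulate,
        score=toy_score,
        goal="find laptops results",
    )
    print(action, round(value, 2))
```

Because every candidate is only imagined, the irreversible "Confirm purchase" action is scored without ever being executed; only the selected action would be sent to the real page.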
