Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

Abstract

Large language models (LLMs) have recently gained much attention for building autonomous agents. However, the performance of current LLM-based web agents on long-horizon tasks is far from optimal, and they often make errors such as repeatedly buying a non-refundable flight ticket. By contrast, humans can avoid such irreversible mistakes because we are aware of the potential outcomes of our actions (e.g., losing money), an ability also known as a "world model". Motivated by this, our study begins with preliminary analyses confirming the absence of world models in current LLMs (e.g., GPT-4o, Claude-3.5-Sonnet). We then present a World-model-augmented (WMA) web agent, which simulates the outcomes of its actions for better decision-making. To overcome the challenges of training LLMs as world models that predict next observations, such as repeated elements across observations and long HTML inputs, we propose a transition-focused observation abstraction, in which the prediction objectives are free-form natural language descriptions that exclusively highlight important state differences between time steps. Experiments on WebArena and Mind2Web show that our world models improve agents' policy selection without training and demonstrate our agents' cost- and time-efficiency compared to recent tree-search-based agents.
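
The two ideas in the abstract, describing only what changes between observations and simulating each candidate action before committing to it, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the line-based observation format, and the scoring function are all hypothetical. In particular, the abstract does not say how predicted outcomes are scored, so `value_fn` here is an assumption. In a real agent, `world_model` would be a trained LLM.

```python
from typing import Callable, List, Tuple


def abstract_transition(prev_obs: List[str], next_obs: List[str]) -> str:
    """Toy stand-in for a transition-focused abstraction.

    Instead of reproducing the full next observation, describe only
    the lines that were added or removed between two time steps.
    """
    prev_set, next_set = set(prev_obs), set(next_obs)
    added = [line for line in next_obs if line not in prev_set]
    removed = [line for line in prev_obs if line not in next_set]
    parts = []
    if added:
        parts.append("added: " + "; ".join(added))
    if removed:
        parts.append("removed: " + "; ".join(removed))
    return " | ".join(parts) if parts else "no change"


def select_action(
    observation: List[str],
    candidates: List[str],
    world_model: Callable[[List[str], str], str],  # hypothetical: predicts a change description
    value_fn: Callable[[str], float],              # hypothetical: scores a predicted outcome
) -> Tuple[str, float]:
    """Simulate each candidate action and return the best-scoring one."""
    if not candidates:
        raise ValueError("at least one candidate action is required")
    scored = []
    for action in candidates:
        predicted_change = world_model(observation, action)
        scored.append((action, value_fn(predicted_change)))
    # On ties, max() keeps the earliest candidate.
    return max(scored, key=lambda pair: pair[1])
```

Because the policy that proposes `candidates` is left untouched, a selection step of this kind needs no additional policy training, which matches the "without training" claim in the abstract.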
