OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization

Abstract

The rapid development of large language and multimodal models has sparked significant interest in using proprietary models, such as GPT-4o, to develop autonomous agents capable of handling real-world scenarios like web navigation. Recent open-source efforts have tried to equip agents with the ability to explore environments and improve over time, but they build text-only agents in synthetic environments where the reward signals are clearly defined. Such agents struggle to generalize to realistic settings, which require multimodal perception and lack ground-truth signals. In this paper, we introduce an open-source framework designed to facilitate the development of multimodal web agents that can autonomously conduct real-world exploration and improve themselves. We first train the base model with imitation learning so that it acquires basic abilities. We then let the agent explore the open web and collect feedback on its trajectories. After that, the agent further improves its policy by learning from the trajectories that another general-purpose model judges to be well-performing. This exploration-feedback-optimization cycle can continue for several iterations. Experimental results show that our web agent improves itself after each iteration and demonstrates strong performance across multiple test sets.
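
The exploration-feedback-optimization cycle described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Trajectory`, `run_cycle`, and the `toy_*` stand-ins are hypothetical names, and the toy policy is a single integer "skill" rather than a multimodal model. The starting policy stands in for the model obtained after imitation learning.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Trajectory:
    """One exploration attempt: the task and the actions the agent took."""
    task: Any
    actions: Tuple[str, ...]


def run_cycle(
    policy: Any,
    tasks: Sequence[Any],
    explore: Callable[[Any, Any], Trajectory],
    judge: Callable[[Trajectory], bool],
    optimize: Callable[[Any, List[Trajectory]], Any],
    iterations: int,
) -> Tuple[Any, List[int]]:
    """Repeat explore -> judge -> optimize; return the final policy and
    the number of accepted trajectories per iteration."""
    history: List[int] = []
    for _ in range(iterations):
        # Exploration: the current policy attempts every task.
        trajectories = [explore(policy, task) for task in tasks]
        # Feedback: a separate judge keeps only well-performing trajectories.
        accepted = [t for t in trajectories if judge(t)]
        # Optimization: update the policy using the accepted trajectories.
        policy = optimize(policy, accepted)
        history.append(len(accepted))
    return policy, history


# Hypothetical toy stand-ins: a task is an integer difficulty and the
# policy is an integer skill level.
def toy_explore(skill: int, difficulty: int) -> Trajectory:
    # The toy agent succeeds only on tasks at most one step beyond its skill.
    outcome = "answer" if difficulty <= skill + 1 else "give_up"
    return Trajectory(task=difficulty, actions=("navigate", outcome))


def toy_judge(trajectory: Trajectory) -> bool:
    # Stand-in for the general-purpose judge model.
    return trajectory.actions[-1] == "answer"


def toy_optimize(skill: int, accepted: List[Trajectory]) -> int:
    # Stand-in for fine-tuning: skill rises to the hardest accepted task.
    return max([skill] + [t.task for t in accepted])


if __name__ == "__main__":
    final_skill, accepted_counts = run_cycle(
        0, [1, 2, 3], toy_explore, toy_judge, toy_optimize, iterations=3
    )
    print(final_skill, accepted_counts)
```

In this sketch the judge is just another callable, mirroring the idea that feedback comes from a separate model rather than from a ground-truth reward signal.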
