Multimodal Web Navigation with Instruction-Finetuned Foundation Models

Abstract

Progress in autonomous web navigation has been hindered by two factors: a dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose WebGUM, an instruction-following multimodal agent that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision transformer on a large corpus of demonstrations. We empirically demonstrate that this recipe improves the agent's grounded visual perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB benchmark, we improve over the previous best offline methods by more than 31.9%, coming close to the online-finetuned SoTA. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.
