WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Abstract

The advancement of large language models (LLMs) has opened a new era of autonomous applications in the real world, driving innovation in the creation of advanced web agents. Existing web agents typically handle only one input modality and are evaluated only in simplified web simulators or on static web snapshots, which greatly limits their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, a web agent powered by a Large Multimodal Model (LMM) that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we propose a new evaluation protocol for web agents that addresses the challenges of automatically evaluating open-ended web agent tasks by leveraging the multimodal comprehension capabilities of GPT-4V. We create a new benchmark by gathering real-world tasks from 15 widely used websites to evaluate our agents. We show that WebVoyager achieves a 55.7% task success rate, significantly surpassing the performance of both the GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the capability of WebVoyager in practical applications. We find that our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in real-world settings.
