The BrowserGym Ecosystem for Web Agent Research

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors. It could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between the latest models from OpenAI and Anthropic, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
