The BrowserGym Ecosystem for Web Agent Research

Abstract

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those that use automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, which makes reliable comparisons and reproducible results hard to achieve. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors. It could in turn yield more adaptable, capable agents and ultimately accelerate innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between the latest models from OpenAI and Anthropic, with Claude-3.5-Sonnet leading on almost all benchmarks, except on vision-related tasks, where GPT-4o is superior. Despite these advances, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
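
To make the phrase "gym-like environment with well-defined observation and action spaces" concrete, the following is a minimal, self-contained sketch of the reset/step interaction loop that such environments conventionally expose. It is not the BrowserGym API: the class `ToyWebEnv`, the functions `scripted_agent` and `run_episode`, the observation keys, and the action strings are all hypothetical names chosen for illustration, and the five-value return of `step` is assumed to follow the common Gymnasium-style convention.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Obs = dict[str, Any]


@dataclass
class ToyWebEnv:
    """Hypothetical stand-in for a single web task (NOT the BrowserGym API)."""

    goal: str = "click the submit button"
    max_steps: int = 5
    _steps: int = field(default=0, init=False)

    def reset(self) -> tuple[Obs, dict[str, Any]]:
        # Start a fresh episode and return the first observation plus an info dict.
        self._steps = 0
        obs = {"goal": self.goal, "page": "<button id='submit'>Submit</button>"}
        return obs, {}

    def step(self, action: str) -> tuple[Obs, float, bool, bool, dict[str, Any]]:
        # Apply one text action and report (obs, reward, terminated, truncated, info).
        self._steps += 1
        solved = action == "click('submit')"
        page = "<p>Done</p>" if solved else "<button id='submit'>Submit</button>"
        obs = {"goal": self.goal, "page": page}
        reward = 1.0 if solved else 0.0
        truncated = (not solved) and self._steps >= self.max_steps
        return obs, reward, solved, truncated, {}


def scripted_agent(obs: Obs) -> str:
    # A trivial policy; in practice an LLM would map the observation to an action.
    return "click('submit')" if "submit" in obs["page"] else "noop()"


def run_episode(env: ToyWebEnv, agent: Callable[[Obs], str]) -> float:
    # The same loop works for any environment exposing reset() and step().
    obs, _ = env.reset()
    total = 0.0
    while True:
        obs, reward, terminated, truncated, _ = env.step(agent(obs))
        total += reward
        if terminated or truncated:
            return total
```

Because the agent only sees observations and emits actions, the same `run_episode` loop could in principle be reused across tasks, which is the kind of standardization the abstract describes.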
