The BrowserGym Ecosystem for Web Agent Research

Abstract

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those that use automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, which makes reliable comparisons and reproducible results hard to achieve. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers the flexibility to integrate new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, to support more reliable comparisons, and to facilitate in-depth analysis of agent behaviors. It could in turn lead to more adaptable and capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between the latest models from OpenAI and Anthropic, with Claude-3.5-Sonnet leading on almost all benchmarks, except on vision-related tasks, where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
