WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

Abstract

Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly proficient at general browsing tasks, a critical question emerges: can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks, which require accurate retrieval of large amounts of information from the observations; (ii) Calculation tasks, which demand precise mathematical reasoning; and (iii) Long-Term Memory tasks, which require retaining information across multiple webpages. Built on top of the four fully reproducible and widely adopted WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results show that performance on WebChoreArena improves significantly as LLMs evolve, as represented by GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro. These findings suggest that WebChoreArena is well suited to measuring the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even with Gemini 2.5 Pro, substantial room for improvement remains compared to WebArena, highlighting the increased challenges posed by WebChoreArena.
