

WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

June 2, 2025
Authors: Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, Kiyoharu Aizawa, Toshihiko Yamasaki
cs.AI

Abstract

Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new, fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks requiring accurate retrieval of large amounts of information from the observations, (ii) Calculation tasks demanding precise mathematical reasoning, and (iii) Long-Term Memory tasks requiring information to be retained across multiple webpages. Built on top of the four fully reproducible and widely adopted WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results demonstrate that performance on WebChoreArena improves significantly as LLMs evolve, as represented by GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro. These findings suggest that WebChoreArena is well suited to measuring the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even Gemini 2.5 Pro leaves substantial room for improvement relative to its WebArena performance, highlighting the increased challenges posed by WebChoreArena.
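To make the benchmark's structure concrete, here is a minimal, hypothetical sketch in Python of how a task record spanning the paper's three challenge categories, and a simple per-task accuracy metric, might be represented. The names (`ChoreTask`, `TaskType`, `intent`, `reference_answer`, `accuracy`) and the example task are illustrative assumptions, not the actual WebChoreArena data format or evaluation code.

```python
# Hypothetical sketch, not the official WebChoreArena schema: one way to model
# a benchmark task covering the paper's three challenge categories.
from dataclasses import dataclass
from enum import Enum


class TaskType(Enum):
    MASSIVE_MEMORY = "massive_memory"      # retrieve large amounts of info from observations
    CALCULATION = "calculation"            # precise mathematical reasoning
    LONG_TERM_MEMORY = "long_term_memory"  # retain information across multiple webpages


@dataclass
class ChoreTask:
    task_id: int
    site: str               # one of the four WebArena simulation environments
    intent: str             # natural-language instruction given to the agent
    task_type: TaskType
    reference_answer: str   # ground truth used to score the agent's final answer


def accuracy(results: list[bool]) -> float:
    """Fraction of tasks the agent completed correctly."""
    return sum(results) / len(results) if results else 0.0


# Example usage with an invented task:
task = ChoreTask(
    task_id=0,
    site="shopping",
    intent="Sum the totals of all orders placed in 2022.",
    task_type=TaskType.CALCULATION,
    reference_answer="1234.56",
)
print(accuracy([True, False, True]))  # ~0.667
```

Scoring task outcomes as a plain success rate mirrors how suites like WebArena report results, which is what allows the paper's direct comparison between the two benchmarks.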