

WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

June 2, 2025
作者: Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, Kiyoharu Aizawa, Toshihiko Yamasaki
cs.AI

Abstract

Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: can they go beyond general browsing to robustly handle tedious and complex tasks, the chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new, fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks, which require accurate retrieval of large amounts of information from the observations; (ii) Calculation tasks, which demand precise mathematical reasoning; and (iii) Long-Term Memory tasks, which necessitate maintaining memory across multiple webpages. Built on top of the four fully reproducible and widely adopted WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results demonstrate that as LLMs evolve, from GPT-4o to Claude 3.7 Sonnet and Gemini 2.5 Pro, performance on WebChoreArena improves significantly. These findings suggest that WebChoreArena is well suited to measuring the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even Gemini 2.5 Pro leaves substantial room for improvement relative to WebArena, highlighting the increased challenge posed by WebChoreArena.
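For concreteness, since WebChoreArena builds on the WebArena environments, its tasks plausibly follow WebArena's JSON task-config convention (an `intent`, a `start_url`, and an `eval` block with reference answers). The sketch below is illustrative only: the concrete task text, reference answer, placeholder URL, and `run_agent` helper are invented for this example and are not drawn from the benchmark.

```python
# Minimal sketch of a WebArena-style task entry plus an exact-match scorer.
# Field names mirror WebArena's task-config convention; all concrete values
# here (task text, URL, reference answer) are hypothetical.

example_task = {
    "task_id": 0,                       # hypothetical ID, not from the benchmark
    "sites": ["shopping_admin"],        # one of the four WebArena environments
    "intent": (                         # a Calculation-style chore: aggregation
        "Compute the total revenue from all orders placed in January 2023."
    ),
    "start_url": "http://localhost:7780/admin",  # placeholder URL
    "eval": {
        "eval_types": ["string_match"],
        "reference_answers": {"exact_match": "12345.67"},  # invented value
    },
}

def score(task: dict, prediction: str) -> bool:
    """Exact-match check of an agent's answer against the task's reference."""
    reference = task["eval"]["reference_answers"]["exact_match"]
    return prediction.strip() == reference

# Usage (run_agent is a hypothetical agent-execution helper):
#   prediction = run_agent(example_task)
#   print(score(example_task, prediction))
```

Under this reading, the three challenge categories differ mainly in what the `intent` demands of the agent: retrieving many facts from page observations, computing over them, or carrying information across multiple pages before answering.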