WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

Abstract

Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks, which require accurate retrieval of large amounts of information from the observations; (ii) Calculation tasks, which demand precise mathematical reasoning; and (iii) Long-Term Memory tasks, which require retaining information across multiple webpages. Built on top of the four fully reproducible and widely adopted WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results demonstrate that as LLMs evolve, represented by GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro, performance on WebChoreArena improves significantly. These findings suggest that WebChoreArena is well suited to measuring the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even with Gemini 2.5 Pro, there remains substantial room for improvement compared to WebArena, highlighting the increased challenges posed by WebChoreArena.
