WebChallenger: A Reliable and Efficient Generalist Web Agent

Abstract

Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We argue this gap stems not from insufficient model capability but from agent architectures that fail to replicate three human cognitive advantages: selective attention to relevant page regions, persistent memory of website structure, and procedural fluency with common interaction patterns. We introduce WebChallenger, a web agent framework that addresses each gap through architecture design rather than model scale, built around PageMem: a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries. On this shared substrate we build three mechanisms that mirror the three cognitive advantages: a divide-and-conquer observation pipeline that lets the agent skim section summaries and extract details only from task-relevant regions; a lightweight exploration and memory system that traverses each website once to build a reusable map of pages and element behaviors; and compound action workflows that collapse common multi-step interactions into single agent actions, handling partial state changes automatically. Because all three operate over PageMem, the framework generalizes across websites without site-specific adapters. Using off-the-shelf open-weight models without fine-tuning, our system achieves 56.3% on WebArena, 48.7% on VisualWebArena, 51.0% on Online-Mind2Web, and 70.9% on WorkArena, approaching frontier proprietary systems at a fraction of the cost. Our code is released at https://github.com/jayoohwang1/webchallenger
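The idea of skimming section summaries and extracting details only from task-relevant regions can be illustrated with a minimal sketch. Everything here is a hypothetical illustration and not the released WebChallenger code: the `Section` class, the `skim` and `extract_relevant` helpers, and the sample page are all assumptions, and a simple keyword predicate stands in for whatever relevance judgment the actual agent makes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Section:
    """Hypothetical node in a page hierarchy: a short summary plus full detail."""
    title: str
    summary: str
    details: str = ""
    children: list[Section] = field(default_factory=list)


def skim(section: Section, depth: int = 0) -> list[str]:
    """Return only titles and summaries, indented by depth (the cheap overview)."""
    lines = [f"{'  ' * depth}{section.title}: {section.summary}"]
    for child in section.children:
        lines.extend(skim(child, depth + 1))
    return lines


def extract_relevant(section: Section,
                     is_relevant: Callable[[Section], bool]) -> list[str]:
    """Collect full details only from sections the predicate marks as relevant."""
    found = []
    if section.details and is_relevant(section):
        found.append(section.details)
    for child in section.children:
        found.extend(extract_relevant(child, is_relevant))
    return found


# Hand-written sample page (an assumption; a real hierarchy would be built from the DOM).
page = Section(
    title="Product page",
    summary="Laptop listing with reviews",
    children=[
        Section("Navigation", "Site-wide links", details="Home | Cart | Account"),
        Section("Price box", "Price and add-to-cart button",
                details="$999.00 [Add to cart]"),
        Section("Reviews", "Customer reviews", details="4.5 stars from 120 reviews"),
    ],
)


def keyword_filter(section: Section) -> bool:
    """Stand-in relevance test for a task such as 'find the price'."""
    return "price" in section.summary.lower()


overview = skim(page)                              # summaries only
relevant = extract_relevant(page, keyword_filter)  # details from matching sections
```

In this sketch the overview stays small regardless of how much detail each section holds, and only the matching section contributes its full content to the agent's context.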