WebChallenger: A Reliable and Efficient Generalist Web Agent

June 9, 2026
Authors: Jayoo Hwang, Xiaowen Zhang, Vedant Padwal
cs.AI

Abstract

Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We argue this gap stems not from insufficient model capability but from agent architectures that fail to replicate three human cognitive advantages: selective attention to relevant page regions, persistent memory of website structure, and procedural fluency with common interaction patterns. We introduce WebChallenger, a web agent framework that addresses each gap through architecture design rather than model scale, built around PageMem: a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries. On this shared substrate we build three mechanisms that mirror the three cognitive advantages: a divide-and-conquer observation pipeline that lets the agent skim section summaries and extract details only from task-relevant regions; a lightweight exploration and memory system that traverses each website once to build a reusable map of pages and element behaviors; and compound action workflows that collapse common multi-step interactions into single agent actions, handling partial state changes automatically. Because all three operate over PageMem, the framework generalizes across websites without site-specific adapters. Using off-the-shelf open-weight models without fine-tuning, our system achieves 56.3% on WebArena, 48.7% on VisualWebArena, 51.0% on Online-Mind2Web, and 70.9% on WorkArena, approaching frontier proprietary systems at a fraction of the cost. Our code is released at https://github.com/jayoohwang1/webchallenger
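
To make the idea of a section hierarchy with skimmable summaries more concrete, here is a minimal sketch. It is not the authors' PageMem implementation, whose construction rules the abstract does not specify. Everything in it is an assumption made for illustration: the set of tags treated as section boundaries (`SECTION_TAGS`), the truncation-based summary, the keyword-overlap relevance test that stands in for the agent's own judgment, and the names `Section`, `build_page_tree`, and `observe`.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumption: these tags open a new section. The real PageMem rules are not
# described in the abstract.
SECTION_TAGS = {"header", "nav", "main", "section", "article", "aside", "footer", "form"}


@dataclass
class Section:
    tag: str
    items: list = field(default_factory=list)  # text chunks and child Sections, in document order

    @property
    def children(self):
        return [item for item in self.items if isinstance(item, Section)]

    def full_text(self):
        parts = [item if isinstance(item, str) else item.full_text() for item in self.items]
        return " ".join(part for part in parts if part)

    def summary(self, max_chars=40):
        # Hypothetical summary: a plain truncation of the section text.
        return self.full_text()[:max_chars]


class SectionTreeBuilder(HTMLParser):
    """Builds the tree deterministically: same HTML in, same tree out."""

    def __init__(self):
        super().__init__()
        self.root = Section("page")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in SECTION_TAGS:
            node = Section(tag)
            self.stack[-1].items.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close a section only when the end tag matches the innermost open one.
        if tag in SECTION_TAGS and len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.stack[-1].items.append(text)


def build_page_tree(html):
    builder = SectionTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def observe(root, task, max_chars=40):
    """Return (tag, text) per top-level section: full text if the section shares
    a word with the task, otherwise only its short summary."""
    task_words = set(task.lower().split())
    observation = []
    for section in root.children:
        text = section.full_text()
        if task_words & set(text.lower().split()):
            observation.append((section.tag, text))
        else:
            observation.append((section.tag, section.summary(max_chars)))
    return observation
```

Calling `observe(build_page_tree(html), "find order 1001")` on a page with a navigation bar, an order list, and a footer would return the order list in full and the other two sections as truncated summaries. The point of the sketch is only the shape of the pipeline: the agent reads a short summary per section and pays the token cost of full detail only where the task requires it.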