WebChallenger: A Reliable and Efficient Generalist Web Agent

Abstract

Autonomous web navigation remains challenging for LLM agents. The strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We argue that this gap stems not from insufficient model capability but from agent architectures that fail to replicate three human cognitive advantages: selective attention to relevant page regions, persistent memory of website structure, and procedural fluency with common interaction patterns. We introduce WebChallenger, a web agent framework that addresses each gap through architecture design rather than model scale. The framework is built around PageMem, a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries. On this shared substrate we build three mechanisms that mirror the three cognitive advantages: a divide-and-conquer observation pipeline that lets the agent skim section summaries and extract details only from task-relevant regions; a lightweight exploration and memory system that traverses each website once to build a reusable map of pages and element behaviors; and compound action workflows that collapse common multi-step interactions into single agent actions, handling partial state changes automatically. Because all three mechanisms operate over PageMem, the framework generalizes across websites without site-specific adapters. Using off-the-shelf open-weight models without fine-tuning, our system achieves 56.3% on WebArena, 48.7% on VisualWebArena, 51.0% on Online-Mind2Web, and 70.9% on WorkArena, approaching frontier proprietary systems at a fraction of the cost. Our code is released at https://github.com/jayoohwang1/webchallenger
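
To make the skim-then-extract idea concrete, the following is a minimal illustrative sketch, not the released implementation. It assumes a page has already been turned into a hierarchy of sections with short summaries. The names `Section`, `skim`, `select_relevant`, and `observe` are hypothetical, and a simple keyword match stands in for the relevance judgment that an LLM would presumably make in the actual system.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Section:
    """One node in a hypothetical PageMem-style hierarchy (illustrative only)."""
    title: str
    summary: str                      # short description the agent can skim
    text: str = ""                    # full detail, read only when relevant
    children: list[Section] = field(default_factory=list)


def skim(section: Section, depth: int = 0) -> list[str]:
    """Return one indented 'title: summary' line per section, depth first."""
    lines = [f"{'  ' * depth}{section.title}: {section.summary}"]
    for child in section.children:
        lines.extend(skim(child, depth + 1))
    return lines


def select_relevant(section: Section,
                    is_relevant: Callable[[str], bool]) -> list[Section]:
    """Collect every section whose summary passes the relevance check."""
    hits = [section] if is_relevant(section.summary) else []
    for child in section.children:
        hits.extend(select_relevant(child, is_relevant))
    return hits


def observe(root: Section, task_keywords: list[str]) -> dict[str, str]:
    """Skim summaries, then extract full text only from relevant sections.

    The keyword match below is an assumed stand-in for a model's judgment.
    """
    def is_relevant(summary: str) -> bool:
        lowered = summary.lower()
        return any(keyword.lower() in lowered for keyword in task_keywords)

    return {s.title: s.text for s in select_relevant(root, is_relevant)}


# A toy page with made-up content, used only to exercise the functions above.
page = Section(
    "page",
    "Product listing page for an online store",
    children=[
        Section("nav", "Site navigation links", text="Home | Orders | Account"),
        Section("results", "Search results with product prices",
                text="Widget A $10; Widget B $12"),
        Section("footer", "Legal notices and contact", text="(c) Example Store"),
    ],
)

print("\n".join(skim(page)))
print(observe(page, ["prices"]))
```

In this sketch only the `results` section is expanded for a price-related task, so the navigation and footer text never enter the agent's context. That is the cost-saving effect the observation pipeline is described as targeting.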