Where Did It Go Wrong? Process-Level Evaluation of Web Agents via Semantic State Tracking

Abstract

Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic MDP alongside the GUI: the agent operates on the interface, while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation. Based on the semantic trajectory, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster within 31-33% diverge in exploration reach versus execution accuracy. Then, decomposing by skill characterizes the nature of these differences, exposing opposite per-skill rankings hidden within the same website: e.g., on Housing, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet underperforms it by 15.6% on filtering, pinpointing a concrete skill to improve even within a domain. Bifurcation analysis further localizes the decisive error that loses the task and shows that this error is agent-specific rather than shared. Finally, these differences widen as tasks grow harder: success rate is similar on easy tasks but separates sharply as exploration becomes more demanding. Our process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved.
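
To make the idea of background semantic state tracking concrete, the following is a minimal sketch, not WebStep's actual implementation. It assumes a toy deterministic semantic MDP written as a dictionary, and it assumes one possible reading of the "decisive error": the first transition that moves the agent from a state where the goal is still reachable to a state where it is not. The state names, action names, and the helper functions `record_trajectory` and `first_losing_step` are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical toy semantic MDP (not from WebStep): state -> {action: next_state}.
TOY_MDP = {
    "home": {"open_search": "search"},
    "search": {"apply_filter": "filtered", "open_listing": "wrong_listing"},
    "filtered": {"open_listing": "right_listing"},
    "right_listing": {"commit": "done"},
    "wrong_listing": {"commit": "wrong_done"},
    "done": {},
    "wrong_done": {},
}


def can_reach(mdp, start, goal):
    """Breadth-first search: is `goal` reachable from `start`?"""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return True
        for nxt in mdp[state].values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def record_trajectory(mdp, start, actions):
    """Replay high-level actions and log (state, action, next_state) triples."""
    state, trajectory = start, []
    for action in actions:
        nxt = mdp[state][action]
        trajectory.append((state, action, nxt))
        state = nxt
    return trajectory


def first_losing_step(mdp, trajectory, goal):
    """Index of the first step after which the goal is unreachable (assumed definition)."""
    for i, (state, _action, nxt) in enumerate(trajectory):
        if can_reach(mdp, state, goal) and not can_reach(mdp, nxt, goal):
            return i
    return None
```

In this toy setting, an agent that skips the filter and opens a listing directly (`["open_search", "open_listing", "commit"]` from `"home"`) loses the task at step index 1, while a run that applies the filter first has no losing step. A real environment would derive the semantic states from the website's backend rather than from a hand-written dictionary.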