Where Did It Go Wrong? Process-Level Evaluation of Web Agents via Semantic State Tracking

Abstract

Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance for improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic MDP alongside the GUI: the agent operates on the interface while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation. Using these semantic trajectories, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster within 31-33% diverge in exploration reach and execution accuracy. Second, decomposing by skill characterizes the nature of these differences and exposes opposite per-skill rankings hidden within the same website. For example, on Housing, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet underperforms it by 15.6% on filtering, which pinpoints a concrete skill to improve even within a single domain. Third, bifurcation analysis localizes the decisive error that causes a task to fail and shows that this error is agent-specific rather than shared across agents. Finally, these differences widen as tasks grow harder: success rates are similar on easy tasks but separate sharply as exploration becomes more demanding. Our process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved.
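To make the idea of background semantic state tracking concrete, the following is a minimal sketch and not the WebStep implementation. It assumes a toy deterministic MDP (`TOY_MDP`, with invented state and action names), logs each transition as a `(state, action, next_state)` triple, and uses one possible reading of "decisive error": the first transition after which the goal state is no longer reachable. All names and this definition are illustrative assumptions; the abstract does not specify how bifurcation points are computed.

```python
from collections import deque

# Hypothetical toy semantic MDP: state -> {action: next_state}.
# States and actions are invented for illustration only.
TOY_MDP = {
    "home": {"search": "results", "open_help": "help"},
    "results": {"apply_filter": "filtered", "open_item": "wrong_item"},
    "filtered": {"open_item": "item"},
    "item": {"commit": "done"},
    "wrong_item": {"commit": "wrong_done"},
    "help": {"back": "home"},
    "done": {},
    "wrong_done": {},
}


def can_reach(mdp, start, goal):
    """Breadth-first search: is `goal` reachable from `start`?"""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return True
        for nxt in mdp[state].values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def record_trajectory(mdp, start, actions):
    """Replay actions and log (state, action, next_state) transitions."""
    log = []
    state = start
    for action in actions:
        nxt = mdp[state][action]
        log.append((state, action, nxt))
        state = nxt
    return log


def first_unrecoverable_step(mdp, log, goal):
    """Index of the first transition that makes `goal` unreachable.

    Returns None if the goal stays reachable after every transition.
    This is an assumed stand-in for a bifurcation point.
    """
    for i, (_, _, nxt) in enumerate(log):
        if not can_reach(mdp, nxt, goal):
            return i
    return None


failed = record_trajectory(
    TOY_MDP, "home", ["open_help", "back", "search", "open_item", "commit"]
)
succeeded = record_trajectory(
    TOY_MDP, "home", ["search", "apply_filter", "open_item", "commit"]
)
```

In this toy example the detour through `help` is recoverable, so it is not flagged; the flagged step is the one that opens an item before filtering. Under these assumptions, a per-step log of this kind is what would let a skill-level breakdown (for example, filtering versus commit actions) be computed without manual annotation.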