What Went Wrong? Process-Level Evaluation of Web Agents Based on Semantic State Tracking

Abstract

Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance for improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic Markov decision process (MDP) alongside the graphical user interface (GUI): the agent operates on the interface while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation. Using these semantic trajectories, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster between 31% and 33% differ in exploration reach and execution accuracy. Decomposing by skill then characterizes these differences and exposes opposite per-skill rankings hidden within a single website: on Housing, for example, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet trails it by 15.6% on filtering, pinpointing a concrete skill to improve even within one domain. Bifurcation analysis further localizes the decisive error that loses the task and shows that this error is agent-specific rather than shared. Finally, these differences widen as tasks grow harder: success rates are similar on easy tasks but separate sharply as exploration becomes more demanding. Our process-level analysis opens a new avenue for web agent evaluation, providing fine-grained, actionable insight into where and how each agent should be improved.
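To make the idea of background semantic state tracking concrete, the following is a minimal sketch and not WebStep's actual implementation. The names `SemanticTracker`, `exploration_reach`, and `first_divergence` are hypothetical, as are the state and action labels. The metric definitions are assumptions chosen for illustration: reach is taken to be the fraction of task-relevant semantic states visited, and the divergence helper is only a crude stand-in for bifurcation analysis that compares logged actions against a reference action sequence.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transition:
    """One logged step of the semantic trajectory."""
    state: str
    action: str
    next_state: str


@dataclass
class SemanticTracker:
    """Hypothetical background logger: the agent acts on the GUI, while the
    environment calls record() with the resulting high-level transition."""
    state: str
    trajectory: list = field(default_factory=list)

    def record(self, action: str, next_state: str) -> None:
        self.trajectory.append(Transition(self.state, action, next_state))
        self.state = next_state


def exploration_reach(trajectory, required_states):
    """Assumed definition: fraction of required semantic states visited."""
    if not required_states:
        return 1.0
    visited = {t.next_state for t in trajectory}
    if trajectory:
        visited.add(trajectory[0].state)
    return len(visited & set(required_states)) / len(required_states)


def first_divergence(trajectory, reference_actions):
    """Index of the first logged step whose action departs from the reference
    sequence, or None if every logged step matches (assumed simplification)."""
    for i, t in enumerate(trajectory):
        if i >= len(reference_actions) or t.action != reference_actions[i]:
            return i
    return None


tracker = SemanticTracker(state="home")
tracker.record("open_search", "search")
tracker.record("apply_filter", "results_filtered")
tracker.record("open_listing", "listing")

required = {"search", "results_filtered", "listing", "application_submitted"}
reach = exploration_reach(tracker.trajectory, required)
step = first_divergence(
    tracker.trajectory, ["open_search", "apply_filter", "submit_application"]
)
print(reach, step)
```

Because the log is built from semantic transitions rather than raw clicks, two agents with the same final outcome can still be compared step by step, which is the kind of comparison the process metrics above rely on.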