WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
January 29, 2026
Authors: Yao Zhang, Shijie Tang, Zeyu Li, Zhen Han, Volker Tresp
cs.AI
Abstract
Web agents hold great potential for automating complex computer tasks, yet their interactions involve long-horizon, sequential decision-making with irreversible actions. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of process reward models for web navigation (WebPRMs), but existing approaches remain limited: scalar WebPRMs collapse progress into coarse, weakly grounded signals, while checklist-based WebPRMs rely on brittle template matching that fails under layout or semantic changes and often mislabels superficially correct actions as successful, providing little insight or interpretability. To address these challenges, we introduce WebArbiter, a reasoning-first, principle-guided WebPRM that formulates reward modeling as text generation, producing structured justifications that conclude with a preference verdict and identify the action most conducive to task completion in the current context. Training follows a two-stage pipeline: reasoning distillation equips the model with coherent principle-guided reasoning, and reinforcement learning corrects teacher biases by directly aligning verdicts with correctness, enabling stronger generalization. To support systematic evaluation, we release WebPRMBench, a comprehensive benchmark spanning four diverse web environments with rich tasks and high-quality preference annotations. On WebPRMBench, WebArbiter-7B outperforms the strongest baseline, GPT-5, by 9.1 points. In reward-guided trajectory search on WebArena-Lite, it surpasses the best prior WebPRM by up to 7.2 points, underscoring its robustness and practical value in real-world complex web tasks.
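The abstract's core idea of casting reward modeling as text generation — a judge that writes a structured justification ending in a preference verdict, which then steers action selection — can be sketched as follows. This is a minimal illustration under assumed interfaces; the `judge` callable, the `Verdict: candidate N` format, and all function names here are hypothetical stand-ins, not WebArbiter's actual prompt or output schema.

```python
# Hypothetical sketch of reward-guided action selection with a generative PRM:
# the judge emits free-text reasoning ending in a verdict line, and the agent
# keeps the candidate action the judge prefers at each step.
import re
from typing import Callable, List

def parse_verdict(judgment: str) -> int:
    """Extract the preferred candidate's index from the judge's final
    verdict line, e.g. 'Verdict: candidate 2'. Defaults to 0 if absent."""
    match = re.search(r"Verdict:\s*candidate\s*(\d+)", judgment)
    return int(match.group(1)) if match else 0

def select_action(judge: Callable[[str, List[str]], str],
                  context: str, candidates: List[str]) -> str:
    """Ask the generative reward model to compare candidate actions and
    return the one it judges most conducive to task completion."""
    judgment = judge(context, candidates)
    idx = parse_verdict(judgment)
    return candidates[min(idx, len(candidates) - 1)]

# Toy judge standing in for WebArbiter: always prefers candidate 1.
toy_judge = lambda ctx, cands: "Reasoning: ...\nVerdict: candidate 1"
best = select_action(toy_judge, "task: buy a book",
                     ["click('ad')", "click('search')"])
print(best)  # prints click('search')
```

Because the verdict is embedded in generated text rather than a scalar head, the same output carries both the decision and its interpretable justification — the property the abstract contrasts against scalar and checklist-based WebPRMs.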