AsyncWebRL: Efficient Multi-Step Reinforcement Learning for Visual Web Agents

Abstract

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations. Paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, it delivers up to a 2.9× end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer 1/|τ_i| in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failed trajectories are systematically longer than successful ones, the normalizer down-weights the negative gradient on tokens from failures, so the policy keeps producing verbose memory schemas. Replacing 1/|τ_i| with a constant 1/k breaks this coupling, contracting trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).
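
The effect of the normalizer can be illustrated with a minimal sketch. It is not the AsyncWebRL implementation; it only computes the scalar weight that multiplies each token's log-probability gradient under the two normalizers. The function names, the choice to measure |τ_i| in tokens, the use of the population standard deviation for the group-relative advantage, and the value k = 20 are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-relative advantages (r - mean) / std over one rollout group.

    Assumption: population std; real GRPO implementations differ in details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rewards equal: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def per_token_weights(rewards, lengths, k=None):
    """Weight applied to every token of trajectory i.

    k=None uses the per-trajectory normalizer 1/|tau_i|;
    otherwise the constant normalizer 1/k is used (k is hypothetical here).
    """
    advantages = group_advantages(rewards)
    if k is None:
        return [a / n for a, n in zip(advantages, lengths)]
    return [a / k for a in advantages]


# Toy group: two short successes and two long failures.
rewards = [1.0, 1.0, 0.0, 0.0]
lengths = [10, 10, 40, 40]  # assumed to be token counts

print("1/|tau_i|:", per_token_weights(rewards, lengths))
print("1/k      :", per_token_weights(rewards, lengths, k=20))
```

In this toy group the failures are four times longer than the successes, so under 1/|τ_i| each failed token receives a weight one quarter the magnitude of a successful token's, even though the advantages have equal magnitude. Under the constant 1/k, tokens from successes and failures are weighted with equal magnitude, so the penalty on a failed token no longer shrinks as the failure gets longer.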