AsyncWebRL: Efficient Multi-Step Reinforcement Learning for Visual Web Agents

Abstract

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations, paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, that together deliver up to a 2.9× end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer 1/|τ_i| in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failures are systematically longer than successes, it down-weights the negative gradient on failed tokens, so the policy keeps producing verbose memory schemas. Replacing 1/|τ_i| with a constant 1/k breaks this coupling, contracting trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).
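The effect of the normalizer can be illustrated with a small numerical sketch. It is not the authors' implementation: the group-normalized advantage, the helper names `group_advantages` and `per_token_weights`, the two-trajectory group, and the value k = 500 are all assumptions chosen for illustration, since the abstract does not specify the exact objective or how k is set. The sketch only computes the scalar weight that multiplies each token's log-probability gradient under the two normalizers.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-relative advantages (assumed form): (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def per_token_weights(rewards, lengths, k=None):
    """Weight applied to every token of each trajectory.

    k=None  -> per-trajectory normalizer 1/|tau_i|
    k=const -> constant normalizer 1/k (k is a hypothetical value here)
    """
    adv = group_advantages(rewards)
    if k is None:
        return [a / n for a, n in zip(adv, lengths)]
    return [a / k for a in adv]


# Hypothetical group: one short success and one long failure.
rewards = [1.0, 0.0]
lengths = [200, 800]  # tokens per trajectory

per_traj = per_token_weights(rewards, lengths)        # uses 1/|tau_i|
const = per_token_weights(rewards, lengths, k=500)    # uses 1/k

print("1/|tau_i| per-token weights:", per_traj)
print("1/k       per-token weights:", const)
```

Under these assumptions, the per-trajectory normalizer gives each token of the 800-token failure a quarter of the weight magnitude given to each token of the 200-token success, whereas the constant normalizer gives tokens from both trajectories the same magnitude. Summed over tokens, the long failure therefore contributes a larger total negative signal under 1/k, which is consistent with the abstract's claim that the change contracts trajectories.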