AsyncWebRL: Efficient Multi-Step Reinforcement Learning for Visual Web Agents

Abstract

Training vision-language web agents with multi-step RL is compute-intensive, and two forms of inefficiency dominate: GPUs that sit idle in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations. Paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, it delivers up to a 2.9× end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer 1/|τ_i| in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failed trajectories are systematically longer than successful ones, the normalizer down-weights the negative gradient on failed tokens, so the policy keeps producing verbose memory schemas. Replacing 1/|τ_i| with a constant 1/k breaks this coupling, shortening trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).
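The effect of the normalizer can be illustrated with a small numerical sketch. This is not the paper's implementation: the trajectory lengths, the rewards, the value k=500, and the helper names `group_advantages` and `per_token_coefficients` are all hypothetical, chosen only to show how 1/|τ_i| scales the per-token coefficient differently for a short success and a long failure, while a constant 1/k does not.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std, or zeros if all rewards match."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def per_token_coefficients(lengths, advantages, k=None):
    """Coefficient multiplying each token's gradient term in trajectory i.

    k=None uses the per-trajectory normalizer 1/|tau_i|;
    otherwise every trajectory shares the constant 1/k.
    """
    coeffs = []
    for n, adv in zip(lengths, advantages):
        norm = 1.0 / n if k is None else 1.0 / k
        coeffs.append(adv * norm)
    return coeffs


# Hypothetical group: one short success and one long failure.
lengths = [200, 800]   # tokens per trajectory (made-up numbers)
rewards = [1.0, 0.0]   # 1 = success, 0 = failure
adv = group_advantages(rewards)

by_length = per_token_coefficients(lengths, adv)        # 1/|tau_i|
by_const = per_token_coefficients(lengths, adv, k=500)  # 1/k, k chosen arbitrarily

print("1/|tau_i|:", by_length)
print("1/k      :", by_const)
```

In this toy group the failed trajectory is four times longer, so under 1/|τ_i| each of its tokens receives a coefficient one quarter the magnitude of a success token's. Under the constant 1/k both trajectories get the same per-token magnitude, so the penalty on a failed token no longer shrinks as the failure grows longer.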