AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

Abstract

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant sources of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations. It is paired with two web-agent-specific adaptations, an everlasting rollout pool and lightweight screenshot handling, and together these deliver up to a 2.9× end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer 1/|τ_i| in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failures are systematically longer than successes, it down-weights the negative gradient on tokens from failed trajectories, so the policy keeps producing verbose memory schemas. Replacing 1/|τ_i| with a constant 1/k breaks this coupling, shortening trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).
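
The effect of the normalizer can be illustrated with a toy calculation. The sketch below is not the paper's implementation. It assumes that |τ_i| counts the tokens in trajectory i, that each token's gradient is scaled by the trajectory advantage divided by the normalizer, and that k is an arbitrary fixed constant (200 here); the function name, advantages, and lengths are all hypothetical.

```python
def per_token_weights(advantages, lengths, constant_k=None):
    """Weight applied to each token's gradient, one value per trajectory.

    With constant_k=None the normalizer is the trajectory's own length
    (1/|tau_i|); otherwise it is the fixed constant 1/k.
    """
    weights = []
    for advantage, length in zip(advantages, lengths):
        normalizer = length if constant_k is None else constant_k
        weights.append(advantage / normalizer)
    return weights


# Hypothetical group: one short success and one long failure.
advantages = [1.0, -1.0]   # assumed advantages (success, failure)
lengths = [100, 400]       # assumed token counts; the failure is 4x longer

length_normalized = per_token_weights(advantages, lengths)
constant_normalized = per_token_weights(advantages, lengths, constant_k=200)

print("1/|tau_i|:", length_normalized)
print("1/k      :", constant_normalized)
```

Under these assumptions, the length-based normalizer gives each failed token a quarter of the gradient magnitude that a successful token receives, whereas the constant normalizer gives both the same magnitude regardless of trajectory length.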