AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

Abstract

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: GPUs that sit idle in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations. It is paired with two web-agent-specific adaptations, an everlasting rollout pool and lightweight screenshot handling, and together these deliver up to a 2.9× speedup in end-to-end training throughput over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer 1/|τ_i| in multi-step GRPO as the root cause of both trajectory-level and token-level inefficiency: because failed trajectories are systematically longer than successful ones, the normalizer down-weights the negative gradient on tokens from failures, so the policy keeps producing verbose memory schemas. Replacing 1/|τ_i| with a constant 1/k breaks this coupling between length and gradient weight, contracting trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the prior best of 42.9%), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).
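
The effect of the normalizer can be illustrated with a small numeric sketch. It is not the authors' implementation. It assumes that |τ_i| counts the tokens in trajectory i, that advantages are group-normalized as in standard GRPO, and that each token's gradient weight is the trajectory advantage divided by the normalizer. The function names, the rewards, the lengths, and the value k = 500 are all hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(rewards):
    """Group-normalized advantages, (r - mean) / std, as in standard GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

def per_token_weights(rewards, lengths, k=None):
    """Gradient weight applied to each token of each trajectory.

    k=None uses the per-trajectory normalizer 1/|tau_i| (assumed token count);
    a numeric k uses the constant normalizer 1/k.
    """
    advantages = group_advantages(rewards)
    weights = []
    for adv, length in zip(advantages, lengths):
        norm = length if k is None else k
        weights.append(adv / norm)
    return weights

# Hypothetical group: two short successes, two long failures.
rewards = [1.0, 1.0, 0.0, 0.0]
lengths = [200, 200, 800, 800]

per_traj = per_token_weights(rewards, lengths)         # 1/|tau_i|
constant = per_token_weights(rewards, lengths, k=500)  # 1/k, k is illustrative

for name, weights in (("1/|tau_i|", per_traj), ("1/k", constant)):
    success, failure = weights[0], weights[2]
    print(f"{name:>10}: success token {success:+.5f}, failure token {failure:+.5f}")
```

Under these assumptions, the per-trajectory normalizer gives each token of a failure a quarter of the weight magnitude given to each token of a success, because the failures are four times longer. With a constant 1/k, the two magnitudes are equal, so longer failures contribute proportionally more negative gradient in total.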