AsyncWebRL: Efficient Multi-Step Reinforcement Learning for Visual Web Agents

Abstract

Training vision-language web agents with multi-step RL is compute-intensive, and two forms of inefficiency dominate: GPUs that sit idle under synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations. It is paired with two web-agent-specific adaptations, an everlasting rollout pool and lightweight screenshot handling, and together these deliver up to a 2.9× end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer 1/|τ_i| in multi-step GRPO as the root cause of both trajectory-level and token-level inefficiency: because failed trajectories are systematically longer than successful ones, the normalizer down-weights the negative gradient on failed tokens, so the policy keeps producing verbose memory schemas. Replacing 1/|τ_i| with a constant 1/k breaks this coupling, shortening trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the prior best of 42.9%), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).
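The normalizer argument can be illustrated with a toy calculation. The sketch below is not the paper's loss; it assumes binary rewards, a standard group-normalized advantage, and an abstract trajectory "length" (steps or tokens), and it uses hypothetical names (`group_advantages`, `per_unit_weights`) and an arbitrary illustrative constant `K = 10`. It only shows how dividing by |τ_i| shrinks the per-unit weight on long failures, while a constant 1/k gives every unit the same magnitude.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std over one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same, so there is no learning signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def per_unit_weights(rewards, lengths, k=None):
    """Weight applied to each unit (step or token) of each trajectory.

    k=None  -> per-trajectory normalizer 1/|tau_i|
    k=const -> constant normalizer 1/k (k is an illustrative assumption here)
    """
    adv = group_advantages(rewards)
    if k is None:
        return [a / n for a, n in zip(adv, lengths)]
    return [a / k for a in adv]


# Toy group: two short successes and two long failures (assumed values).
rewards = [1.0, 1.0, 0.0, 0.0]
lengths = [5, 5, 20, 20]
K = 10  # hypothetical constant, not taken from the paper

per_traj = per_unit_weights(rewards, lengths)        # uses 1/|tau_i|
constant = per_unit_weights(rewards, lengths, k=K)   # uses 1/k

# Total gradient mass per trajectory = per-unit weight * length.
mass_per_traj = [w * n for w, n in zip(per_traj, lengths)]
mass_constant = [w * n for w, n in zip(constant, lengths)]

for name, weights, mass in [
    ("1/|tau_i|", per_traj, mass_per_traj),
    ("1/k", constant, mass_constant),
]:
    print(name, "per-unit:", weights, "total:", mass)
```

In this toy group, the 1/|τ_i| normalizer gives each unit of a long failure a quarter of the magnitude given to each unit of a short success, and every trajectory contributes the same total mass regardless of length. With a constant 1/k, per-unit magnitudes are equal and a longer failure accumulates proportionally more negative mass, which is consistent with the abstract's claim that the constant normalizer removes the length coupling.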