WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Abstract

We present WebGym, the largest open-source environment to date for training realistic visual web agents. Real websites are non-stationary and diverse, so artificial or small-scale task sets are insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations, spanning diverse real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe that trains on the agent's own interaction traces (rollouts) and uses task rewards as feedback to guide learning. To scale RL, we first speed up trajectory sampling in WebGym with a high-throughput asynchronous rollout system designed specifically for web agents, which achieves a 4-5x rollout speedup over naive implementations. Second, we scale the breadth, depth, and size of the task set, which yields continued performance improvement. Fine-tuning a strong base vision-language model, Qwen-3-VL-8B-Instruct, on WebGym raises the success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking, which achieve 27.1% and 29.8%, respectively. This improvement is substantial because our test set consists only of tasks on websites never seen during training, unlike many prior works on training visual web agents.
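To make the idea of asynchronous rollout collection concrete, the following is a minimal sketch and not the WebGym implementation. It assumes a simulated browser step with variable latency, a random placeholder policy, and a random binary reward standing in for a rubric-based evaluator; the names `fake_browser_step`, `run_episode`, and `collect_rollouts` are hypothetical. The point it illustrates is that each episode advances on its own schedule, so one slow page does not stall the rest of the batch.

```python
import asyncio
import random
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task_id: int
    steps: list = field(default_factory=list)  # (observation, action) pairs
    reward: float = 0.0  # task-level reward assigned at episode end


async def fake_browser_step(rng: random.Random) -> str:
    """Hypothetical stand-in for a page load plus screenshot; latency varies."""
    await asyncio.sleep(rng.uniform(0.001, 0.01))
    return "screenshot"


async def run_episode(task_id: int, max_steps: int,
                      limit: asyncio.Semaphore) -> Trajectory:
    rng = random.Random(task_id)  # per-task seed keeps the demo deterministic
    traj = Trajectory(task_id)
    async with limit:  # cap the number of concurrently open browser sessions
        for _ in range(max_steps):
            obs = await fake_browser_step(rng)
            # Placeholder policy: a real agent would query the model here.
            action = rng.choice(["click", "type", "scroll", "stop"])
            traj.steps.append((obs, action))
            if action == "stop":
                break
        # Placeholder for a rubric-based evaluator: binary success signal.
        traj.reward = float(rng.random() < 0.3)
    return traj


async def collect_rollouts(task_ids, max_steps=5, max_concurrency=8):
    limit = asyncio.Semaphore(max_concurrency)
    jobs = [run_episode(t, max_steps, limit) for t in task_ids]
    # Episodes advance independently; results come back in task order.
    return await asyncio.gather(*jobs)


if __name__ == "__main__":
    trajectories = asyncio.run(collect_rollouts(range(16)))
    print(len(trajectories), "trajectories collected")
```

In a synchronous design, every step would wait for the slowest environment in the batch; here the semaphore only bounds how many sessions are open at once. The collected trajectories and their rewards would then feed whatever RL update is used, which this sketch deliberately leaves out.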