WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Abstract

Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents rely heavily on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents: the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4% and that of GLM-4-9B from 6.1% to 43%. These open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform the previous state-of-the-art web agent trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.
