WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning

Abstract

Agent systems powered by large language models (LLMs) have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and user-interaction feedback, current code agents rely only on simple code execution for feedback and verification. This approach fails to capture the actual quality of the generated code. In this paper, we propose WebGen-Agent, a novel website-generation agent that leverages comprehensive, multi-level visual feedback to iteratively generate and refine the website codebase. A visual language model (VLM) generates detailed and expressive text descriptions and suggestions about the website screenshots and GUI-agent testing, together with scores that quantify their quality. The screenshot and GUI-agent scores are further integrated with a backtracking and select-best mechanism, enhancing the performance of the agent. Utilizing the accurate visual scores inherent in the WebGen-Agent workflow, we further introduce Step-GRPO with Screenshot and GUI-agent Feedback to improve the ability of LLMs to act as the reasoning engine of WebGen-Agent. By using the screenshot and GUI-agent scores at each step as the reward in Step-GRPO, we provide a dense and reliable process supervision signal, which effectively improves the model's website-generation ability. On the WebGen-Bench dataset, WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system. Additionally, our Step-GRPO training approach increases the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9% to 45.4% and its appearance score from 3.4 to 3.7.
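The refinement loop with backtracking and select-best can be illustrated with a minimal sketch. The abstract does not say how the screenshot and GUI-agent scores are combined or when backtracking is triggered, so the sketch assumes the two scores are simply summed and that the agent backtracks to the best step so far after `patience` consecutive non-improving steps. The names `Step`, `refine`, `generate`, `evaluate`, and `patience` are hypothetical and are not taken from WebGen-Agent.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Step:
    code: str                # website codebase produced at this step
    screenshot_score: float  # VLM score for the rendered screenshot
    gui_score: float         # VLM score for the GUI-agent test

    @property
    def total(self) -> float:
        # Assumption: the two scores are combined by a plain sum.
        return self.screenshot_score + self.gui_score


def refine(
    generate: Callable[[Optional[Step]], str],
    evaluate: Callable[[str], Tuple[float, float]],
    max_steps: int = 5,
    patience: int = 2,
) -> Optional[Step]:
    """Generate, score, and refine; backtrack when progress stalls."""
    best: Optional[Step] = None
    current: Optional[Step] = None
    stalled = 0
    for _ in range(max_steps):
        code = generate(current)  # refine from the current step
        shot, gui = evaluate(code)
        step = Step(code, shot, gui)
        if best is None or step.total > best.total:
            best, current, stalled = step, step, 0
        else:
            stalled += 1
            if stalled >= patience:
                # Backtrack: resume from the best step seen so far.
                current, stalled = best, 0
            else:
                current = step
    # Select-best: return the highest-scoring step, not the last one.
    return best
```

Here `generate` stands in for the LLM call that edits the codebase and `evaluate` for the VLM scoring of screenshots and GUI-agent tests; both are injected as callables so the control flow can be exercised without any model.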
