WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning
September 26, 2025
Authors: Zimu Lu, Houxing Ren, Yunqiao Yang, Ke Wang, Zhuofan Zong, Junting Pan, Mingjie Zhan, Hongsheng Li
cs.AI
Abstract
Agent systems powered by large language models (LLMs) have demonstrated
impressive performance on repository-level code-generation tasks. However, for
tasks such as website codebase generation, which depend heavily on visual
effects and user-interaction feedback, current code agents rely only on simple
code execution for feedback and verification. This approach fails to capture
the actual quality of the generated code. In this paper, we propose
WebGen-Agent, a novel website-generation agent that leverages comprehensive and
multi-level visual feedback to iteratively generate and refine the website
codebase. A visual language model (VLM) generates detailed and expressive text
descriptions and suggestions about the website's screenshots and GUI-agent
test results, together with scores that quantify their quality. The
screenshot and GUI-agent scores are further integrated with a backtracking and
select-best mechanism, enhancing the performance of the agent. Utilizing the
accurate visual scores inherent in the WebGen-Agent workflow, we further
introduce Step-GRPO with Screenshot and GUI-agent Feedback to improve
the ability of LLMs to act as the reasoning engine of WebGen-Agent. By using
the screenshot and GUI-agent scores at each step as the reward in Step-GRPO, we
provide a dense and reliable process supervision signal, which effectively
improves the model's website-generation ability. On the WebGen-Bench dataset,
WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9%
and its appearance score from 3.0 to 3.9, outperforming the previous
state-of-the-art agent system. Additionally, our Step-GRPO training approach
increases the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9% to 45.4% and
raises the appearance score from 3.4 to 3.7.
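
The select-best mechanism and the step-level reward signal described above can be illustrated with a minimal sketch. The abstract does not say how the screenshot and GUI-agent scores are combined, what scale they use, or how Step-GRPO normalizes them, so everything below is an assumption made for illustration: the `Step` class, the summed reward, the tie-breaking rule, and the function names are hypothetical. The normalization follows the usual GRPO pattern of standardizing rewards within a group.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Step:
    """One generation step of the agent (hypothetical structure)."""
    index: int
    screenshot_score: float  # VLM score for the rendered page (scale assumed)
    gui_score: float         # VLM score for the GUI-agent test (scale assumed)

    @property
    def reward(self) -> float:
        # Assumption: the two scores are simply summed into one reward.
        return self.screenshot_score + self.gui_score


def select_best(steps):
    """Return the highest-reward step; ties go to the later step (assumed)."""
    return max(steps, key=lambda s: (s.reward, s.index))


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within a group, as in GRPO-style training."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    trajectory = [
        Step(0, 3.0, 2.0),
        Step(1, 4.0, 4.0),
        Step(2, 4.0, 4.0),
        Step(3, 2.0, 5.0),
    ]
    best = select_best(trajectory)
    print("best step:", best.index)
    print(group_relative_advantages([s.reward for s in trajectory]))
```

Because each step carries its own score, every step in a trajectory receives a reward rather than only the final one, which is the sense in which the abstract calls the supervision signal dense.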