WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning

Abstract

Agent systems powered by large language models (LLMs) have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and user-interaction feedback, current code agents rely only on simple code execution for feedback and verification, an approach that fails to capture the actual quality of the generated code. In this paper, we propose WebGen-Agent, a novel website-generation agent that leverages comprehensive, multi-level visual feedback to iteratively generate and refine the website codebase. A visual language model (VLM) generates detailed and expressive text descriptions and suggestions regarding the screenshots and GUI-agent testing of the websites, together with scores that quantify their quality. The screenshot and GUI-agent scores are further integrated with a backtracking and select-best mechanism, enhancing the performance of the agent. Utilizing the accurate visual scores inherent in the WebGen-Agent workflow, we further introduce Step-GRPO with Screenshot and GUI-agent Feedback to improve the ability of LLMs to act as the reasoning engine of WebGen-Agent. By using the screenshot and GUI-agent scores at each step as the reward in Step-GRPO, we provide a dense and reliable process supervision signal, which effectively improves the model's website-generation ability. On the WebGen-Bench dataset, WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system. Additionally, our Step-GRPO training approach increases the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9% to 45.4% and its appearance score from 3.4 to 3.7.
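
The step-level reward idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the two scores are combined or normalized, so the plain sum in `step_rewards`, the group-wide mean/std normalization in `group_relative_advantages`, the tie-breaking rule in `select_best_step`, and all function names and score values here are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def step_rewards(screenshot_scores, gui_scores):
    """Combine per-step screenshot and GUI-agent scores into one reward per step.

    Assumption: a plain sum; the abstract does not specify the combination rule.
    """
    return [s + g for s, g in zip(screenshot_scores, gui_scores)]


def group_relative_advantages(group_rewards, eps=1e-8):
    """Normalize each step reward against all step rewards in the sampled group.

    Assumption: GRPO-style normalization (subtract the group mean, divide by the
    group standard deviation), applied per step rather than per trajectory.
    """
    flat = [r for trajectory in group_rewards for r in trajectory]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(r - mu) / (sigma + eps) for r in trajectory]
            for trajectory in group_rewards]


def select_best_step(rewards):
    """Select-best sketch: index of the highest-scoring step (earliest wins ties)."""
    return max(range(len(rewards)), key=lambda i: rewards[i])


# Hypothetical scores for a group of two trajectories with two steps each.
group = [
    step_rewards([3, 4], [2, 5]),
    step_rewards([1, 2], [1, 2]),
]
advantages = group_relative_advantages(group)
best_step = select_best_step(group[0])
```

Because every step receives its own reward, each step's tokens can be credited separately, which is what makes the supervision signal dense compared with a single end-of-trajectory score.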
