WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

Abstract

While Large Language Models (LLMs) excel at function-level code generation, project-level tasks such as generating functional and visually aesthetic multi-page websites remain highly challenging. Existing work is often limited to single-page static websites, while agentic frameworks typically rely on multi-turn execution with proprietary models, leading to substantial token costs, high latency, and brittle integration. Training a small LLM end-to-end with reinforcement learning (RL) is a promising alternative, yet it faces a critical bottleneck: designing reliable and computationally feasible rewards for website generation. Unlike single-file coding tasks, which can be verified by unit tests, website generation requires evaluating inherently subjective aesthetics, cross-page interactions, and functional correctness. To this end, we propose WebGen-R1, an end-to-end RL framework tailored for project-level website generation. We first introduce a scaffold-driven structured generation paradigm that constrains the large open-ended action space and preserves architectural integrity. We then design a cascaded multimodal reward that couples structural guarantees with execution-grounded functional feedback and vision-based aesthetic supervision. Extensive experiments demonstrate that WebGen-R1 transforms a 7B base model from generating nearly nonfunctional websites into producing deployable, aesthetically aligned multi-page websites. Remarkably, WebGen-R1 not only consistently outperforms much larger open-source models (up to 72B), but also rivals the state-of-the-art DeepSeek-R1 (671B) in functional success, while substantially exceeding it in valid rendering and aesthetic alignment. These results position WebGen-R1 as a viable path for scaling small open models from function-level code generation to project-level web application generation.
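
To make the idea of a cascaded reward concrete, the following is a minimal sketch of how structural, functional, and aesthetic signals might be combined so that earlier stages gate later ones. It is an illustration only and not the reward used in WebGen-R1: the `RolloutSignals` fields, the penalty values, and the weights `w_func` and `w_aes` are all assumptions introduced here, since the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class RolloutSignals:
    """Hypothetical per-rollout measurements; field names are illustrative."""
    scaffold_valid: bool     # generated project matches the expected scaffold
    renders: bool            # site builds and renders without errors
    functional_score: float  # fraction of functional checks passed, in [0, 1]
    aesthetic_score: float   # rating from a vision model, in [0, 1]


def cascaded_reward(
    s: RolloutSignals,
    w_func: float = 0.5,           # assumed weight, not from the paper
    w_aes: float = 0.5,            # assumed weight, not from the paper
    structure_penalty: float = -1.0,
    render_penalty: float = -0.5,
) -> float:
    """Gate each stage on the previous one, then mix the graded scores."""
    # Stage 1: a structurally broken project is rejected before anything
    # expensive (building, browsing, vision scoring) would be run.
    if not s.scaffold_valid:
        return structure_penalty
    # Stage 2: a project that does not render cannot be scored further.
    if not s.renders:
        return render_penalty
    # Stage 3: only renderable sites receive functional and aesthetic credit.
    return w_func * s.functional_score + w_aes * s.aesthetic_score
```

Under these assumptions, the cheap structural check short-circuits the costlier execution and vision stages, which is one way such a reward could remain computationally feasible during RL training.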