

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

May 6, 2025
作者: Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, Hongsheng Li
cs.AI

Abstract

LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases. Each test case specifies an operation to be performed on the website and the expected result after the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute tests on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks, Bolt.diy, OpenHands, and Aider, using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of this training set achieves an accuracy of 38.2%, surpassing the performance of the best proprietary model.
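The evaluation protocol described in the abstract — each test case pairs an operation on the generated website with an expected result, a web-navigation agent executes the operation, and overall accuracy is the fraction of cases whose observed response matches — can be sketched roughly as follows. All names here (`TestCase`, `grade`, the judge callable) are illustrative assumptions, not identifiers from the released benchmark.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TestCase:
    """Hypothetical mirror of a WebGen-Bench test case:
    an operation to perform and the expected outcome."""
    operation: str
    expected_result: str


def grade(cases: List[TestCase], judge: Callable[[TestCase], bool]) -> float:
    """Accuracy = fraction of test cases where the judge (standing in
    for the web-navigation agent) finds the observed response
    consistent with the expected result."""
    passed = sum(1 for case in cases if judge(case))
    return passed / len(cases)


# Toy examples of the operation/expectation pairing.
cases = [
    TestCase("click the 'Add to cart' button", "cart count increments to 1"),
    TestCase("submit the login form empty", "a validation error is shown"),
]

# A trivial stand-in judge; the paper uses an agent that navigates the
# site, performs the operation, and compares the observed response.
always_pass = lambda case: True
print(grade(cases, always_pass))  # 1.0
```

The design point worth noting is that the pass/fail decision is delegated to an agent rather than to brittle DOM assertions, which is what makes the benchmark automatable across arbitrarily structured generated websites.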

