

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

May 6, 2025
作者: Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, Hongsheng Li
cs.AI

Abstract

LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases. Each test case specifies an operation to be performed on the website and the expected result after the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute tests on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks, Bolt.diy, OpenHands, and Aider, using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of this training set achieves an accuracy of 38.2%, surpassing the performance of the best proprietary model.
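The abstract describes each test case as an operation paired with an expected result, scored by whether the observed response matches. A minimal sketch of that shape in Python, purely illustrative: the actual WebGen-Bench schema and field names are not given in the abstract, so the class and helper below are assumptions.

```python
from dataclasses import dataclass

# Hypothetical sketch of a WebGen-Bench-style test case; the real benchmark's
# schema is not specified in the abstract, so these field names are assumed.
@dataclass
class TestCase:
    operation: str        # action the web-navigation agent performs on the site
    expected_result: str  # expected observable response after the action

def accuracy(passed: int, total: int) -> float:
    """Fraction of test cases whose observed response matched the expectation."""
    return passed / total if total else 0.0

# Illustrative test case and score: with the benchmark's 647 test cases,
# 180 passes corresponds to roughly the 27.8% reported for the best combination.
tc = TestCase(operation="click the 'Add to cart' button",
              expected_result="cart item count increments by one")
print(f"{accuracy(180, 647):.1%}")
```

In the paper's setup, the pass/fail judgment for each case is made by a web-navigation agent rather than a hard-coded assertion; the helper above only shows how per-case verdicts aggregate into the reported accuracy.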
