WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

Abstract

LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases. Each test case specifies an operation to be performed on the website and the expected result after the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute tests on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks, Bolt.diy, OpenHands, and Aider, using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of this training set achieves an accuracy of 38.2%, surpassing the performance of the best proprietary model.
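To make the evaluation setup concrete, the following is a minimal sketch of how a test case and the accuracy metric could be represented. It assumes a simple pass/fail verdict per test case; the names `WebTestCase` and `accuracy`, the example test case, and the verdicts are all hypothetical illustrations, not the benchmark's actual data format or scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebTestCase:
    """One functional check: an operation and the result expected afterwards.

    Hypothetical structure; the benchmark's real schema may differ.
    """
    operation: str
    expected_result: str


def accuracy(verdicts: list[bool]) -> float:
    """Return the percentage of test cases judged as passing.

    Assumes one boolean verdict per test case (an assumption of this sketch).
    """
    if not verdicts:
        raise ValueError("no verdicts to score")
    return 100.0 * sum(verdicts) / len(verdicts)


# Hypothetical test case, not taken from the benchmark.
case = WebTestCase(
    operation="Submit the contact form with a valid email address",
    expected_result="A confirmation message is displayed",
)

# Hypothetical verdicts from a navigation agent for four test cases.
verdicts = [True, False, False, True]
print(f"{accuracy(verdicts):.1f}%")  # prints "50.0%"
```

In this sketch, a navigation agent would perform `operation` on the generated website, compare what it observes with `expected_result`, and contribute one verdict per test case to the accuracy figure.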
