WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

Abstract

LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases. Each test case specifies an operation to be performed on the website and the expected result after the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute the tests on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks, Bolt.diy, OpenHands, and Aider, using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of this training set achieves an accuracy of 38.2%, surpassing the performance of the best proprietary model.
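As an illustration of the evaluation described above, the following minimal sketch shows one way a test case (an operation plus its expected result) and the resulting accuracy could be represented. The names `WebTestCase` and `accuracy`, the example test case, and the plain pass/fail scoring are assumptions made for illustration; the abstract does not specify the benchmark's actual data format or scoring rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebTestCase:
    """Hypothetical record for one test case: what to do and what should happen."""
    operation: str        # action performed on the generated website
    expected_result: str  # outcome expected after the action


def accuracy(verdicts: list[bool]) -> float:
    """Percentage of test cases judged as passing (assumed simple pass/fail scoring)."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    return 100.0 * sum(verdicts) / len(verdicts)


# Illustrative test case; the wording is invented, not taken from the benchmark.
case = WebTestCase(
    operation="Add an item to the shopping cart",
    expected_result="The cart shows one item",
)

# Verdicts would come from a web-navigation agent; here they are hard-coded.
score = accuracy([True, False, False, True])
```

In this sketch, each verdict would be the navigation agent's judgment of whether the observed response matches `expected_result` for one test case.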
