Insights from Benchmarking Frontier Language Models on Web App Code Generation

September 8, 2024
Author: Yi Cui
cs.AI

Abstract

This paper presents insights from evaluating 16 frontier large language models (LLMs) on the WebApp1K benchmark, a test suite designed to assess the ability of LLMs to generate web application code. The results reveal that while all models possess similar underlying knowledge, their performance is differentiated by the frequency of the mistakes they make. By analyzing lines of code (LOC) and failure distributions, we find that writing correct code is more complex than generating incorrect code. Furthermore, prompt engineering shows limited efficacy in reducing errors beyond specific cases. These findings suggest that further advancements in coding LLMs should emphasize model reliability and mistake minimization.
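
The LOC-and-failure analysis mentioned above is easy to sketch in a few lines. The following is a minimal, hypothetical illustration, assuming a simple record schema (`model`, `passed`, `loc`) and made-up sample data that are not the paper's actual format: it tabulates each model's pass rate and compares the average LOC of correct versus incorrect solutions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical benchmark records: (model name, test outcome, lines of code).
# The real WebApp1K results use their own schema; this data is illustrative only.
results = [
    ("model-a", True, 42), ("model-a", False, 58), ("model-a", True, 45),
    ("model-b", True, 40), ("model-b", False, 71), ("model-b", False, 66),
]

# Group LOC values by model and by pass/fail outcome.
by_model = defaultdict(lambda: {"pass": [], "fail": []})
for model, passed, loc in results:
    by_model[model]["pass" if passed else "fail"].append(loc)

# Report pass rate and mean LOC for correct vs. incorrect solutions.
for model, locs in sorted(by_model.items()):
    total = len(locs["pass"]) + len(locs["fail"])
    pass_rate = len(locs["pass"]) / total
    print(
        f"{model}: pass rate {pass_rate:.0%}, "
        f"mean LOC (correct) {mean(locs['pass']):.1f}, "
        f"mean LOC (incorrect) {mean(locs['fail']):.1f}"
    )
```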
