Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation

May 13, 2025
Author: Yi Cui
cs.AI

Abstract

We introduce WebApp1K, a novel benchmark for evaluating large language models (LLMs) in test-driven development (TDD) tasks, where test cases serve as both prompt and verification for code generation. Unlike traditional approaches relying on natural language prompts, our benchmark emphasizes the ability of LLMs to interpret and implement functionality directly from test cases, reflecting real-world software development practices. Comprising 1000 diverse challenges across 20 application domains, the benchmark evaluates LLMs on their ability to generate compact, functional code under the constraints of context length and multi-feature complexity. Our findings highlight instruction following and in-context learning as critical capabilities for TDD success, surpassing the importance of general coding proficiency or pretraining knowledge. Through comprehensive evaluation of 19 frontier models, we reveal performance bottlenecks, such as instruction loss in long prompts, and provide a detailed error analysis spanning multiple root causes. This work underscores the practical value of TDD-specific benchmarks and lays the foundation for advancing LLM capabilities in rigorous, application-driven coding scenarios.
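To make the test-as-prompt setup concrete, below is a minimal sketch of what such a challenge could look like, assuming a Jest and React Testing Library stack typical of web-app test suites. The `LoginForm` component, its `onSubmit` prop, and the test body are hypothetical illustrations rather than an actual WebApp1K task: the model receives only the test file and must generate an implementation that satisfies it.

```tsx
// Hypothetical test-as-prompt challenge (illustrative, not from WebApp1K).
// The model sees only this file; it must generate ./LoginForm so the test passes.
import React from 'react';
import { render, screen, fireEvent } from '@testing-library/react';
import LoginForm from './LoginForm'; // implementation to be generated

test('submits the entered username via the onSubmit callback', () => {
  const onSubmit = jest.fn();
  render(<LoginForm onSubmit={onSubmit} />);

  // The test encodes the functional spec: an input labeled "Username"
  // and a "Log in" button that forwards the entered value on click.
  fireEvent.change(screen.getByLabelText('Username'), {
    target: { value: 'alice' },
  });
  fireEvent.click(screen.getByRole('button', { name: 'Log in' }));

  expect(onSubmit).toHaveBeenCalledWith('alice');
});
```

The same file then doubles as verification: the generated component is accepted only if the suite passes, so no natural-language specification is involved at any point.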