Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation
May 13, 2025
Author: Yi Cui
cs.AI
Abstract
We introduce WebApp1K, a novel benchmark for evaluating large language models
(LLMs) in test-driven development (TDD) tasks, where test cases serve as both
prompt and verification for code generation. Unlike traditional approaches
relying on natural language prompts, our benchmark emphasizes the ability of
LLMs to interpret and implement functionality directly from test cases,
reflecting real-world software development practices. Comprising 1000 diverse
challenges across 20 application domains, the benchmark evaluates LLMs on their
ability to generate compact, functional code under the constraints of context
length and multi-feature complexity. Our findings highlight instruction
following and in-context learning as critical capabilities for TDD success,
surpassing the importance of general coding proficiency or pretraining
knowledge. Through comprehensive evaluation of 19 frontier models, we reveal
performance bottlenecks, such as instruction loss in long prompts, and provide
a detailed error analysis spanning multiple root causes. This work underscores
the practical value of TDD-specific benchmarks and lays the foundation for
advancing LLM capabilities in rigorous, application-driven coding scenarios.
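
To make the tests-as-prompt setup concrete, the sketch below shows what such a challenge could look like: a unit test is handed to the model verbatim as the prompt, and a candidate solution is accepted only if the test passes. This is a minimal hypothetical illustration, assuming a React component tested with Jest and React Testing Library; the component name, props, and file layout are invented for the example and are not taken from the benchmark itself.

// --- Prompt: a test case given to the model verbatim (LikeButton.test.tsx) ---
// Hypothetical WebApp1K-style challenge; names and testing stack are illustrative assumptions.
import "@testing-library/jest-dom";
import { render, screen, fireEvent } from "@testing-library/react";
import LikeButton from "./LikeButton"; // the file the model is asked to generate

test("clicking Like increments the visible count", () => {
  render(<LikeButton initialCount={0} />);
  fireEvent.click(screen.getByRole("button", { name: /like/i }));
  expect(screen.getByText("1 like")).toBeInTheDocument();
});

// --- One possible model output that would pass verification (LikeButton.tsx) ---
import { useState } from "react";

export default function LikeButton({ initialCount }: { initialCount: number }) {
  const [count, setCount] = useState(initialCount);
  // The test, not a natural-language spec, dictates the button label and count text.
  return (
    <div>
      <button onClick={() => setCount(count + 1)}>Like</button>
      <span>{count} like</span>
    </div>
  );
}

Passing the test is the entire acceptance criterion in this setup, which is why the abstract identifies instruction following and in-context learning, rather than recall of pretrained API knowledge, as the abilities under stress.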