Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation

Abstract

We introduce WebApp1K, a novel benchmark for evaluating large language models (LLMs) on test-driven development (TDD) tasks, in which test cases serve as both the prompt and the verification for code generation. Unlike traditional approaches that rely on natural language prompts, the benchmark emphasizes the ability of LLMs to interpret and implement functionality directly from test cases, reflecting real-world software development practice. Comprising 1000 diverse challenges across 20 application domains, it evaluates whether LLMs can generate compact, functional code under the constraints of context length and multi-feature complexity. Our findings highlight instruction following and in-context learning as critical capabilities for TDD success, more important than general coding proficiency or pretraining knowledge. Through a comprehensive evaluation of 19 frontier models, we reveal performance bottlenecks, such as instruction loss in long prompts, and provide a detailed error analysis spanning multiple root causes. This work underscores the practical value of TDD-specific benchmarks and lays the foundation for advancing LLM capabilities in rigorous, application-driven coding scenarios.
