TENET: Leveraging Tests Beyond Validation for Code Generation

Abstract

Test-Driven Development (TDD) is a widely adopted software engineering practice in which developers create and execute tests alongside code implementation, so that software behavior is continuously validated and refined. In the era of vibe coding, where developers increasingly delegate code writing to large language models (LLMs) by specifying high-level intentions, TDD becomes even more crucial: test cases serve as executable specifications that explicitly define and verify intended functionality beyond what natural-language descriptions and code context can convey. Although vibe coding under TDD is promising, it faces three main challenges: (1) selecting a small yet effective test suite that improves generation accuracy while keeping the execution workload under control, (2) effectively retrieving context such as relevant code, and (3) systematically using test feedback to refine code. To address these challenges, we introduce TENET, an LLM agent for generating functions in complex real-world repositories under the TDD setting. TENET features three components: (1) a novel test harness mechanism that selects a concise test suite to maximize the diversity of target usage scenarios; (2) a tailored agent toolset that efficiently retrieves relevant code and supports interactive debugging; and (3) a reflection-based refinement workflow that iteratively analyzes failures, replenishes context, and applies code refinement. TENET achieves 69.08% and 81.77% Pass@1 on the RepoCod and RepoEval benchmarks, outperforming the best agentic baselines by 9.49 and 2.17 percentage points, respectively. In addition, this is the first study of test-driven code generation with repository-level context; it examines how different aspects of test suites affect the performance of LLM agents under the TDD setting.
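
The abstract does not say how the test harness picks its concise suite, so the following is only an illustrative sketch of one common way to trade suite size against scenario diversity: a greedy maximum-coverage heuristic. Everything here is an assumption made for illustration, not TENET's actual mechanism: the function name `select_diverse_tests`, the `budget` parameter, and the idea of describing each test by a set of hypothetical "usage scenario" labels.

```python
from typing import Dict, List, Set


def select_diverse_tests(test_features: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedily pick up to `budget` tests, each adding the most unseen scenarios.

    `test_features` maps a test name to a set of hypothetical usage-scenario
    labels. This is an assumed representation, not one taken from the paper.
    """
    covered: Set[str] = set()
    selected: List[str] = []
    remaining = dict(test_features)
    while remaining and len(selected) < budget:
        # Iterate over sorted names so that ties are broken deterministically.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # no remaining test contributes a new scenario
        selected.append(best)
        covered |= remaining.pop(best)
    return selected


# Hypothetical test names and scenario labels, for illustration only.
tests = {
    "test_empty_input": {"empty"},
    "test_basic": {"basic"},
    "test_basic_again": {"basic"},
    "test_large_and_unicode": {"large", "unicode"},
}
chosen = select_diverse_tests(tests, budget=3)
print(chosen)
```

In this toy input, `test_basic_again` is never chosen because it covers no scenario that `test_basic` does not already cover, which is the kind of redundancy a small, diverse suite is meant to avoid.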
