

TENET: Leveraging Tests Beyond Validation for Code Generation

September 29, 2025
作者: Yiran Hu, Nan Jiang, Shanchao Liang, Yi Wu, Lin Tan
cs.AI

Abstract

Test-Driven Development (TDD) is a widely adopted software engineering practice that requires developers to create and execute tests alongside code implementation, ensuring that software behavior is continuously validated and refined. In the era of vibe coding, where developers increasingly delegate code writing to large language models (LLMs) by specifying high-level intentions, TDD becomes even more crucial, as test cases serve as executable specifications that explicitly define and verify intended functionality beyond what natural-language descriptions and code context can convey. While vibe coding under TDD is promising, there are three main challenges: (1) selecting a small yet effective test suite to improve the generation accuracy and control the execution workload, (2) retrieving context such as relevant code effectively, and (3) systematically using test feedback for effective code refinement. To address these challenges, we introduce TENET, an LLM agent for generating functions in complex real-world repositories under the TDD setting. TENET features three components: (1) a novel test harness mechanism that selects a concise test suite to maximize diversity of target usage scenarios; (2) a tailored agent toolset that performs efficient retrieval of relevant code with interactive debugging; and (3) a reflection-based refinement workflow that iteratively analyzes failures, replenishes context, and applies code refinement. TENET achieves 69.08% and 81.77% Pass@1 on RepoCod and RepoEval benchmarks, outperforming the best agentic baselines by 9.49 and 2.17 percentage points, respectively. In addition, this is the first study of test-driven code generation with repository-level context, examining how different aspects of test suites affect the performance of LLM agents under the TDD setting.
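The two mechanisms the abstract highlights — selecting a small, scenario-diverse test suite and a reflection-based generate/run/refine loop — can be sketched roughly as below. This is a minimal illustrative sketch, not TENET's actual design: every name here (`TestCase`, `select_diverse_tests`, `refine_until_pass`, and the `fake_*` stubs standing in for the LLM and test runner) is a hypothetical assumption for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TestCase:
    name: str
    scenario: str  # coarse label for the usage scenario the test exercises

def select_diverse_tests(tests, budget):
    """Greedy stand-in for a diversity-maximizing suite selector:
    take one test per unseen scenario first, then top up to the budget."""
    chosen, seen = [], set()
    for t in tests:
        if t.scenario not in seen:
            chosen.append(t)
            seen.add(t.scenario)
    for t in tests:  # fill remaining budget with leftover tests, if any
        if len(chosen) >= budget:
            break
        if t not in chosen:
            chosen.append(t)
    return chosen[:budget]

def refine_until_pass(generate, run_suite, suite, max_iters=3):
    """Iterate: generate code, run the suite, feed failures back as
    reflection context, and regenerate until the suite passes."""
    code = generate(feedback=None)
    for _ in range(max_iters):
        failures = run_suite(code, suite)
        if not failures:
            return code, True
        code = generate(feedback=failures)  # reflection: failures shape the retry
    return code, False

# Demo with stubs standing in for the LLM and the test runner.
suite = select_diverse_tests(
    [TestCase("t1", "happy-path"), TestCase("t2", "happy-path"),
     TestCase("t3", "edge-case")],
    budget=2,
)
attempts = {"n": 0}
def fake_generate(feedback):
    attempts["n"] += 1
    return f"v{attempts['n']}"  # pretend each regeneration improves the code
def fake_run(code, suite):
    return [] if code == "v2" else [f"{t.name} failed" for t in suite]
code, ok = refine_until_pass(fake_generate, fake_run, suite)
```

The greedy scenario-first pass is one simple way to keep the suite small while still covering distinct usage scenarios, which is the trade-off the abstract describes between generation accuracy and execution workload.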
PDF · September 30, 2025