TENET: Leveraging Tests Beyond Validation for Code Generation

Abstract

Test-Driven Development (TDD) is a widely adopted software engineering practice that requires developers to create and execute tests alongside code implementation, ensuring that software behavior is continuously validated and refined. In the era of vibe coding, where developers increasingly delegate code writing to large language models (LLMs) by specifying high-level intentions, TDD becomes even more crucial, as test cases serve as executable specifications that explicitly define and verify intended functionality beyond what natural-language descriptions and code context can convey. While vibe coding under TDD is promising, there are three main challenges: (1) selecting a small yet effective test suite to improve the generation accuracy and control the execution workload, (2) retrieving context such as relevant code effectively, and (3) systematically using test feedback for effective code refinement. To address these challenges, we introduce TENET, an LLM agent for generating functions in complex real-world repositories under the TDD setting. TENET features three components: (1) a novel test harness mechanism that selects a concise test suite to maximize diversity of target usage scenarios; (2) a tailored agent toolset that performs efficient retrieval of relevant code with interactive debugging; and (3) a reflection-based refinement workflow that iteratively analyzes failures, replenishes context, and applies code refinement. TENET achieves 69.08% and 81.77% Pass@1 on RepoCod and RepoEval benchmarks, outperforming the best agentic baselines by 9.49 and 2.17 percentage points, respectively. In addition, this is the first study of test-driven code generation with repository-level context, examining how different aspects of test suites affect the performance of LLM agents under the TDD setting.
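To make the first component more concrete, the following is a minimal sketch of one way a small test suite could be chosen to maximize the diversity of usage scenarios. The abstract does not describe TENET's actual selection mechanism, so this is only an illustration under stated assumptions: each test is assumed to be tagged with the set of usage scenarios it exercises, and the function name `select_diverse_tests`, the `budget` parameter, and the sample test names are all hypothetical. The technique shown is plain greedy set coverage, not necessarily the authors' method.

```python
from typing import Dict, List, Set


def select_diverse_tests(test_scenarios: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedily pick up to `budget` tests that cover the most distinct scenarios.

    `test_scenarios` maps a test name to the (assumed, hypothetical) set of
    usage scenarios that the test exercises.
    """
    selected: List[str] = []
    covered: Set[str] = set()
    remaining = dict(test_scenarios)

    while remaining and len(selected) < budget:
        # Choose the test that adds the most not-yet-covered scenarios.
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # no remaining test contributes a new scenario
        selected.append(best)
        covered |= remaining.pop(best)

    return selected


# Hypothetical example data: test names and scenario tags are made up.
tests = {
    "test_empty_input": {"empty"},
    "test_basic": {"basic"},
    "test_basic_and_unicode": {"basic", "unicode"},
    "test_unicode": {"unicode"},
    "test_large_input": {"large"},
}
chosen = select_diverse_tests(tests, budget=3)
print(chosen)
```

With this sample data, three tests are enough to cover all four scenarios, so the two redundant tests are never selected even when the budget allows more. A greedy choice like this keeps the suite small, which matches the stated goal of controlling execution workload, but it is only one of several reasonable selection strategies.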
