テスト駆動型AIエージェント定義（TDAD）：行動仕様からのツール利用エージェントのコンパイル

要旨

本論文では、エージェントのプロンプトをコンパイル成果物として扱う手法であるTest-Driven AI Agent Definition（TDAD）を提案する。この手法では、エンジニアが行動仕様を提供し、コーディングエージェントがそれを実行可能なテストに変換した後、第二のコーディングエージェントがテスト合格までプロンプトを反復的に改良する。ツール利用LLMエージェントを本番環境に導入するには、現在の開発手法では達成できない測定可能な行動準拠性が求められる。わずかなプロンプト変更が検知不能な機能退行を引き起こし、ツール誤用は検出されず、ポリシー違反はデプロイ後に初めて顕在化する。仕様ゲーミングを軽減するため、TDADは3つのメカニズムを導入する：（1）コンパイル時に評価テストを非公開とする可視/非公開テスト分割、（2）コンパイル後エージェントによる妥当な欠陥プロンプト変種を生成する意味的変異テスト（テストスイートの検出能力をハーネスが計測）、（3）要件変更時の退行安全性を定量化する仕様進化シナリオ。TDADを、ポリシー準拠、根拠に基づく分析、ランブック遵守、決定的実行の4領域を網羅する詳細仕様化エージェントベンチマークSpecSuite-Coreで評価した。24回の独立試行において、TDADはv1コンパイル成功率92%（非公開テスト平均合格率97%）を達成。進化仕様では58%がコンパイル成功し、失敗実行の大半は1-2テストを除く全可視テストを通過、86-100%の変異スコア、v2非公開テスト合格率78%、退行安全性スコア97%を示した。実装はhttps://github.com/f-labs-io/tdad-paper-code でオープンベンチマークとして公開されている。

English

We present Test-Driven AI Agent Definition (TDAD), a methodology that treats agent prompts as compiled artifacts: engineers provide behavioral specifications, a coding agent converts them into executable tests, and a second coding agent iteratively refines the prompt until tests pass. Deploying tool-using LLM agents in production requires measurable behavioral compliance that current development practices cannot provide. Small prompt changes cause silent regressions, tool misuse goes undetected, and policy violations emerge only after deployment. To mitigate specification gaming, TDAD introduces three mechanisms: (1) visible/hidden test splits that withhold evaluation tests during compilation, (2) semantic mutation testing via a post-compilation agent that generates plausible faulty prompt variants, with the harness measuring whether the test suite detects them, and (3) spec evolution scenarios that quantify regression safety when requirements change. We evaluate TDAD on SpecSuite-Core, a benchmark of four deeply-specified agents spanning policy compliance, grounded analytics, runbook adherence, and deterministic enforcement. Across 24 independent trials, TDAD achieves 92% v1 compilation success with 97% mean hidden pass rate; evolved specifications compile at 58%, with most failed runs passing all visible tests except 1-2, and show 86-100% mutation scores, 78% v2 hidden pass rate, and 97% regression safety scores. The implementation is available as an open benchmark at https://github.com/f-labs-io/tdad-paper-code.

テスト駆動型AIエージェント定義（TDAD）：行動仕様からのツール利用エージェントのコンパイル

Test-Driven AI Agent Definition (TDAD): Compiling Tool-Using Agents from Behavioral Specifications

要旨

Support