
Test-Driven AI Agent Definition (TDAD): Compiling Tool-Using Agents from Behavioral Specifications

March 9, 2026
Author: Tzafrir Rehan
cs.AI

Abstract

We present Test-Driven AI Agent Definition (TDAD), a methodology that treats agent prompts as compiled artifacts: engineers provide behavioral specifications, a coding agent converts them into executable tests, and a second coding agent iteratively refines the prompt until tests pass. Deploying tool-using LLM agents in production requires measurable behavioral compliance that current development practices cannot provide. Small prompt changes cause silent regressions, tool misuse goes undetected, and policy violations emerge only after deployment. To mitigate specification gaming, TDAD introduces three mechanisms: (1) visible/hidden test splits that withhold evaluation tests during compilation, (2) semantic mutation testing via a post-compilation agent that generates plausible faulty prompt variants, with the harness measuring whether the test suite detects them, and (3) spec evolution scenarios that quantify regression safety when requirements change. We evaluate TDAD on SpecSuite-Core, a benchmark of four deeply-specified agents spanning policy compliance, grounded analytics, runbook adherence, and deterministic enforcement. Across 24 independent trials, TDAD achieves 92% v1 compilation success with 97% mean hidden pass rate; evolved specifications compile at 58%, with most failed runs passing all visible tests except 1-2, and show 86-100% mutation scores, 78% v2 hidden pass rate, and 97% regression safety scores. The implementation is available as an open benchmark at https://github.com/f-labs-io/tdad-paper-code.
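The compile loop the abstract describes (a specification converted to executable tests, a prompt iteratively refined until the visible tests pass, then scored on withheld hidden tests) can be sketched as follows. This is a minimal illustrative sketch, not the paper's harness: all names (`compile_agent`, `refine`, `CompileResult`) are hypothetical, and the real implementation lives in the linked repository.

```python
# Hypothetical sketch of the TDAD visible/hidden split: the prompt is
# refined against *visible* tests only; *hidden* tests are withheld
# during compilation and used purely for evaluation afterwards.
from dataclasses import dataclass
from typing import Callable, List

Test = Callable[[str], bool]  # a behavioral check over the compiled prompt


@dataclass
class CompileResult:
    prompt: str
    compiled: bool            # True if all visible tests pass
    hidden_pass_rate: float   # fraction of withheld tests passed


def compile_agent(spec: str,
                  refine: Callable[[str, List[str]], str],
                  visible: List[Test],
                  hidden: List[Test],
                  max_iters: int = 10) -> CompileResult:
    prompt = spec  # start from the raw behavioral specification
    for _ in range(max_iters):
        failures = [f"test_{i}" for i, t in enumerate(visible) if not t(prompt)]
        if not failures:
            break
        # Stand-in for the second coding agent that rewrites the prompt
        # given the list of failing visible tests.
        prompt = refine(prompt, failures)
    compiled = all(t(prompt) for t in visible)
    hidden_rate = (sum(t(prompt) for t in hidden) / len(hidden)
                   if hidden else 1.0)
    return CompileResult(prompt, compiled, hidden_rate)
```

Because the hidden tests never influence `refine`, a high `hidden_pass_rate` indicates the compiled prompt generalized to the specification rather than gaming the visible suite.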
March 16, 2026