
Test-Driven AI Agent Definition (TDAD): Compiling Tool-Using Agents from Behavioral Specifications

March 9, 2026
Author: Tzafrir Rehan
cs.AI

Abstract

We present Test-Driven AI Agent Definition (TDAD), a methodology that treats agent prompts as compiled artifacts: engineers provide behavioral specifications, a coding agent converts them into executable tests, and a second coding agent iteratively refines the prompt until the tests pass. Deploying tool-using LLM agents in production requires measurable behavioral compliance that current development practices cannot provide: small prompt changes cause silent regressions, tool misuse goes undetected, and policy violations emerge only after deployment. To mitigate specification gaming, TDAD introduces three mechanisms: (1) visible/hidden test splits that withhold evaluation tests during compilation; (2) semantic mutation testing via a post-compilation agent that generates plausible faulty prompt variants, with the harness measuring whether the test suite detects them; and (3) spec evolution scenarios that quantify regression safety when requirements change. We evaluate TDAD on SpecSuite-Core, a benchmark of four deeply-specified agents spanning policy compliance, grounded analytics, runbook adherence, and deterministic enforcement. Across 24 independent trials, TDAD achieves 92% v1 compilation success with a 97% mean hidden-test pass rate; evolved specifications compile at 58%, with most failed runs passing all but one or two visible tests, and show 86-100% mutation scores, a 78% v2 hidden-test pass rate, and 97% regression safety scores. The implementation is available as an open benchmark at https://github.com/f-labs-io/tdad-paper-code.
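The core compile loop and mutation scoring described above can be sketched in a few lines. This is an illustrative reconstruction, not code from the paper's repository: `compile_agent` and `mutation_score` are hypothetical names, and the `refine_prompt` callable stands in for the second coding agent that rewrites the prompt between iterations.

```python
def compile_agent(visible_tests, refine_prompt, max_iters=10):
    """Iteratively refine a prompt until every visible test passes.

    visible_tests: list of (name, test_fn) pairs, test_fn(prompt) -> bool.
    refine_prompt: callable(prompt, failed_names) -> new prompt; stands in
        for the second coding agent in the TDAD loop.
    Returns (prompt, success). Hidden tests are withheld from this loop
    and would be scored separately after compilation.
    """
    prompt = ""
    for _ in range(max_iters):
        failed = [name for name, test in visible_tests if not test(prompt)]
        if not failed:
            return prompt, True
        prompt = refine_prompt(prompt, failed)
    return prompt, False


def mutation_score(test_suite, mutant_prompts):
    """Fraction of faulty prompt variants 'killed' by the test suite.

    A mutant is killed if at least one test fails on it; surviving
    mutants indicate behaviors the suite cannot distinguish.
    """
    killed = sum(
        1 for mutant in mutant_prompts
        if any(not test(mutant) for _, test in test_suite)
    )
    return killed / len(mutant_prompts)
```

For instance, with toy tests that merely check the prompt for required keywords and a refiner that appends the missing ones, the loop converges in two iterations, and a mutant prompt satisfying both tests survives mutation scoring; real TDAD tests would instead execute the agent against tool-use scenarios.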