From Runnable to Deliverable: Generating Full-Stack Web Applications from Requirements via Multi-Agent Test-Driven Development

Abstract

Coding agents can generate web applications from natural-language descriptions, yet a recent benchmark study shows that generated applications fail to meet functional requirements in over 70% of cases. The core difficulty is that web correctness cannot be assessed from source files or terminal output: the application must be deployed and exercised through simulated browser interactions, and its failures must be translated into actionable repair signals. Current agents cannot perform these steps without human mediation. We present TDDev, a framework that automates this closed loop through three stages: (1) converting high-level requirements into structured acceptance tests before any code is written, (2) deploying the application and validating it through browser-based interaction simulation, and (3) translating browser-observed failures into structured repair reports for the coding agent. Enabled by TDDev, we conduct the first controlled empirical study of test-driven development (TDD) strategies for web application generation, comparing four development protocols across two coding agents, two backbone models, and two benchmarks. TDD infrastructure consistently improves generation quality by 34 to 48 percentage points over a no-TDD baseline. The central finding is that the optimal protocol depends on the model's generation style: models that build applications holistically benefit most from agentic enforcement, while models that extend code conservatively benefit most from incremental enforcement. Mismatching protocol to generation style eliminates the TDD benefit entirely while multiplying token cost by up to 25-fold. A user study confirms that TDDev reduces manual developer intervention to zero, shifting the workload from continuous prompt engineering to autonomous, feedback-driven refinement.
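The three-stage loop described above can be illustrated with a minimal Python sketch. This is not TDDev's implementation: every name here (`AcceptanceTest`, `Failure`, `derive_tests`, `validate`, `repair_report`, `tdd_loop`, `one_fix_per_round_agent`) is hypothetical, the "application" is reduced to a dictionary mapping a browser action to the page text it produces, and the coding agent is a stub that repairs one failing test per round. A real system would deploy the app and drive an actual browser.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AcceptanceTest:
    requirement: str  # requirement this test checks
    action: str       # simulated browser action
    expected: str     # page text expected after the action

@dataclass(frozen=True)
class Failure:
    requirement: str
    action: str
    expected: str
    observed: str

def derive_tests(requirements):
    """Stage 1 (stub): one acceptance test per requirement, before any code exists."""
    return [AcceptanceTest(r, f"exercise: {r}", f"ok: {r}") for r in requirements]

def validate(app, tests):
    """Stage 2 (stub): 'run' each action against the app and record mismatches.

    Assumption: the app is a dict mapping action -> resulting page text.
    """
    failures = []
    for t in tests:
        observed = app.get(t.action, "<no response>")
        if observed != t.expected:
            failures.append(Failure(t.requirement, t.action, t.expected, observed))
    return failures

def repair_report(failures):
    """Stage 3: turn observed failures into a structured, agent-readable report."""
    return "\n".join(
        f"[FAIL] {f.requirement}: after '{f.action}' "
        f"expected '{f.expected}', observed '{f.observed}'"
        for f in failures
    )

def tdd_loop(requirements, agent, max_rounds=5):
    """Run generate -> validate -> report until all tests pass or rounds run out.

    Returns (app, rounds_used); rounds_used is None if the budget was exhausted.
    """
    tests = derive_tests(requirements)
    app, report = {}, ""
    for round_no in range(1, max_rounds + 1):
        app = agent(app, tests, report)
        failures = validate(app, tests)
        if not failures:
            return app, round_no
        report = repair_report(failures)
    return app, None

def one_fix_per_round_agent(app, tests, report):
    """Hypothetical stand-in for a coding agent: implements one missing behavior."""
    for t in tests:
        if app.get(t.action) != t.expected:
            return {**app, t.action: t.expected}
    return app

if __name__ == "__main__":
    app, rounds = tdd_loop(["login", "add item", "checkout"], one_fix_per_round_agent)
    print(f"passed after {rounds} round(s); {len(app)} behaviors implemented")
```

The stub agent extends the app one behavior at a time, loosely mirroring the conservative generation style mentioned above. Replacing it with an agent that fills in every failing behavior in a single call would correspond to the holistic style and would finish in one round.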