From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements
May 17, 2026
Authors: Yuxuan Wan, Tingshuo Liang, Jiakai Xu, Jingyu Xiao, Yintong Huo, Michael R Lyu
cs.AI
Abstract
Coding agents can generate web applications from natural-language descriptions, yet a recent benchmark study shows that generated applications fail to meet functional requirements in over 70% of cases. The core difficulty is that web correctness cannot be assessed from source files or terminal output: the application must be deployed and exercised through simulated browser interactions, and the observed failures must be translated into actionable repair signals. Current agents cannot perform these steps without human mediation.
We present TDDev, a framework that automates this closed loop in three stages: (1) converting high-level requirements into structured acceptance tests before any code is written, (2) deploying the application and validating it through browser-based interaction simulation, and (3) translating browser-observed failures into structured repair reports for the coding agent. Using TDDev, we conduct the first controlled empirical study of test-driven development (TDD) strategies for web application generation, comparing four development protocols across two coding agents, two backbone models, and two benchmarks. TDD infrastructure consistently improves generation quality by 34 to 48 percentage points over a no-TDD baseline. The central finding is that the optimal protocol depends on the model's generation style: models that build applications holistically benefit most from agentic enforcement, while models that extend code conservatively benefit most from incremental enforcement. Mismatching protocol to generation style eliminates the TDD benefit entirely while multiplying token cost by up to 25 times. A user study confirms that TDDev reduces manual developer intervention to zero, shifting the workload from continuous prompt engineering to autonomous, feedback-driven refinement.
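To make the three-stage loop concrete, the following is a minimal Python sketch of its control flow. It is an illustration under stated assumptions, not TDDev's implementation: the names `AcceptanceTest`, `RepairReport`, and `run_tdd_loop`, the callback signatures, and the toy stubs standing in for the coding agent and the browser are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Tuple


@dataclass
class AcceptanceTest:
    """Hypothetical schema for one acceptance test derived from a requirement."""
    name: str
    expected: str  # outcome that should be observable in the browser


@dataclass
class RepairReport:
    """Hypothetical schema for one browser-observed failure sent back to the agent."""
    test_name: str
    observed: str
    expected: str


def run_tdd_loop(
    requirements: List[str],
    derive_tests: Callable[[List[str]], List[AcceptanceTest]],
    generate_code: Callable,   # (requirements, tests, previous_code, reports) -> code
    run_in_browser: Callable,  # (code, test) -> observed outcome
    max_rounds: int = 5,
) -> Tuple[object, Optional[int]]:
    """Return (code, round in which all tests passed), or (code, None) if the budget runs out."""
    # Stage 1: tests are derived before any code is written.
    tests = derive_tests(requirements)
    code = generate_code(requirements, tests, None, [])
    for round_no in range(1, max_rounds + 1):
        # Stage 2: exercise the deployed application once per acceptance test.
        reports = []
        for test in tests:
            observed = run_in_browser(code, test)
            if observed != test.expected:
                # Stage 3: turn each failure into a structured repair report.
                reports.append(RepairReport(test.name, observed, test.expected))
        if not reports:
            return code, round_no
        code = generate_code(requirements, tests, code, reports)
    return code, None


# Toy stand-ins (assumptions for illustration only): the "application" is the
# set of implemented feature names, and the "agent" repairs one failure per round.
def toy_derive_tests(requirements: List[str]) -> List[AcceptanceTest]:
    return [AcceptanceTest(name=r, expected="ok") for r in requirements]


def toy_agent(requirements, tests, previous_code, reports) -> FrozenSet[str]:
    if previous_code is None:
        return frozenset(requirements[:1])  # first draft covers one feature
    return previous_code | {reports[0].test_name}


def toy_browser(code: FrozenSet[str], test: AcceptanceTest) -> str:
    return "ok" if test.name in code else "missing"


def demo() -> Tuple[object, Optional[int]]:
    return run_tdd_loop(["login", "search", "checkout"],
                        toy_derive_tests, toy_agent, toy_browser)
```

In a real setting, `run_in_browser` would deploy the application and drive it with a browser automation tool, and `generate_code` would call the coding agent. The sketch only shows how failures observed in the browser flow back to the agent as structured reports until every acceptance test passes or the round budget is exhausted.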