From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

Abstract

Coding agents can generate web applications from natural-language descriptions, yet a recent benchmark study shows that the generated applications fail to meet functional requirements in over 70% of cases. The core difficulty is that web correctness cannot be assessed from source files or terminal output: the application must be deployed and exercised through simulated browser interactions, and the resulting failures must be translated into actionable repair signals. Current agents cannot perform these steps without human mediation. We present TDDev, a framework that automates this closed loop through three stages: (1) converting high-level requirements into structured acceptance tests before any code is written, (2) deploying the application and validating it through browser-based interaction simulation, and (3) translating browser-observed failures into structured repair reports for the coding agent. Enabled by TDDev, we conduct the first controlled empirical study of test-driven development (TDD) strategies for web application generation, comparing four development protocols across two coding agents, two backbone models, and two benchmarks. TDD infrastructure consistently improves generation quality by 34 to 48 percentage points over a no-TDD baseline. The central finding is that the optimal protocol depends on the model's generation style: models that build applications holistically benefit most from agentic enforcement, while models that extend code conservatively benefit most from incremental enforcement. Mismatching the protocol to the generation style eliminates the TDD benefit entirely while multiplying token cost up to 25-fold. A user study confirms that TDDev reduces manual developer intervention to zero, shifting the workload from continuous prompt engineering to autonomous, feedback-driven refinement.
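The three-stage closed loop described above can be illustrated with a minimal sketch. This is not TDDev's implementation: every name here (`AcceptanceTest`, `RepairReport`, `derive_tests`, `validate`, `toy_agent`, `tdd_loop`) is hypothetical, the "application" is a plain dictionary rather than a deployed web app, and each browser interaction is replaced by a simple in-process check. The sketch only shows the control flow: tests are derived before any code exists, failures are collected into a structured report, and the report drives the next repair round.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AcceptanceTest:
    name: str
    # Stand-in for a simulated browser interaction (assumption): it receives
    # the "deployed" app and returns True if the requirement is satisfied.
    check: Callable[[dict], bool]


@dataclass
class RepairReport:
    failed: List[str]  # names of acceptance tests that did not pass


def derive_tests(requirements: List[str]) -> List[AcceptanceTest]:
    """Stage 1: turn each requirement into an acceptance test before coding."""
    return [
        AcceptanceTest(name=req, check=lambda app, r=req: r in app["features"])
        for req in requirements
    ]


def validate(app: dict, tests: List[AcceptanceTest]) -> RepairReport:
    """Stages 2 and 3: exercise the app and collect failures into a report."""
    return RepairReport(failed=[t.name for t in tests if not t.check(app)])


def toy_agent(app: dict, report: RepairReport) -> dict:
    """Toy coding agent (assumption): repairs one failed requirement per round."""
    if report.failed:
        app["features"].add(report.failed[0])
    return app


def tdd_loop(
    requirements: List[str],
    agent: Callable[[dict, RepairReport], dict],
    max_rounds: int = 10,
) -> Tuple[dict, RepairReport, int]:
    """Run the test, report, repair loop until all tests pass or rounds run out."""
    tests = derive_tests(requirements)
    app = {"features": set()}  # nothing is implemented yet
    report = validate(app, tests)
    rounds = 0
    while report.failed and rounds < max_rounds:
        app = agent(app, report)
        report = validate(app, tests)
        rounds += 1
    return app, report, rounds


if __name__ == "__main__":
    app, report, rounds = tdd_loop(["login", "signup", "checkout"], toy_agent)
    print(sorted(app["features"]), report.failed, rounds)
```

In this toy setting the loop ends once the report is empty. Capping `max_rounds` reflects the cost concern raised above, since each additional repair round would consume further tokens in a real agent.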