From Runnable to Deployable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

Abstract

Coding agents can generate web applications from natural-language descriptions, yet a recent benchmark study shows that over 70% of generated applications fail to meet their functional requirements. The core difficulty is that web correctness cannot be assessed from source files or terminal output alone: the application must be deployed and exercised through simulated browser interactions, and the resulting failures must be translated into actionable repair signals. Current agents cannot perform these steps without human mediation. We present TDDev, a framework that automates this closed loop in three stages: (1) converting high-level requirements into structured acceptance tests before any code is written, (2) deploying the application and validating it through browser-based interaction simulation, and (3) translating browser-observed failures into structured repair reports for the coding agent. Using TDDev, we conduct the first controlled empirical study of test-driven development (TDD) strategies for web application generation, comparing four development protocols across two coding agents, two backbone models, and two benchmarks. TDD infrastructure consistently improves generation quality by 34 to 48 percentage points over a no-TDD baseline. The central finding is that the optimal protocol depends on the model's generation style: models that build applications holistically benefit most from agentic enforcement, while models that extend code conservatively benefit from incremental enforcement. Mismatching the protocol to the generation style eliminates the TDD benefit entirely and multiplies token cost by up to 25 times. A user study confirms that TDDev reduces manual developer intervention to zero, shifting the workload from continuous prompt engineering to autonomous, feedback-driven refinement.
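To make the three-stage closed loop concrete, the following is a minimal, self-contained sketch and not TDDev's actual implementation. All names (`AcceptanceTest`, `RepairReport`, `derive_tests`, `validate`, `toy_agent`, `tdd_loop`) are hypothetical. The "application" is a plain dictionary, the "browser check" is a simple membership test, and the "coding agent" is a toy that fixes one reported failure per round. A real system would deploy the application and drive a real browser.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

App = Dict[str, Set[str]]  # toy stand-in for a deployed web application


@dataclass
class AcceptanceTest:
    name: str
    check: Callable[[App], bool]  # stands in for a simulated browser interaction


@dataclass
class RepairReport:
    failed: List[str]  # names of acceptance tests that failed this round


def derive_tests(requirements: List[str]) -> List[AcceptanceTest]:
    # Stage 1 (stub): one acceptance test per requirement, created before any code exists.
    return [
        AcceptanceTest(name=req, check=lambda app, req=req: req in app["features"])
        for req in requirements
    ]


def validate(app: App, tests: List[AcceptanceTest]) -> RepairReport:
    # Stages 2 and 3 (stub): exercise the app and turn failures into a structured report.
    return RepairReport(failed=[t.name for t in tests if not t.check(app)])


def toy_agent(app: App, report: RepairReport) -> App:
    # Hypothetical coding agent: repairs only the first reported failure per round.
    if report.failed:
        app["features"].add(report.failed[0])
    return app


def tdd_loop(requirements: List[str], max_rounds: int = 10) -> Tuple[App, int, RepairReport]:
    tests = derive_tests(requirements)
    app: App = {"features": set()}
    rounds = 0
    report = validate(app, tests)
    # Repeat repair and re-validation until all tests pass or the budget runs out.
    while report.failed and rounds < max_rounds:
        app = toy_agent(app, report)
        rounds += 1
        report = validate(app, tests)
    return app, rounds, report


if __name__ == "__main__":
    app, rounds, report = tdd_loop(["login", "add to cart", "checkout"])
    print(rounds, report.failed)
```

The point of the sketch is the control flow: tests are fixed up front, and the agent only ever sees structured failure reports rather than raw source or terminal output. The `max_rounds` budget is an assumption added here to keep the loop bounded.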