InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

Abstract

With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent-based, project-level code synthesis. Existing benchmarks rely on idealized assumptions, notably well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between ambiguous, low-quality instructions from non-expert users and the model's understanding, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations, grounded in requirements engineering defect taxonomies, to systematically simulate diverse user behaviors, including ambiguity, redundancy, and contradiction. We develop an interactive execution environment for agents with a unified action space comprising Clarify, Implement, Verify, and Submit, which enables iterative intent refinement, code synthesis, and visual feedback-based validation. Extensive experiments and analysis reveal that frontier MLLM-based agents remain trapped in blind execution, exposing limitations in intent recognition and adaptive interaction.
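The four-action space described above can be illustrated with a minimal sketch. This is not the benchmark's actual interface: only the action names Clarify, Implement, Verify, and Submit come from the abstract, while `Episode`, `toy_policy`, `blind_policy`, and `run_episode` are hypothetical names introduced here, and the environment responses are placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    """The four actions named in the abstract."""
    CLARIFY = "Clarify"
    IMPLEMENT = "Implement"
    VERIFY = "Verify"
    SUBMIT = "Submit"


@dataclass
class Episode:
    """Hypothetical episode state; field names are illustrative only."""
    instruction: str
    clarifications: list = field(default_factory=list)
    code: str = ""
    verified: bool = False
    trace: list = field(default_factory=list)


def toy_policy(ep: Episode) -> Action:
    """Hand-written stand-in for an agent: clarify, implement, verify, submit."""
    if not ep.clarifications:
        return Action.CLARIFY
    if not ep.code:
        return Action.IMPLEMENT
    if not ep.verified:
        return Action.VERIFY
    return Action.SUBMIT


def blind_policy(ep: Episode) -> Action:
    """Blind execution: implement and submit without clarifying or verifying."""
    return Action.SUBMIT if ep.code else Action.IMPLEMENT


def run_episode(instruction: str, policy, max_steps: int = 8) -> Episode:
    """Step a policy through the action space until it submits or runs out of steps."""
    ep = Episode(instruction)
    for _ in range(max_steps):
        action = policy(ep)
        ep.trace.append(action)
        if action is Action.CLARIFY:
            # Placeholder: a real environment would query a simulated user agent.
            ep.clarifications.append("placeholder answer from a simulated user")
        elif action is Action.IMPLEMENT:
            # Placeholder: a real agent would synthesize project-level code here.
            ep.code = "<html><!-- placeholder site --></html>"
        elif action is Action.VERIFY:
            # Placeholder: a real environment would inspect visual feedback.
            ep.verified = True
        else:
            # SUBMIT ends the episode.
            break
    return ep
```

Under these assumptions, calling `run_episode("make me a shop page", blind_policy)` yields a trace with only Implement and Submit, which is the pattern the abstract calls blind execution, whereas `toy_policy` visits all four actions before submitting.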
