InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

Abstract

With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent-based, project-level code synthesis. Existing benchmarks rely on idealized assumptions, notably well-structured, information-rich inputs and static execution settings. Real-world development, in contrast, is constrained by a critical bottleneck: the semantic misalignment between ambiguous, low-quality instructions from non-expert users and the model's understanding of them, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations, grounded in requirements engineering defect taxonomies, to systematically simulate diverse user behaviors, including ambiguity, redundancy, and contradiction. We develop an interactive execution environment for agents with a unified action space comprising Clarify, Implement, Verify, and Submit, enabling iterative intent refinement, code synthesis, and validation based on visual feedback. Extensive experiments and analysis reveal that frontier MLLM-based agents remain trapped in blind execution, exposing limitations in intent recognition and adaptive interaction.
