InteractWeb-Bench: Can Multimodal Agents Escape Blind Execution in Interactive Website Generation?

Abstract

With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent-based, project-level code synthesis. Existing benchmarks rely on idealized assumptions, in particular well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between the ambiguous, low-quality instructions of non-expert users and the model's understanding of them, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert, low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations to systematically simulate diverse user behaviors, including ambiguity, redundancy, and contradiction, grounded in requirements engineering defect taxonomies. We develop an interactive execution environment for agents, featuring a unified action space comprising Clarify, Implement, Verify, and Submit, which enables iterative intent refinement, code synthesis, and visual feedback-based validation. Extensive experiments and analysis reveal that frontier MLLM-based agents remain trapped in blind execution, exposing limitations in intent recognition and adaptive interaction.
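To make the four-action space concrete, the following is a minimal Python sketch of how an agent trace over Clarify, Implement, Verify, and Submit might be recorded and checked. Only the four action names come from the abstract. The `Episode` class, its fields, and the `is_blind_execution` rule are hypothetical illustrations, not the benchmark's actual API or its definition of blind execution. The toy rule flags a run as blind when the agent submits without ever asking a clarifying question while ambiguities remain open.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Action(Enum):
    """The four actions named in the abstract."""
    CLARIFY = "clarify"
    IMPLEMENT = "implement"
    VERIFY = "verify"
    SUBMIT = "submit"


@dataclass
class Episode:
    """Hypothetical record of one agent run (not the benchmark's data model)."""
    instruction: str
    open_questions: List[str]                      # unresolved ambiguities
    trace: List[Action] = field(default_factory=list)

    def step(self, action: Action) -> None:
        """Log an action; assume each Clarify resolves one open question."""
        self.trace.append(action)
        if action is Action.CLARIFY and self.open_questions:
            self.open_questions.pop(0)


def is_blind_execution(episode: Episode) -> bool:
    """Toy criterion (an assumption): submitted, never clarified, ambiguity left."""
    return (
        Action.SUBMIT in episode.trace
        and Action.CLARIFY not in episode.trace
        and bool(episode.open_questions)
    )


# A contradictory instruction of the kind a non-expert user might give.
prompt = "Make me a shop page, modern but classic."

blind = Episode(prompt, ["Which style should take priority?"])
for a in (Action.IMPLEMENT, Action.SUBMIT):
    blind.step(a)

careful = Episode(prompt, ["Which style should take priority?"])
for a in (Action.CLARIFY, Action.IMPLEMENT, Action.VERIFY, Action.SUBMIT):
    careful.step(a)

print(is_blind_execution(blind), is_blind_execution(careful))  # True False
```

In this sketch the first run implements and submits straight away, so it is flagged, while the second asks before building and is not. A real evaluation would presumably score the generated website as well, which this illustration leaves out.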