
WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

April 20, 2026
Authors: Xinping Lei, Xinyu Che, Junqi Xiong, Chenchen Zhang, Yukai Huang, Chenyu Zhou, Haoyang Huang, Minghao Liu, Letian Zhu, Hongyi Ye, Jinhua Hao, Ken Deng, Zizheng Zhan, Han Li, Dailin Li, Yifan Yao, Ming Sun, Zhaoxiang Zhang, Jiaheng Liu
cs.AI

Abstract

Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web coding is an iterative cycle of generation, editing, and repair, WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, we curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated at Easy/Medium/Hard levels. For evaluation, we adopt a checklist-guided LLM-as-a-Judge protocol for editing and repair, and propose a novel Agent-as-a-Judge paradigm for generation that autonomously executes generated websites in a real browser, explores interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. We evaluate representative closed-source and open-source models and observe that: (1) closed-source models remain substantially stronger and more balanced; (2) editing and repair exhibit distinct difficulty profiles, with repair preserving interactivity better but remaining execution-challenging; (3) aesthetics is the most persistent bottleneck, especially for open-source models; and (4) framework choice materially affects outcomes, with Vue consistently challenging while React and Vanilla/HTML perform more strongly depending on task type.
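The Agent-as-a-Judge loop described above (execute the generated site, explore its interactive behaviors, synthesize targeted test cases, aggregate a verdict) can be sketched in simplified form. This is a minimal structural illustration, not the paper's implementation: the real system drives an actual browser via MCP, whereas here the site, the `explore` step, and the `synthesize_test` step are stubs, and the pass-rate scoring rule is an assumed placeholder.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    description: str
    passed: bool

def explore(site: dict) -> list[str]:
    # Stand-in for MCP-driven browser exploration: enumerate the
    # interactive elements the agent would discover on the page.
    return site.get("interactive_elements", [])

def synthesize_test(element: str, site: dict) -> TestCase:
    # Stand-in for LLM-driven test synthesis: here we merely check
    # that the site wires a handler to the element it exposes.
    ok = element in site.get("handlers", {})
    return TestCase(f"interacting with '{element}' triggers its handler", ok)

def agent_as_a_judge(site: dict) -> float:
    # Run one explore-then-test round and score by pass rate
    # (an illustrative metric, not the paper's actual rubric).
    cases = [synthesize_test(el, site) for el in explore(site)]
    if not cases:
        return 0.0
    return sum(t.passed for t in cases) / len(cases)

# Hypothetical generated site: two interactive elements, one wired up.
demo_site = {
    "interactive_elements": ["submit-button", "nav-menu"],
    "handlers": {"submit-button": "onSubmit"},
}
score = agent_as_a_judge(demo_site)  # 1 of 2 synthesized tests passes -> 0.5
```

In the actual benchmark the exploration and test-synthesis steps would be iterated, with each round conditioning on behaviors discovered so far, which is what lets the agent approximate human acceptance testing rather than a fixed checklist.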