WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
April 20, 2026
Authors: Xinping Lei, Xinyu Che, Junqi Xiong, Chenchen Zhang, Yukai Huang, Chenyu Zhou, Haoyang Huang, Minghao Liu, Letian Zhu, Hongyi Ye, Jinhua Hao, Ken Deng, Zizheng Zhan, Han Li, Dailin Li, Yifan Yao, Ming Sun, Zhaoxiang Zhang, Jiaheng Liu
cs.AI
Abstract
Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web coding is an iterative cycle of generation, editing, and repair, WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, we curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated at Easy/Medium/Hard levels. For evaluation, we adopt a checklist-guided LLM-as-a-Judge protocol for editing and repair, and propose a novel Agent-as-a-Judge paradigm for generation that autonomously executes generated websites in a real browser, explores interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. We evaluate representative closed-source and open-source models and observe that: (1) closed-source models remain substantially stronger and more balanced; (2) editing and repair exhibit distinct difficulty profiles, with repair preserving interactivity better but remaining execution-challenging; (3) aesthetics is the most persistent bottleneck, especially for open-source models; and (4) framework choice materially affects outcomes, with Vue consistently challenging while React and Vanilla/HTML perform more strongly depending on task type.
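The checklist-guided judging protocol mentioned above can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the paper: the item names, the per-item weights, and the aggregation rule are illustrative assumptions; in the actual benchmark, per-item verdicts would come from an LLM judge rather than a hard-coded list.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    description: str  # what the judge model is asked to verify
    weight: float     # hypothetical relative importance of the item

def score_submission(items: list[ChecklistItem], verdicts: list[bool]) -> float:
    """Aggregate per-item pass/fail verdicts into a weighted score in [0, 1].

    `verdicts[i]` is the judge's pass/fail decision for `items[i]`.
    """
    total = sum(item.weight for item in items)
    passed = sum(item.weight for item, ok in zip(items, verdicts) if ok)
    return passed / total if total else 0.0

# Illustrative checklist for an editing task (items are made up).
checklist = [
    ChecklistItem("Button color changed to the requested value", 1.0),
    ChecklistItem("Surrounding layout left unchanged", 2.0),
    ChecklistItem("No console errors after the edit", 1.0),
]
print(score_submission(checklist, [True, True, False]))  # 0.75
```

The weighted aggregation is one plausible design choice; an unweighted pass rate is the `weight=1.0` special case.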