WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

Abstract

Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding. Existing benchmarks, however, evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web coding is an iterative cycle of generation, editing, and repair, WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, we curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated with an Easy, Medium, or Hard difficulty level. For evaluation, we adopt a checklist-guided LLM-as-a-Judge protocol for editing and repair, and propose a novel Agent-as-a-Judge paradigm for generation. The latter autonomously executes generated websites in a real browser, explores interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. We evaluate representative closed-source and open-source models and observe that: (1) closed-source models remain substantially stronger and more balanced; (2) editing and repair exhibit distinct difficulty profiles, with repair preserving interactivity better but remaining challenging in execution; (3) aesthetics is the most persistent bottleneck, especially for open-source models; and (4) framework choice materially affects outcomes: Vue is consistently challenging, while React and Vanilla/HTML perform more strongly depending on task type.
