WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

Abstract

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), but implicit constraints persist. Defect injection shows that ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.
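To make the ICG idea concrete, the following is a minimal sketch of how states, user-intent transitions, and assertions might be represented and scored. It is an illustration under stated assumptions, not the WebRISE implementation: the abstract does not give the metric formulas, so transition validity and requirement coverage are assumed here to be simple pass percentages, and every name (`Transition`, `RequirementCheck`, `fake_execute`, the snapshot keys) is hypothetical. A dictionary stands in for the DOM/visual observations that a real browser run would produce.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for what a browser run would observe (DOM/visual facts).
Snapshot = Dict[str, object]


@dataclass
class Transition:
    source: str   # observable state before the action
    action: str   # user-intent label, not a concrete selector
    target: str   # observable state expected after the action
    assertions: List[Callable[[Snapshot], bool]]


@dataclass
class RequirementCheck:
    name: str
    kind: str     # "explicit" (user-stated) or "implicit" (product-level)
    check: Callable[[Snapshot], bool]


def transition_validity(transitions: List[Transition],
                        execute: Callable[[Transition], Snapshot]) -> float:
    """Assumed metric: percent of transitions whose assertions all hold."""
    if not transitions:
        return 0.0
    passed = 0
    for t in transitions:
        snapshot = execute(t)  # run the action once, then assert on the result
        if all(assertion(snapshot) for assertion in t.assertions):
            passed += 1
    return 100.0 * passed / len(transitions)


def requirement_coverage(checks: List[RequirementCheck], snapshot: Snapshot,
                         kind: Optional[str] = None) -> float:
    """Assumed metric: percent of (optionally filtered) checks that pass."""
    selected = [c for c in checks if kind is None or c.kind == kind]
    if not selected:
        return 0.0
    return 100.0 * sum(c.check(snapshot) for c in selected) / len(selected)


def fake_execute(t: Transition) -> Snapshot:
    # Toy page with a state bug: the modal opens, but closing leaves it visible.
    outcomes = {
        "open_modal": {"modal_visible": True},
        "close_modal": {"modal_visible": True},
    }
    return outcomes[t.action]


transitions = [
    Transition("idle", "open_modal", "modal_open",
               [lambda s: s["modal_visible"] is True]),
    Transition("modal_open", "close_modal", "idle",
               [lambda s: s["modal_visible"] is False]),
]
checks = [
    RequirementCheck("modal opens", "explicit",
                     lambda s: s.get("modal_visible") is True),
    RequirementCheck("focus stays inside modal", "implicit",
                     lambda s: s.get("focus_trapped") is True),
]

tv = transition_validity(transitions, fake_execute)
rc_all = requirement_coverage(checks, {"modal_visible": True})
rc_implicit = requirement_coverage(checks, {"modal_visible": True}, kind="implicit")
```

In this toy case the page looks correct after the first click, yet only one of the two transitions is valid and the implicit check fails. That is the kind of state error a single checkpoint on the opened modal would not reveal.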