WebRISE: Requirement-Induced State Evaluation of MLLM-Generated Web Artifacts

Abstract

Existing benchmarks for web artifacts generated by multimodal large language models (MLLMs) assess interaction through local evidence only, and so miss the requirement-induced states and transitions that determine whether a page actually works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) made up of observable states, user-intent transitions, and DOM/visual assertions, enabling implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 state transitions and 5,271 requirement checks that explicitly separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown input: V = 80.8 yet T = 15.5). Video provides the strongest interaction signal (+10.6 percentage points of implicit coverage over Text), but shortfalls on implicit constraints persist. Defect-injection experiments show that ICG-based scoring detects state errors at 2 to 16 times the rate of checkpoint-style evaluation.
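To make the idea of an Interaction Contract Graph concrete, the following is a minimal sketch of one possible representation. It is not the WebRISE implementation. The names `Transition`, `ICG`, `score`, and `fake_browser` are hypothetical, and the two metric definitions (the fraction of transitions whose assertion passes, and the fraction of requirements whose transitions all pass) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an observable page snapshot (e.g. extracted DOM values).
Page = Dict[str, str]


@dataclass
class Transition:
    source: str                          # state the user starts in
    action: str                          # user intent that triggers the transition
    target: str                          # state the requirement says should follow
    requirement: str                     # id of the requirement this transition checks
    assertion: Callable[[Page], bool]    # DOM/visual check on the resulting page


@dataclass
class ICG:
    transitions: List[Transition] = field(default_factory=list)


def score(icg: ICG, execute: Callable[[Transition], Page]) -> Tuple[float, float]:
    """Return (transition validity, requirement coverage) under the assumed definitions.

    `execute` is a caller-supplied function that performs the transition's action
    against the generated page and returns the observed page snapshot.
    """
    results = [(t, bool(t.assertion(execute(t)))) for t in icg.transitions]
    validity = sum(ok for _, ok in results) / len(results) if results else 0.0

    # Assumption: a requirement counts as covered only if all its transitions pass.
    per_req: Dict[str, bool] = {}
    for t, ok in results:
        per_req[t.requirement] = per_req.get(t.requirement, True) and ok
    coverage = sum(per_req.values()) / len(per_req) if per_req else 0.0
    return validity, coverage


def fake_browser(t: Transition) -> Page:
    # Toy stand-in for browser execution: "increment" works, "reset" is broken.
    pages = {"increment": {"counter": "1"}, "reset": {"counter": "1"}}
    return pages[t.action]


icg = ICG([
    Transition("zero", "increment", "one", "R1", lambda p: p["counter"] == "1"),
    Transition("one", "reset", "zero", "R2", lambda p: p["counter"] == "0"),
])
validity, coverage = score(icg, fake_browser)
```

In this toy counter page the broken reset leaves the page in the wrong state, so one of the two transitions and one of the two requirements fail. Because each assertion is tied to the state reached after an action rather than to a fixed checkpoint, the failure is attributed to the specific transition that broke.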