WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

June 2, 2026
Authors: Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang, Ruining Hu, Yubin Wang, Shouwei Ruan, Bin Wang, Yuxiang Zhang, Yujiu Yang
cs.AI

Abstract

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.
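
To make the idea of an Interaction Contract Graph more concrete, the following is a minimal sketch of how states, user-intent transitions, and assertions might be represented and scored. It is not the WebRISE implementation: the names `Transition`, `score`, and `fake_execute`, the dictionary stand-in for a page snapshot, and the exact definitions of the two metrics (fraction of transitions whose assertions all pass, and fraction of requirements whose transitions all pass) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for an observed page snapshot (element id -> text).
Page = Dict[str, str]


@dataclass
class Transition:
    source: str        # observable state before the action
    action: str        # user intent, e.g. "click add"
    target: str        # observable state expected afterwards
    requirement: str   # requirement this transition verifies
    implicit: bool = False  # True for unstated product-level constraints
    assertions: List[Callable[[Page], bool]] = field(default_factory=list)


def score(
    transitions: List[Transition],
    execute: Callable[[Transition], Optional[Page]],
) -> Tuple[float, float]:
    """Return (transition validity, requirement coverage) under the
    assumed definitions described in the lead-in."""
    if not transitions:
        return 0.0, 0.0
    passed: List[bool] = []
    for t in transitions:
        page = execute(t)  # None means the action could not be performed
        ok = page is not None and all(check(page) for check in t.assertions)
        passed.append(ok)
    validity = sum(passed) / len(transitions)

    # A requirement counts as covered only if all its transitions pass.
    by_req: Dict[str, bool] = {}
    for t, ok in zip(transitions, passed):
        by_req[t.requirement] = by_req.get(t.requirement, True) and ok
    coverage = sum(by_req.values()) / len(by_req)
    return validity, coverage


# Toy graph for a shopping cart; the third edge is an implicit constraint.
ICG = [
    Transition("empty", "click add", "cart_1", "R1: add item",
               assertions=[lambda p: p.get("cart-count") == "1"]),
    Transition("cart_1", "click remove", "empty", "R2: remove item",
               assertions=[lambda p: p.get("cart-count") == "0"]),
    Transition("cart_1", "reload", "cart_1", "R3: cart persists", implicit=True,
               assertions=[lambda p: p.get("cart-count") == "1"]),
]

# Simulated page under test: the cart is lost on reload (a state error).
FAKE_PAGES = {
    "click add": {"cart-count": "1"},
    "click remove": {"cart-count": "0"},
    "reload": {"cart-count": "0"},
}


def fake_execute(t: Transition) -> Optional[Page]:
    return FAKE_PAGES.get(t.action)


validity, coverage = score(ICG, fake_execute)
```

In this toy run the page renders a correct cart count after each explicit action but fails the implicit persistence check, so two of the three transitions pass. A real harness would replace `fake_execute` with a browser driver and the lambdas with DOM or visual checks.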