WebRISE: Requirement-Induced State Evaluation of MLLM-Generated Web Artifacts

Abstract

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows that ICG-based scoring detects state errors at 2 to 16 times the rate of checkpoint-style evaluation.
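To make the ICG idea concrete, the following is a minimal sketch, not the WebRISE implementation. It assumes that an ICG can be modeled as named states carrying assertion predicates plus intent-labeled transitions, and that transition validity is the percentage of transitions whose target-state assertions hold after the action is performed from the source state. All names (`FakePage`, `Transition`, `ICG`, `transition_validity`) are hypothetical, and a toy in-memory page stands in for a real browser.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class FakePage:
    """Toy stand-in for a generated page; its close button is deliberately broken."""

    def __init__(self) -> None:
        self.modal_open = False

    def click(self, target: str) -> None:
        if target == "open-button":
            self.modal_open = True
        # Injected defect: clicks on "close-button" are ignored.


Check = Callable[[FakePage], bool]
Step = Callable[[FakePage], None]


@dataclass
class Transition:
    source: str   # name of the state the user starts in
    intent: str   # user intent, phrased independently of the implementation
    target: str   # name of the state the page should end in
    setup: Step   # drives a fresh page into the source state
    action: Step  # performs the user intent


@dataclass
class ICG:
    # Each observable state maps to assertions that must all hold in it.
    states: Dict[str, List[Check]]
    transitions: List[Transition] = field(default_factory=list)

    def holds(self, state: str, page: FakePage) -> bool:
        return all(check(page) for check in self.states[state])


def transition_validity(icg: ICG, make_page: Callable[[], FakePage]) -> float:
    """Percentage of transitions that start in their source state and end in their target state."""
    valid = 0
    for t in icg.transitions:
        page = make_page()
        t.setup(page)
        if not icg.holds(t.source, page):
            continue  # source state unreachable, so the transition counts as invalid
        t.action(page)
        if icg.holds(t.target, page):
            valid += 1
    return 100.0 * valid / len(icg.transitions)


icg = ICG(
    states={
        "closed": [lambda p: not p.modal_open],
        "open": [lambda p: p.modal_open],
    },
    transitions=[
        Transition("closed", "open the dialog", "open",
                   setup=lambda p: None,
                   action=lambda p: p.click("open-button")),
        Transition("open", "dismiss the dialog", "closed",
                   setup=lambda p: p.click("open-button"),
                   action=lambda p: p.click("close-button")),
    ],
)

score = transition_validity(icg, FakePage)
print(f"transition validity: {score:.1f}%")
```

In this toy case the page opens the dialog correctly but never closes it, so one of the two transitions fails. A check that only inspects the page after the first click would not see the defect, which is the kind of state error the graph-based view is meant to surface.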