
WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

June 2, 2026
Authors: Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang, Ruining Hu, Yubin Wang, Shouwei Ruan, Bin Wang, Yuxiang Zhang, Yujiu Yang
cs.AI

Abstract

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.
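
To make the idea of an Interaction Contract Graph more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy model in which a page is a plain dictionary, an artifact is a function that maps a page and a user intent to a new page, and transition validity is the fraction of transitions whose target-state assertions hold after the intent is applied. The names `Transition`, `ICG`, `transition_validity`, `buggy_modal`, and `fixtures` are all hypothetical; the abstract does not specify the actual data model, scoring formula, or browser driver.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for DOM/visual observations of a rendered page.
Page = Dict[str, object]
Check = Callable[[Page], bool]


@dataclass
class Transition:
    source: str   # state the user starts from
    intent: str   # user intent, e.g. "click_open" (illustrative label)
    target: str   # state the requirement expects afterwards


@dataclass
class ICG:
    # state name -> assertions that must all hold in that state
    states: Dict[str, List[Check]]
    transitions: List[Transition] = field(default_factory=list)


def transition_validity(graph: ICG,
                        artifact: Callable[[Page, str], Page],
                        fixtures: Dict[str, Page]) -> float:
    """Fraction of transitions whose target assertions hold (assumed metric)."""
    if not graph.transitions:
        return 0.0
    valid = 0
    for t in graph.transitions:
        # Start from a copy of a page known to be in the source state.
        page = artifact(dict(fixtures[t.source]), t.intent)
        if all(check(page) for check in graph.states[t.target]):
            valid += 1
    return valid / len(graph.transitions)


def buggy_modal(page: Page, intent: str) -> Page:
    """Toy artifact: handles the two buttons but ignores the Escape key."""
    page = dict(page)
    if intent == "click_open":
        page["modal_open"] = True
    elif intent == "click_close":
        page["modal_open"] = False
    return page


graph = ICG(
    states={
        "closed": [lambda p: p["modal_open"] is False],
        "open": [lambda p: p["modal_open"] is True],
    },
    transitions=[
        Transition("closed", "click_open", "open"),
        Transition("open", "click_close", "closed"),
        Transition("open", "press_escape", "closed"),  # fails for buggy_modal
    ],
)
fixtures = {"closed": {"modal_open": False}, "open": {"modal_open": True}}

score = transition_validity(graph, buggy_modal, fixtures)
print(f"transition validity: {score:.2f}")  # 2 of 3 transitions hold
```

In this toy setting the Escape-key transition plays the role of a constraint the user never stated, and the page could render perfectly while still failing it. A real harness would replace the dictionary and `artifact` function with a browser session and DOM or screenshot assertions.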