WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics
March 11, 2026
Authors: Yuhong Dai, Yanlin Lai, Mitt Huang, Hangyu Guo, Dingming Li, Hongbo Peng, Haodong Li, Yingxiu Zhao, Haoran Lyu, Zheng Ge, Xiangyu Zhang, Daxin Jiang
cs.AI
Abstract
Existing web-generation benchmarks rely primarily on text prompts or static screenshots as input. However, videos naturally convey richer signals such as interaction flow, transition timing, and motion continuity, which are essential for faithful webpage recreation. Despite this potential, video-conditioned webpage generation remains largely unexplored, with no dedicated benchmark for this task. To fill this gap, we introduce WebVR, a benchmark that evaluates whether multimodal large language models (MLLMs) can faithfully recreate webpages from demonstration videos. WebVR contains 175 webpages across diverse categories, all constructed through a controlled synthesis pipeline rather than web crawling, ensuring varied and realistic demonstrations without overlap with existing online pages. We also design a fine-grained, human-aligned visual rubric that evaluates the generated webpages across multiple dimensions. Experiments on 19 models reveal substantial gaps in recreating fine-grained style and motion quality, while our rubric-based automatic evaluation achieves 96% agreement with human preferences. We release the dataset, evaluation toolkit, and baseline results to support future research on video-to-webpage generation.