WebVR: A Benchmark for Evaluating Multimodal Large Language Models on Video-to-Webpage Recreation with a Human-Aligned Visual Rubric

Abstract

Existing web-generation benchmarks rely primarily on text prompts or static screenshots as input. Videos, however, naturally convey richer signals such as interaction flow, transition timing, and motion continuity, all of which are essential for faithful webpage recreation. Despite this potential, video-conditioned webpage generation remains largely unexplored, and no dedicated benchmark exists for the task. To fill this gap, we introduce WebVR, a benchmark that evaluates whether multimodal large language models (MLLMs) can faithfully recreate webpages from demonstration videos. WebVR contains 175 webpages across diverse categories, all constructed through a controlled synthesis pipeline rather than web crawling, which yields varied and realistic demonstrations that do not overlap with existing online pages. We also design a fine-grained, human-aligned visual rubric that scores generated webpages along multiple dimensions. Experiments on 19 models reveal substantial gaps in recreating fine-grained style and motion quality, while the rubric-based automatic evaluation achieves 96% agreement with human preferences. We release the dataset, evaluation toolkit, and baseline results to support future research on video-to-webpage generation.
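
The abstract does not state how the 96% agreement figure is computed. As a minimal sketch of one plausible reading, assume agreement is measured over pairs of generated pages: for each pair, the rubric assigns a score to each page, a human annotator picks the page they prefer, and the pair counts as agreeing when the higher-scoring page is the one the human chose. The function name `pairwise_agreement`, the input layout, and the decision to skip rubric ties are all illustrative assumptions, not the WebVR toolkit's API or the authors' confirmed protocol.

```python
from typing import Sequence, Tuple


def pairwise_agreement(
    rubric_scores: Sequence[Tuple[float, float]],
    human_prefers_first: Sequence[bool],
) -> float:
    """Fraction of non-tied pairs where the rubric ranking matches the human choice.

    rubric_scores[i] holds the (hypothetical) rubric scores of page A and page B
    in pair i; human_prefers_first[i] is True when the annotator preferred page A.
    """
    if len(rubric_scores) != len(human_prefers_first):
        raise ValueError("inputs must have the same length")

    agree = 0
    counted = 0
    for (score_a, score_b), prefers_a in zip(rubric_scores, human_prefers_first):
        if score_a == score_b:
            # Assumption: rubric ties express no preference and are skipped.
            continue
        counted += 1
        if (score_a > score_b) == prefers_a:
            agree += 1

    if counted == 0:
        raise ValueError("no non-tied pairs to compare")
    return agree / counted
```

Under this reading, a result of 0.96 would mean the rubric and the annotators pick the same page in 96 of every 100 non-tied pairs. Other definitions, such as counting ties as half agreements or using a rank correlation, would yield different numbers from the same annotations.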