WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics

Abstract

Existing web-generation benchmarks rely on text prompts or static screenshots as input. However, videos naturally convey richer signals such as interaction flow, transition timing, and motion continuity, which are essential for faithful webpage recreation. Despite this potential, video-conditioned webpage generation remains largely unexplored, with no dedicated benchmark for this task. To fill this gap, we introduce WebVR, a benchmark that evaluates whether MLLMs can faithfully recreate webpages from demonstration videos. WebVR contains 175 webpages across diverse categories, all constructed through a controlled synthesis pipeline rather than web crawling, ensuring varied and realistic demonstrations without overlap with existing online pages. We also design a fine-grained, human-aligned visual rubric that evaluates the generated webpages across multiple dimensions. Experiments on 19 models reveal substantial gaps in recreating fine-grained style and motion quality, while the rubric-based automatic evaluation achieves 96% agreement with human preferences. We release the dataset, evaluation toolkit, and baseline results to support future research on video-to-webpage generation.
