
IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?

September 29, 2025
作者: Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, Jianbiao Mei, Rong Wu, Licheng Wen, Xuemeng Yang, Song Mao, Qunshu Lin, Zhi Yu, Yongliang Shen, Yu Qiao, Botian Shi
cs.AI

Abstract

The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available. Code is available at https://github.com/L-O-I/IWR-Bench.
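The abstract reports an overall score alongside a functional-correctness metric (IFS) and a visual-fidelity metric (VFS), averaged over per-task judgments from the agent-as-a-judge framework. A minimal sketch of how such an aggregation could work is below; the data structure, metric definitions, and equal weighting are illustrative assumptions, not the paper's actual implementation (the official evaluation code is in the linked repository).

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Judgments for one reconstructed webpage (hypothetical schema)."""
    action_passed: list[bool]   # one verdict per replayed user action
    visual_similarity: float    # 0..1 screenshot similarity to the original page

def interaction_functionality_score(result: TaskResult) -> float:
    """IFS (assumed): fraction of recorded actions the rebuilt page handles correctly."""
    if not result.action_passed:
        return 0.0
    return sum(result.action_passed) / len(result.action_passed)

def visual_fidelity_score(result: TaskResult) -> float:
    """VFS (assumed): clamped visual similarity of the rebuilt page."""
    return max(0.0, min(1.0, result.visual_similarity))

def overall_score(results: list[TaskResult], w_ifs: float = 0.5, w_vfs: float = 0.5) -> float:
    """Average a weighted IFS/VFS blend across tasks (weights are assumptions)."""
    if not results:
        return 0.0
    per_task = [
        w_ifs * interaction_functionality_score(r) + w_vfs * visual_fidelity_score(r)
        for r in results
    ]
    return sum(per_task) / len(per_task)
```

Whatever the exact formula, the reported gap (24.39% IFS vs. 64.25% VFS for the best model) shows that the functional term, not the visual one, dominates the difficulty of the benchmark.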