

IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?

September 29, 2025
作者: Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, Jianbiao Mei, Rong Wu, Licheng Wen, Xuemeng Yang, Song Mao, Qunshu Lin, Zhi Yu, Yongliang Shen, Yu Qiao, Botian Shi
cs.AI

Abstract

The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks focus primarily on static screenshot-to-code tasks, overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, spanning 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practice, each task includes not only a user interaction video but also all crawled static assets (e.g., images, videos). The benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code are publicly available at https://github.com/L-O-I/IWR-Bench.
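The abstract reports two sub-scores per task (an interaction/functional score, IFS, and a visual fidelity score, VFS) that are combined into one overall benchmark number. The exact aggregation is not given in the abstract; the sketch below is a hypothetical illustration assuming a simple per-task weighted mean of the two sub-scores, averaged over all tasks. The class name `TaskResult`, the function `overall_score`, and the 50/50 weights are all assumptions for illustration, not the paper's actual formula.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Judge scores for one reconstructed webpage, on a 0-100 scale."""
    ifs: float  # functional correctness of the reconstructed interactions
    vfs: float  # visual fidelity to the original webpage


def overall_score(results: list[TaskResult],
                  w_ifs: float = 0.5,
                  w_vfs: float = 0.5) -> float:
    """Aggregate per-task sub-scores into a single benchmark score.

    Hypothetical aggregation: weight IFS and VFS within each task,
    then average across tasks. IWR-Bench's real weighting may differ.
    """
    per_task = [w_ifs * r.ifs + w_vfs * r.vfs for r in results]
    return sum(per_task) / len(per_task)


# Example: two tasks where visual fidelity outpaces functional correctness,
# mirroring the gap the paper reports between VFS and IFS.
scores = [TaskResult(ifs=20.0, vfs=60.0), TaskResult(ifs=40.0, vfs=80.0)]
print(overall_score(scores))  # 50.0
```

With equal weights, a model that renders pages faithfully but fails to wire up event handlers is still heavily penalized, which matches the benchmark's emphasis on interaction logic rather than appearance alone.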
PDF: September 30, 2025