Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
March 27, 2026
Authors: Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang
cs.AI
Abstract
Recent advances in large language models have substantially improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development spanning static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent-verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. Evaluating multiple vision-language models instantiated under different coding-agent frameworks, we find substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.
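To make the dual-component evaluation concrete, the following is a minimal sketch of how a GUI agent verifier (pass/fail interaction test cases) and a VLM-based judge (visual similarity to prototype images) might be combined into one task score. All class names, function signatures, and the weighting scheme here are illustrative assumptions, not the paper's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical types; the benchmark's real interfaces are not specified in the abstract.

@dataclass
class TestCase:
    description: str
    # A GUI-agent check: drives the rendered site and returns pass/fail.
    run: Callable[[str], bool]

@dataclass
class Task:
    name: str
    site_url: str                 # URL of the candidate implementation
    prototype_images: List[str]   # reference screenshots of the target site
    test_cases: List[TestCase] = field(default_factory=list)

def gui_agent_verify(task: Task) -> float:
    """Fraction of interaction test cases the GUI agent verifier passes."""
    if not task.test_cases:
        return 1.0
    passed = sum(tc.run(task.site_url) for tc in task.test_cases)
    return passed / len(task.test_cases)

def vlm_judge(task: Task, score_pair: Callable[[str, str], float]) -> float:
    """Average VLM similarity score (in [0, 1]) between each prototype
    image and the corresponding rendered page; score_pair is a stand-in
    for a vision-language-model judging call."""
    scores = [score_pair(img, task.site_url) for img in task.prototype_images]
    return sum(scores) / len(scores) if scores else 1.0

def evaluate(task: Task, score_pair: Callable[[str, str], float],
             w_agent: float = 0.5) -> float:
    """Combine the two complementary signals into a single task score."""
    return w_agent * gui_agent_verify(task) + (1 - w_agent) * vlm_judge(task, score_pair)
```

In a real harness, `TestCase.run` would wrap a browser-driving agent and `score_pair` a VLM judging prompt; the equal weighting is purely a placeholder.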