Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

March 27, 2026
Authors: Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang
cs.AI

Abstract

Recent advances in large language models have significantly improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development that spans three levels: static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks and find substantial performance gaps at all task levels, with even state-of-the-art models still struggling on full-stack development.