Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Abstract

Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development that spans three levels: static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks. The results reveal substantial performance gaps at all task levels, and even state-of-the-art models still struggle with full-stack development.
