Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Abstract

Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development that spans three levels: static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks. The results reveal substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.
