
ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

July 7, 2025
作者: Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan, Pengyu Guo, Xianshu Pang, Yang Ruan, Zhifeng Zhang, Zhonghu Wang, Ziyan Xu, Zuopu Yin, Wiggin Zhou, Chayse Zhou, Fengzong Lian
cs.AI

Abstract

The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://artifactsbenchmark.github.io/, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.
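The abstract outlines a three-step pipeline: render the generated artifact, capture temporal screenshots of its dynamic behavior, and have an MLLM-as-Judge score it against a per-task checklist. The sketch below illustrates that flow under simplifying assumptions; it assumes the artifact is a self-contained HTML file rendered with Playwright, and the `judge_artifact` step and checklist format are hypothetical placeholders, not the released evaluation harness.

```python
# Minimal sketch of the render -> capture -> judge flow described in the abstract.
# Assumes the artifact is a self-contained HTML file; the MLLM-judge step and
# checklist format are hypothetical placeholders, not the official harness.
from pathlib import Path
from playwright.sync_api import sync_playwright


def capture_temporal_screenshots(html_path: str, out_dir: str,
                                 n_frames: int = 3, interval_ms: int = 1000) -> list[str]:
    """Render the artifact and take screenshots at fixed intervals to capture dynamic behavior."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    frames = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(Path(html_path).resolve().as_uri())
        for i in range(n_frames):
            page.wait_for_timeout(interval_ms)      # let animations/interactions advance
            frame_path = out / f"frame_{i}.png"
            page.screenshot(path=str(frame_path))
            frames.append(str(frame_path))
        browser.close()
    return frames


def judge_artifact(source_code: str, frames: list[str], checklist: list[str]) -> dict:
    """Hypothetical MLLM-as-Judge step: score the artifact against a per-task checklist."""
    prompt = (
        "You are judging a generated visual artifact. Using the source code and the "
        "time-ordered screenshots as evidence, score each checklist item from 0 to 10:\n"
        + "\n".join(f"- {item}" for item in checklist)
    )
    # Placeholder: send `prompt`, `source_code`, and the images in `frames` to a
    # multimodal LLM of your choice and parse its per-item scores into a dict.
    raise NotImplementedError("Wire this call to an MLLM provider.")
```

The design point the abstract emphasizes is that a single static screenshot is not enough: temporal frames are what let the judge assess interactive integrity, and the per-task checklist is what keeps the MLLM's scoring holistic and reproducible.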