ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
July 7, 2025
作者: Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan, Pengyu Guo, Xianshu Pang, Yang Ruan, Zhifeng Zhang, Zhonghu Wang, Ziyan Xu, Zuopu Yin, Wiggin Zhou, Chayse Zhou, Fengzong Lian
cs.AI
Abstract
The generative capabilities of Large Language Models (LLMs) are rapidly
expanding from static code to dynamic, interactive visual artifacts. This
progress is bottlenecked by a critical evaluation gap: established benchmarks
focus on algorithmic correctness and are blind to the visual fidelity and
interactive integrity that define modern user experiences. To bridge this gap,
we introduce ArtifactsBench, a new benchmark and paradigm for the automated,
multimodal evaluation of visual code generation. Our framework programmatically
renders each generated artifact and captures its dynamic behavior through
temporal screenshots. This visual evidence, alongside the source code, is then
assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a
fine-grained, per-task checklist to ensure holistic and reproducible scoring.
We construct a new benchmark of 1,825 diverse tasks and evaluate over 30
leading LLMs. Our automated evaluation achieves a striking 94.4% ranking
consistency with WebDev Arena, the gold standard for human preference in web
development, and over 90% pairwise agreement with human experts. This
establishes ArtifactsBench as the first framework to reliably automate the
assessment of human-perceived quality at scale. Our analysis provides a
high-resolution map of the current SOTA, revealing that generalist models often
outperform domain-specific ones. We open-source ArtifactsBench, including the
benchmark, evaluation harness, and baseline results at
https://artifactsbenchmark.github.io/, to provide the community with a scalable
and accurate tool to accelerate the development of user-centric generative
models.
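
To make the evaluation paradigm concrete, the sketch below illustrates the two mechanical steps the abstract describes: programmatically rendering a generated artifact while capturing temporal screenshots of its dynamic behavior, and assembling a fine-grained, per-task checklist into a prompt for an MLLM judge. This is an illustrative sketch under stated assumptions, not the released harness (available at the project page): Playwright is assumed as the headless renderer, and the actual multimodal judge call is left to whichever model API is available.

```python
"""Minimal sketch of an ArtifactsBench-style evaluation loop (assumptions noted inline)."""
from pathlib import Path
from playwright.sync_api import sync_playwright  # assumed renderer choice


def capture_temporal_screenshots(html: str, out_dir: Path,
                                 n_frames: int = 3,
                                 interval_ms: int = 1000) -> list[Path]:
    """Render the artifact headlessly and save screenshots at fixed intervals."""
    out_dir.mkdir(parents=True, exist_ok=True)
    frames: list[Path] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.set_content(html, wait_until="networkidle")
        for i in range(n_frames):
            page.wait_for_timeout(interval_ms)  # let animations and handlers advance
            frame = out_dir / f"frame_{i}.png"
            page.screenshot(path=str(frame))
            frames.append(frame)
        browser.close()
    return frames


def build_judge_prompt(source_code: str, checklist: list[str]) -> str:
    """Assemble a checklist-guided judging prompt; the screenshots captured above
    would be attached as images when calling the multimodal judge (call omitted)."""
    items = "\n".join(f"- {item}" for item in checklist)
    return (
        "You are judging a generated visual artifact. Score each checklist item "
        "from 0 to 10 using both the source code and the attached temporal "
        "screenshots.\n\n"
        f"Checklist:\n{items}\n\nSource code:\n{source_code}"
    )
```

A usage example would pass a model-generated HTML string and the task's checklist, then send the returned prompt together with the captured frames to the chosen MLLM judge; per-item scores could then be aggregated into a benchmark score.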