ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
July 7, 2025
作者: Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan, Pengyu Guo, Xianshu Pang, Yang Ruan, Zhifeng Zhang, Zhonghu Wang, Ziyan Xu, Zuopu Yin, Wiggin Zhou, Chayse Zhou, Fengzong Lian
cs.AI
Abstract
The generative capabilities of Large Language Models (LLMs) are rapidly
expanding from static code to dynamic, interactive visual artifacts. This
progress is bottlenecked by a critical evaluation gap: established benchmarks
focus on algorithmic correctness and are blind to the visual fidelity and
interactive integrity that define modern user experiences. To bridge this gap,
we introduce ArtifactsBench, a new benchmark and paradigm for the automated,
multimodal evaluation of visual code generation. Our framework programmatically
renders each generated artifact and captures its dynamic behavior through
temporal screenshots. This visual evidence, alongside the source code, is then
assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a
fine-grained, per-task checklist to ensure holistic and reproducible scoring.
We construct a new benchmark of 1,825 diverse tasks and evaluate over 30
leading LLMs. Our automated evaluation achieves a striking 94.4% ranking
consistency with WebDev Arena, the gold standard for human preference in web
development, and over 90% pairwise agreement with human experts. This
establishes ArtifactsBench as the first framework to reliably automate the
assessment of human-perceived quality at scale. Our analysis provides a
high-resolution map of the current SOTA, revealing that generalist models often
outperform domain-specific ones. We open-source ArtifactsBench, including the
benchmark, evaluation harness, and baseline results at
https://artifactsbenchmark.github.io/, to provide the community with a scalable
and accurate tool to accelerate the development of user-centric generative
models.
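
To make the evaluation paradigm concrete, the sketch below illustrates the two mechanical steps the abstract describes: programmatically rendering a generated artifact while capturing temporal screenshots of its dynamic behavior, and assembling a fine-grained, per-task checklist into a prompt for an MLLM judge. This is an illustrative sketch under stated assumptions, not the released harness (available at the project page): Playwright is assumed as the headless renderer, and the actual multimodal judge call is left to whichever model API is available.

```python
"""Minimal sketch of an ArtifactsBench-style evaluation loop (assumptions noted inline)."""
from pathlib import Path
from playwright.sync_api import sync_playwright  # assumed renderer choice


def capture_temporal_screenshots(html: str, out_dir: Path,
                                 n_frames: int = 3,
                                 interval_ms: int = 1000) -> list[Path]:
    """Render the artifact headlessly and save screenshots at fixed intervals."""
    out_dir.mkdir(parents=True, exist_ok=True)
    frames: list[Path] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.set_content(html, wait_until="networkidle")
        for i in range(n_frames):
            page.wait_for_timeout(interval_ms)  # let animations and handlers advance
            frame = out_dir / f"frame_{i}.png"
            page.screenshot(path=str(frame))
            frames.append(frame)
        browser.close()
    return frames


def build_judge_prompt(source_code: str, checklist: list[str]) -> str:
    """Assemble a checklist-guided judging prompt; the screenshots captured above
    would be attached as images when calling the multimodal judge (call omitted)."""
    items = "\n".join(f"- {item}" for item in checklist)
    return (
        "You are judging a generated visual artifact. Score each checklist item "
        "from 0 to 10 using both the source code and the attached temporal "
        "screenshots.\n\n"
        f"Checklist:\n{items}\n\nSource code:\n{source_code}"
    )
```

A usage example would pass a model-generated HTML string and the task's checklist, then send the returned prompt together with the captured frames to the chosen MLLM judge; per-item scores could then be aggregated into a benchmark score.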