Artifact-Bench: A Benchmark for Detecting and Evaluating AI-Generated Video Artifacts with Multimodal Large Language Models

Abstract

Recent video generation models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and analysis. We first establish a three-level hierarchical taxonomy of realism artifacts, covering photorealistic, animated, and CG-style videos. Based on this taxonomy, Artifact-Bench defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models performing at or even below chance level in challenging settings. We further observe significant misalignment between MLLM judgments and human perceptual preferences, highlighting their limited reliability as general-purpose evaluators of AI-generated video realism.