Artifact-Bench: Evaluating Multimodal Large Language Models on the Detection and Assessment of Artifacts in AI-Generated Videos

Abstract

Recent video generation models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistency, structural distortion, and semantic incoherence. Although Multimodal Large Language Models (MLLMs) show strong visual understanding, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on artifact detection and analysis in AI-generated videos. We first establish a three-level hierarchical taxonomy of realism artifacts that covers photorealistic, animated, and CG-style videos. Based on this taxonomy, Artifact-Bench defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models performing near or even below random chance in challenging settings. We further observe significant misalignment between MLLM judgments and human perceptual preferences, which highlights their limited reliability as general-purpose evaluators of realism in AI-generated videos.