Artifact-Bench: Evaluating MLLMs on Artifact Detection and Assessment in AI-Generated Videos

Abstract

Recent video generative models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and analysis. We first establish a three-level hierarchical taxonomy of realism artifacts, covering photorealistic, animated, and CG-style videos. Based on this taxonomy, Artifact-Bench defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models approaching random or even below-random performance in challenging settings. We further observe significant misalignment between MLLM judgments and human perceptual preferences, highlighting their limited reliability as general evaluators for AI-generated video realism.
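The "random or even below-random" comparison can be made concrete with a small scoring sketch. The abstract does not state which metrics or answer formats the benchmark uses, so this sketch assumes plain accuracy on the two tasks with an obvious two-way choice (real vs. AI-generated classification, and pairwise realism comparison scored against human preference). The function names, label strings, and toy data below are hypothetical and are not taken from Artifact-Bench.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def margin_over_chance(acc: float, num_choices: int) -> float:
    """Accuracy minus the uniform-guessing baseline (1 / num_choices).

    A negative value means the model did worse than random guessing.
    """
    return acc - 1.0 / num_choices


# Hypothetical toy data (assumption): four videos for the real-vs-AI task.
cls_labels = ["real", "ai", "ai", "real"]
cls_preds = ["real", "real", "ai", "real"]

# Hypothetical toy data (assumption): four video pairs, where the reference
# is the video that human annotators judged more realistic ("A" or "B").
pair_human = ["A", "B", "A", "B"]
pair_model = ["B", "B", "B", "A"]

cls_acc = accuracy(cls_preds, cls_labels)
pair_acc = accuracy(pair_model, pair_human)

print(f"classification accuracy: {cls_acc:.2f}, "
      f"margin over chance: {margin_over_chance(cls_acc, 2):+.2f}")
print(f"pairwise agreement with humans: {pair_acc:.2f}, "
      f"margin over chance: {margin_over_chance(pair_acc, 2):+.2f}")
```

In this toy data the pairwise agreement falls below the 0.5 guessing baseline, which is the kind of below-random outcome the abstract describes. Fine-grained artifact identification is left out because its answer format, and therefore its chance baseline, is not given here.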