Artifact-Bench: Evaluating MLLMs on Artifact Detection and Assessment in AI-Generated Videos

Abstract

Recent video generative models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and analysis. We first establish a three-level hierarchical taxonomy of realism artifacts, covering photorealistic, animated, and CG-style videos. Based on this taxonomy, Artifact-Bench defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models approaching random or even below-random performance in challenging settings. We further observe significant misalignment between MLLM judgments and human perceptual preferences, highlighting their limited reliability as general evaluators for AI-generated video realism.