Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

December 17, 2025
Authors: Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu
cs.AI

Abstract

The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.