Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

December 17, 2025
Authors: Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu
cs.AI

Abstract

The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.