

IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

June 1, 2025
Authors: Wayne Zhang, Changjiang Jiang, Zhonghao Zhang, Chenyang Si, Fengchang Yu, Wei Peng
cs.AI

Abstract

The rapid advancement of Artificial Intelligence Generated Content (AIGC) in visual domains has resulted in highly realistic synthetic images and videos, driven by sophisticated generative frameworks such as diffusion-based architectures. While these breakthroughs open substantial opportunities, they simultaneously raise critical concerns about content authenticity and integrity. Many current AIGC detection methods operate as black-box binary classifiers, which offer limited interpretability, and no approach supports detecting both images and videos in a unified framework. This dual limitation compromises model transparency, reduces trustworthiness, and hinders practical deployment. To address these challenges, we introduce IVY-FAKE, a novel, unified, and large-scale dataset specifically designed for explainable multimodal AIGC detection. Unlike prior benchmarks, which suffer from fragmented modality coverage and sparse annotations, IVY-FAKE contains over 150,000 richly annotated training samples (images and videos) and 18,700 evaluation examples, each accompanied by detailed natural-language reasoning that goes beyond simple binary labels. Building on this, we propose the Ivy Explainable Detector (IVY-XDETECTOR), a unified architecture that jointly performs explainable AIGC detection for both image and video content. Our unified vision-language model achieves state-of-the-art performance across multiple image and video detection benchmarks, highlighting the significant advances enabled by our dataset and modeling framework. Our data is publicly available at https://huggingface.co/datasets/AI-Safeguard/Ivy-Fake.