What Matters in Detecting AI-Generated Videos like Sora?

Abstract

Recent advances in diffusion-based video generation have produced remarkable results, yet the gap between synthetic and real-world videos remains under-explored. In this study, we examine this gap from three fundamental perspectives: appearance, motion, and geometry, comparing real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion. To do so, we train three classifiers based on 3D convolutional networks, each targeting a distinct aspect: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Each classifier performs strongly in fake video detection, both qualitatively and quantitatively. This indicates that AI-generated videos are still easily detectable and that a significant gap between real and fake videos persists. Furthermore, using Grad-CAM, we pinpoint systematic failures of AI-generated videos in appearance, motion, and geometry. Finally, we propose an Ensemble-of-Experts model that integrates appearance, optical flow, and depth information for fake video detection, improving robustness and generalization. Our model detects videos generated by Sora with high accuracy, even without exposure to any Sora videos during training. This suggests that the gap between real and fake videos generalizes across video generative models. Project page: https://justin-crchang.github.io/3DCNNDetection.github.io/
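
To make the Ensemble-of-Experts idea concrete, the sketch below combines three per-video fake probabilities (appearance, motion, geometry) into a single decision. The abstract does not say how the experts are fused, so the weighted average used here is an assumption, and the names `ExpertScores`, `ensemble_fake_probability`, and `is_fake`, as well as the 0.5 threshold, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExpertScores:
    """Hypothetical per-video fake probabilities, one per expert."""
    appearance: float  # expert on vision foundation model features
    motion: float      # expert on optical flow
    geometry: float    # expert on monocular depth


def ensemble_fake_probability(scores, weights=(1.0, 1.0, 1.0)):
    """Fuse the three expert scores with a weighted average.

    Assumption: the paper's actual fusion rule is not given in the
    abstract; a weighted mean is used here purely as an illustration.
    """
    values = (scores.appearance, scores.motion, scores.geometry)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("expert scores must be probabilities in [0, 1]")
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def is_fake(scores, threshold=0.5, weights=(1.0, 1.0, 1.0)):
    """Flag a video as fake when the fused probability reaches the threshold."""
    return ensemble_fake_probability(scores, weights) >= threshold
```

For example, `is_fake(ExpertScores(appearance=0.9, motion=0.6, geometry=0.3))` fuses the three scores with equal weights before thresholding. Under this assumed scheme, a video that fools one expert can still be caught by the other two, which is one plausible reading of why combining modalities would improve robustness.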
