What Matters in Detecting AI-Generated Videos like Sora?

Abstract

Recent advances in diffusion-based video generation have produced remarkable results, yet the gap between synthetic and real-world videos remains under-explored. In this study, we examine this gap from three fundamental perspectives: appearance, motion, and geometry, comparing real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion. To this end, we train three classifiers using 3D convolutional networks, each targeting a distinct aspect: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Each classifier performs strongly in fake video detection, both qualitatively and quantitatively. This indicates that AI-generated videos are still easily detectable and that a significant gap between real and fake videos persists. Furthermore, using Grad-CAM, we pinpoint systematic failures of AI-generated videos in appearance, motion, and geometry. Finally, we propose an Ensemble-of-Experts model that integrates appearance, optical flow, and depth information for fake video detection, resulting in enhanced robustness and generalization ability. Our model detects videos generated by Sora with high accuracy, even without exposure to any Sora videos during training. This suggests that the gap between real and fake videos generalizes across various video generative models. Project page: https://justin-crchang.github.io/3DCNNDetection.github.io/
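As a rough illustration of the Ensemble-of-Experts idea, the sketch below fuses the outputs of three hypothetical experts (appearance, motion, geometry) into a single fake-video score. The abstract does not say how the experts are combined, so the weighted average, the function names, the expert keys, and the 0.5 threshold are all assumptions made for illustration, not the authors' method. Each expert is assumed to have already produced a probability in [0, 1] that the clip is fake.

```python
from typing import Mapping, Optional


def ensemble_fake_probability(
    expert_probs: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Fuse per-expert fake probabilities with a weighted average.

    ASSUMPTION: averaging is an illustrative late-fusion rule; the
    abstract does not specify the actual fusion used in the paper.
    """
    if not expert_probs:
        raise ValueError("need at least one expert score")
    if weights is None:
        # Default: every expert contributes equally.
        weights = {name: 1.0 for name in expert_probs}
    total = sum(weights[name] for name in expert_probs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * p for name, p in expert_probs.items()) / total


def is_fake(expert_probs: Mapping[str, float], threshold: float = 0.5) -> bool:
    """Flag a clip as fake when the fused score reaches the threshold.

    ASSUMPTION: the 0.5 threshold is a placeholder, not a tuned value.
    """
    return ensemble_fake_probability(expert_probs) >= threshold


if __name__ == "__main__":
    # Hypothetical scores from the three experts for one video clip.
    scores = {"appearance": 0.9, "motion": 0.6, "geometry": 0.3}
    print(ensemble_fake_probability(scores))
    print(is_fake(scores))
```

A fusion of this kind lets one confident expert compensate for another that is fooled by a given generator, which is one plausible reading of why combining the three cues could improve robustness.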
