What Matters in Detecting AI-Generated Videos like Sora?

Abstract

Recent advances in diffusion-based video generation have produced remarkable results, yet the gap between synthetic and real-world videos remains under-explored. In this study, we examine this gap from three fundamental perspectives: appearance, motion, and geometry, comparing real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion. To do so, we train three classifiers using 3D convolutional networks, each targeting a distinct aspect: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Each classifier exhibits strong performance in fake video detection, both qualitatively and quantitatively. This indicates that AI-generated videos are still easily detectable and that a significant gap between real and fake videos persists. Furthermore, using Grad-CAM, we pinpoint systematic failures of AI-generated videos in appearance, motion, and geometry. Finally, we propose an Ensemble-of-Experts model that integrates appearance, optical flow, and depth information for fake video detection, resulting in enhanced robustness and generalization ability. Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training. This suggests that the gap between real and fake videos generalizes across various video generative models. Project page: https://justin-crchang.github.io/3DCNNDetection.github.io/
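The abstract does not state how the Ensemble-of-Experts combines its three classifiers. As a minimal sketch, assuming each expert outputs a probability that a clip is fake and that the scores are fused by plain averaging (an assumption for illustration, not the authors' confirmed method), the decision step might look like the following. The function names, the 0.5 threshold, and the example scores are all hypothetical.

```python
from statistics import mean


def ensemble_fake_probability(expert_probs):
    """Average the per-expert fake probabilities.

    expert_probs maps an expert name (e.g. "appearance", "motion",
    "geometry") to that expert's probability that the video is fake.
    Averaging is an assumed fusion rule, not taken from the paper.
    """
    if not expert_probs:
        raise ValueError("at least one expert score is required")
    for name, prob in expert_probs.items():
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"score for {name!r} must lie in [0, 1]")
    return mean(expert_probs.values())


def is_fake(expert_probs, threshold=0.5):
    """Flag the video as fake when the averaged score reaches the threshold."""
    return ensemble_fake_probability(expert_probs) >= threshold


# Hypothetical scores from the three experts for one video clip.
scores = {"appearance": 0.9, "motion": 0.6, "geometry": 0.3}
print(ensemble_fake_probability(scores), is_fake(scores))
```

Averaging lets a confident expert outweigh an uncertain one, which is one plausible reason an ensemble could be more robust than any single cue. A learned fusion layer would be an equally reasonable alternative.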
