Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
March 16, 2026
Authors: Omer Ben Hayun, Roy Betser, Meir Yossef Levi, Levi Kassel, Guy Gilboa
cs.AI
Abstract
Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at https://omerbenhayun.github.io/stall-video.