Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
March 16, 2026
Authors: Omer Ben Hayun, Roy Betser, Meir Yossef Levi, Levi Kassel, Guy Gilboa
cs.AI
Abstract
Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at https://omerbenhayun.github.io/stall-video.
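The abstract describes scoring a video against real-data statistics by jointly combining spatial (per-frame) and temporal evidence into a likelihood-based score. The sketch below is only an illustration of that general idea, not the actual STALL method: the feature extractor, the diagonal-Gaussian reference statistics, the frame-difference temporal proxy, and the `alpha`-weighted fusion are all assumptions for demonstration.

```python
import numpy as np

# Hypothetical sketch of zero-shot, likelihood-based video scoring in the
# spirit of the abstract. STALL's actual statistics and fusion rule are not
# specified here; everything below is an illustrative assumption.

def fit_real_statistics(real_features):
    """Fit a diagonal Gaussian to features of real data (reference stats)."""
    mu = real_features.mean(axis=0)
    var = real_features.var(axis=0) + 1e-6  # avoid zero variance
    return mu, var

def gaussian_log_likelihood(x, mu, var):
    """Per-sample log-likelihood under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)

def score_video(frames_feat, mu_s, var_s, mu_t, var_t, alpha=0.5):
    """Combine spatial (per-frame) and temporal (frame-difference) evidence.

    frames_feat: (T, D) array of per-frame features.
    Returns a single score; higher = more consistent with real-data statistics.
    """
    spatial = gaussian_log_likelihood(frames_feat, mu_s, var_s).mean()
    temporal_feat = np.diff(frames_feat, axis=0)        # simple temporal proxy
    temporal = gaussian_log_likelihood(temporal_feat, mu_t, var_t).mean()
    return alpha * spatial + (1 - alpha) * temporal     # joint likelihood score

# Toy usage with synthetic stand-in features (not real video data).
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))                        # "real" frame features
mu_s, var_s = fit_real_statistics(real)
mu_t, var_t = fit_real_statistics(np.diff(real, axis=0))
candidate = rng.normal(size=(16, 8))                    # one candidate video
print(score_video(candidate, mu_s, var_s, mu_t, var_t))
```

Because the detector only needs statistics of real data, no synthetic training set or generator-specific supervision is involved, which is what makes this family of approaches training-free and model-agnostic.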