Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods

Abstract

Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at https://omerbenhayun.github.io/stall-video.
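To make the idea of scoring a video against real-data statistics concrete, the following is a minimal illustrative sketch and not STALL's actual formulation, which the abstract does not specify. It assumes each video is already a sequence of per-frame embedding vectors, models frame embeddings (spatial evidence) and consecutive-frame differences (temporal evidence) with diagonal Gaussians fitted on real videos only, and sums the two average log-likelihoods. All function names, the diagonal Gaussian choice, and the random-walk toy data are assumptions made for illustration.

```python
import math
import random

def fit_diag_gaussian(vectors):
    """Per-dimension mean and variance of a list of equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    # Small constant keeps the variance strictly positive.
    var = [sum((v[j] - mean[j]) ** 2 for v in vectors) / n + 1e-6 for j in range(d)]
    return mean, var

def log_likelihood(x, mean, var):
    """Log-density of x under a diagonal Gaussian (an assumed density model)."""
    return sum(-0.5 * (math.log(2 * math.pi * s) + (xi - m) ** 2 / s)
               for xi, m, s in zip(x, mean, var))

def frame_deltas(video):
    """Differences between consecutive frame embeddings (temporal evidence)."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(video, video[1:])]

def fit_real_statistics(real_videos):
    """Fit spatial and temporal statistics on real videos only (no synthetic data)."""
    frames = [f for v in real_videos for f in v]
    deltas = [d for v in real_videos for d in frame_deltas(v)]
    return fit_diag_gaussian(frames), fit_diag_gaussian(deltas)

def video_score(video, spatial, temporal):
    """Mean spatial plus mean temporal log-likelihood; higher means more real-like."""
    s = sum(log_likelihood(f, *spatial) for f in video) / len(video)
    deltas = frame_deltas(video)
    t = sum(log_likelihood(d, *temporal) for d in deltas) / len(deltas)
    return s + t

def toy_video(rng, step_std, n_frames=8, dim=4):
    """Toy stand-in for frame embeddings: a random walk with the given step size."""
    frame = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    video = [frame]
    for _ in range(n_frames - 1):
        frame = [x + rng.gauss(0.0, step_std) for x in frame]
        video.append(frame)
    return video

rng = random.Random(0)
real_videos = [toy_video(rng, step_std=0.1) for _ in range(200)]
spatial, temporal = fit_real_statistics(real_videos)

smooth = toy_video(rng, step_std=0.1)    # motion statistics match the real set
jittery = toy_video(rng, step_std=1.0)   # frame-to-frame changes far too large
print(video_score(smooth, spatial, temporal), video_score(jittery, spatial, temporal))
```

In this toy setting the jittery sequence receives a much lower score, and most of the gap comes from the temporal term. A detector that scores frames independently would not compute that term at all, which is the limitation of image-based detectors that the abstract points out.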

