Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

Abstract

Can humans identify AI-generated (fake) videos and provide grounded reasons? While video generation models have advanced rapidly, a critical dimension has been largely overlooked: whether humans can detect deepfake traces within a generated video, i.e., spatiotemporally grounded visual artifacts that reveal a video as machine-generated. We introduce DeeptraceReward, the first fine-grained, spatially and temporally aware benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. Each annotation provides a natural-language explanation, pinpoints a bounding-box region containing the perceived trace, and marks precise onset and offset timestamps. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated, and we train multimodal language models (LMs) as reward models to mimic human judgments and localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Interestingly, we observe a consistent difficulty gradient: binary fake vs. real classification is substantially easier than fine-grained deepfake trace detection; within the latter, performance degrades from natural-language explanation (easiest), to spatial grounding, to temporal labeling (hardest). By foregrounding human-perceived deepfake traces, DeeptraceReward provides a rigorous testbed and training signal for socially aware and trustworthy video generation.
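The annotation format described above (an explanation, a bounding box, and onset/offset timestamps per trace) can be pictured with a minimal sketch. All field names, the coordinate convention, and the time unit below are assumptions made for illustration and are not the released dataset schema. The abstract also does not state which metrics score grounding, so the intersection-over-union helpers show one common way to compare a predicted region or time span with a human annotation, not necessarily the benchmark's own measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceAnnotation:
    """One human-perceived fake trace (hypothetical schema, for illustration only)."""
    video_id: str
    explanation: str   # natural-language description of the perceived trace
    category: str      # one of the 9 trace categories (names are not given in the abstract)
    bbox: tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)
    onset_s: float     # assumed unit: seconds from the start of the video
    offset_s: float

    def __post_init__(self):
        x0, y0, x1, y1 = self.bbox
        if not (x0 < x1 and y0 < y1):
            raise ValueError("bbox must have positive area")
        if not (0 <= self.onset_s < self.offset_s):
            raise ValueError("onset must be non-negative and precede offset")


def box_iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def temporal_iou(a_start, a_end, b_start, b_end):
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


# Example values are invented, not taken from the dataset.
human = TraceAnnotation("vid_0001", "The hand gains a sixth finger.", "example-category",
                        (0.0, 0.0, 2.0, 2.0), 0.0, 2.0)
spatial_overlap = box_iou(human.bbox, (1.0, 1.0, 3.0, 3.0))
temporal_overlap = temporal_iou(human.onset_s, human.offset_s, 1.0, 3.0)
```

Keeping the spatial and temporal scores separate mirrors the abstract's observation that spatial grounding and temporal labeling differ in difficulty, so a reward model's output can be inspected along each axis independently.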
