Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs
September 26, 2025
作者: Xingyu Fu, Siyi Liu, Yinuo Xu, Pan Lu, Guangqiuse Hu, Tianbo Yang, Taran Anantasagar, Christopher Shen, Yikai Mao, Yuanzhe Liu, Keyush Shah, Chung Un Lee, Yejin Choi, James Zou, Dan Roth, Chris Callison-Burch
cs.AI
Abstract
Can humans identify AI-generated (fake) videos and provide grounded reasons?
While video generation models have advanced rapidly, a critical dimension --
whether humans can detect deepfake traces within a generated video, i.e.,
spatiotemporal grounded visual artifacts that reveal a video as machine
generated -- has been largely overlooked. We introduce DeeptraceReward, the
first fine-grained, spatially- and temporally-aware benchmark that annotates
human-perceived fake traces for video generation reward. The dataset comprises
4.3K detailed annotations across 3.3K high-quality generated videos. Each
annotation provides a natural-language explanation, pinpoints a bounding-box
region containing the perceived trace, and marks precise onset and offset
timestamps. We consolidate these annotations into 9 major categories of
deepfake traces that lead humans to identify a video as AI-generated, and train
multimodal language models (LMs) as reward models to mimic human judgments and
localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by
34.7% on average across fake clue identification, grounding, and explanation.
Interestingly, we observe a consistent difficulty gradient: binary fake vs.
real classification is substantially easier than fine-grained deepfake trace
detection; within the latter, performance degrades from natural language
explanations (easiest), to spatial grounding, to temporal labeling (hardest).
By foregrounding human-perceived deepfake traces, DeeptraceReward provides a
rigorous testbed and training signal for socially aware and trustworthy video
generation.
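As a concrete illustration of the annotation schema the abstract describes (a natural-language explanation, a bounding-box region, and onset/offset timestamps per perceived trace), the sketch below shows what one record might look like. All field names, the dataclass, and the example values are assumptions for illustration; the paper's actual data format may differ.

```python
from dataclasses import dataclass

@dataclass
class FakeTraceAnnotation:
    """One human annotation of a perceived deepfake trace.

    Field names are hypothetical, modeled on the components named in the
    abstract: explanation, bounding box, and onset/offset timestamps.
    """
    video_id: str
    category: str            # one of the 9 consolidated trace categories
    explanation: str         # natural-language reason the region looks AI-generated
    bbox: tuple             # (x, y, width, height), assumed normalized to [0, 1]
    onset_s: float           # timestamp where the trace first appears, in seconds
    offset_s: float          # timestamp where the trace ends, in seconds

    def duration(self) -> float:
        """Length of the temporal span containing the trace."""
        return self.offset_s - self.onset_s

# Invented example record, not from the dataset:
ann = FakeTraceAnnotation(
    video_id="vid_0001",
    category="hand/finger distortion",
    explanation="The subject's fingers merge and split between frames.",
    bbox=(0.42, 0.55, 0.18, 0.12),
    onset_s=1.6,
    offset_s=2.9,
)
```

The difficulty gradient reported in the abstract maps onto these fields: predicting `explanation` is easiest, `bbox` harder, and the `onset_s`/`offset_s` pair hardest.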