Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs
September 26, 2025
Authors: Xingyu Fu, Siyi Liu, Yinuo Xu, Pan Lu, Guangqiuse Hu, Tianbo Yang, Taran Anantasagar, Christopher Shen, Yikai Mao, Yuanzhe Liu, Keyush Shah, Chung Un Lee, Yejin Choi, James Zou, Dan Roth, Chris Callison-Burch
cs.AI
Abstract
Can humans identify AI-generated (fake) videos and provide grounded reasons?
While video generation models have advanced rapidly, a critical dimension --
whether humans can detect deepfake traces within a generated video, i.e.,
spatiotemporal grounded visual artifacts that reveal a video as machine
generated -- has been largely overlooked. We introduce DeeptraceReward, the
first fine-grained, spatially- and temporally-aware benchmark that annotates
human-perceived fake traces for video generation reward. The dataset comprises
4.3K detailed annotations across 3.3K high-quality generated videos. Each
annotation provides a natural-language explanation, pinpoints a bounding-box
region containing the perceived trace, and marks precise onset and offset
timestamps. We consolidate these annotations into 9 major categories of
deepfake traces that lead humans to identify a video as AI-generated, and train
multimodal language models (LMs) as reward models to mimic human judgments and
localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by
34.7% on average across fake clue identification, grounding, and explanation.
Interestingly, we observe a consistent difficulty gradient: binary fake vs.
real classification is substantially easier than fine-grained deepfake trace
detection; within the latter, performance degrades from natural language
explanations (easiest), to spatial grounding, to temporal labeling (hardest).
By foregrounding human-perceived deepfake traces, DeeptraceReward provides a
rigorous testbed and training signal for socially aware and trustworthy video
generation.
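The abstract describes each annotation as a natural-language explanation plus a bounding box and onset/offset timestamps, grouped into one of 9 trace categories. As a rough illustration of what such a record might look like, here is a minimal sketch; the field names, bbox convention, and example category are assumptions for exposition, not the dataset's actual schema:

```python
from dataclasses import dataclass


@dataclass
class TraceAnnotation:
    """One human annotation of a perceived deepfake trace in a generated video.

    Field names and conventions are hypothetical, chosen to mirror the
    components described in the abstract (explanation, box, timestamps).
    """
    video_id: str      # identifier of the generated video
    category: str      # one of the 9 consolidated trace categories (assumed label)
    explanation: str   # natural-language reason the clip looks AI-generated
    bbox: tuple        # (x, y, width, height) in pixels, locating the trace
    onset_s: float     # timestamp (seconds) where the trace first appears
    offset_s: float    # timestamp (seconds) where the trace disappears

    def duration(self) -> float:
        """Length of the temporal span containing the trace, in seconds."""
        return self.offset_s - self.onset_s


# Hypothetical example record
ann = TraceAnnotation(
    video_id="vid_0001",
    category="hand distortion",
    explanation="Fingers merge together as the hand closes.",
    bbox=(120, 80, 64, 64),
    onset_s=1.2,
    offset_s=2.0,
)
```

A reward model trained on such records would be scored separately on the explanation, the spatial box, and the temporal span, which matches the abstract's observed difficulty gradient across those three outputs.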