
Video Reality Test: Can AI-Generated ASMR Videos Fool VLMs and Humans?

December 15, 2025
Authors: Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu, James Cheng, Wei Liu, Philip Torr, Kevin Qinghong Lin
cs.AI

Abstract

Recent advances in video generation have produced vivid content that is often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus solely on classification. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and vision-language models (VLMs). To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: (i) Immersive ASMR video-audio sources. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. (ii) Peer-review evaluation. An adversarial creator-reviewer protocol in which video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers tasked with identifying fake content. Our experiments show that the best creator, Veo3.1-Fast, fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random baseline: 50%), far below human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.
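As a rough illustration of how the adversarial creator-reviewer protocol can be scored, the minimal Python sketch below mixes real clips with clips from a generator, asks a reviewer for a real/fake verdict on each, and reports accuracy against the 50% random baseline. The `Clip` dataclass, `reviewer_accuracy` helper, and file names are hypothetical stand-ins introduced here for illustration, not the benchmark's actual API; in a real run the reviewer callable would prompt a VLM with the video and its audio track (see the linked repository for the paper's implementation).

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    path: str       # path to an audio-paired video clip
    is_real: bool   # ground truth: True for real ASMR, False for generated


def reviewer_accuracy(clips: List[Clip],
                      reviewer: Callable[[str], bool]) -> float:
    """Fraction of clips whose real/fake verdict matches the ground truth.

    `reviewer` maps a clip path to True ("real") or False ("fake"),
    e.g. by prompting a VLM reviewer with the clip and its audio.
    """
    correct = sum(reviewer(c.path) == c.is_real for c in clips)
    return correct / len(clips)


if __name__ == "__main__":
    # Toy pool: half real clips, half clips from a hypothetical creator model.
    pool = [Clip(f"real_{i}.mp4", True) for i in range(50)] + \
           [Clip(f"gen_{i}.mp4", False) for i in range(50)]

    # Stand-in reviewer that guesses at random; a real reviewer would be a VLM.
    random_reviewer = lambda path: random.random() < 0.5

    acc = reviewer_accuracy(pool, random_reviewer)
    print(f"reviewer accuracy: {acc:.2%} "
          f"(random baseline ~50%; paper reports 56% for the best VLM reviewer "
          f"and 81.25% for human experts)")
```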