Video Reality Test: Can AI-Generated ASMR Videos Fool VLMs and Humans?
December 15, 2025
Authors: Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu, James Cheng, Wei Liu, Philip Torr, Kevin Qinghong Lin
cs.AI
Abstract
Recent advances in video generation have produced vivid content that is often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus solely on classification. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: (i) Immersive ASMR video-audio sources. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. (ii) Peer-review evaluation. An adversarial creator-reviewer protocol in which video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers tasked with identifying fakes. Our experiments show that the best creator, Veo3.1-Fast, fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (versus a 50% random baseline), far below human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose the limitations of VLMs in assessing perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.
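To make the creator-reviewer scoring concrete, the sketch below shows how reviewer accuracy against a balanced real/generated test set could be computed and compared to the 50% random baseline cited above. This is a minimal illustration under assumed names, not the released evaluation code: `Clip`, `review_accuracy`, and `random_reviewer` are hypothetical, and a real reviewer would query a VLM (e.g., Gemini 2.5-Pro) with the clip and, optionally, its audio track rather than guess.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    """A benchmark item: a video-audio pair plus its ground-truth label."""
    path: str
    is_real: bool  # True for a curated real ASMR clip, False for a generated one


def review_accuracy(reviewer: Callable[[str], bool], clips: List[Clip]) -> float:
    """Fraction of clips the reviewer labels correctly (real vs. generated)."""
    correct = sum(reviewer(clip.path) == clip.is_real for clip in clips)
    return correct / len(clips)


def random_reviewer(path: str) -> bool:
    """Placeholder reviewer that guesses at random -- the 50% baseline.

    A real reviewer would send the clip to a VLM and parse its real/fake verdict.
    """
    return random.random() < 0.5


if __name__ == "__main__":
    # Hypothetical balanced test set: half real clips, half clips from a "creator" model.
    clips = [Clip(f"real_{i}.mp4", True) for i in range(50)] + \
            [Clip(f"fake_{i}.mp4", False) for i in range(50)]
    print(f"random baseline accuracy: {review_accuracy(random_reviewer, clips):.2%}")
```

Swapping `random_reviewer` for a VLM-backed function (and, separately, for human annotators) is all that is needed to reproduce the kind of accuracy comparison reported in the abstract.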