ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors
January 5, 2026
Authors: Kaede Shiohara, Toshihiko Yamasaki, Vladislav Golyanik
cs.AI
Abstract
Detecting unknown deepfake manipulations remains one of the most challenging problems in face forgery detection. Current state-of-the-art approaches fail to generalize to unseen manipulations because they rely primarily on supervised training with existing deepfakes or pseudo-fakes, which leads to overfitting to specific forgery patterns. Self-supervised methods, in contrast, offer greater potential for generalization, but existing work struggles to learn discriminative representations from self-supervision alone. In this paper, we propose ExposeAnyone, a fully self-supervised approach based on a diffusion model that generates expression sequences from audio. The key idea is that once the model has been personalized to specific subjects using reference sets, it can compute identity distances between suspect videos and the personalized subjects via diffusion reconstruction errors, enabling person-of-interest face forgery detection. Extensive experiments demonstrate that 1) our method outperforms the previous state of the art by 4.22 percentage points in average AUC on the DF-TIMIT, DFDCP, KoDF, and IDForge datasets, 2) our model can also detect Sora2-generated videos, on which previous approaches perform poorly, and 3) our method is highly robust to corruptions such as blur and compression, highlighting its applicability to real-world face forgery detection.
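The detection recipe in the abstract — personalize a generative model to a subject from reference clips, then score suspect videos by how poorly the personalized model reconstructs them — can be sketched in miniature. The sketch below is a hypothetical simplification, not the paper's method: it replaces the audio-to-expression diffusion model with a per-subject linear expression subspace, and the diffusion reconstruction error with a subspace reconstruction error. All function names (`personalize`, `identity_distance`, `is_forged`) and the threshold value are illustrative assumptions.

```python
import numpy as np

def personalize(reference_clips, n_components=4):
    """Fit a per-subject expression model from reference clips.

    Stand-in for personalizing the audio-to-expression diffusion model:
    here we simply keep the top principal directions of the subject's
    expression vectors (a hypothetical simplification).
    """
    X = np.concatenate(reference_clips, axis=0)  # (total_frames, dims)
    mean = X.mean(axis=0)
    # SVD of the centered data yields the principal expression directions.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def identity_distance(clip, model):
    """Reconstruction error of a suspect clip under the personalized
    model -- analogous in spirit to the diffusion reconstruction error."""
    mean, basis = model
    centered = clip - mean
    recon = centered @ basis.T @ basis  # project onto the subject subspace
    return float(np.mean((centered - recon) ** 2))

def is_forged(clip, model, threshold=0.5):
    """Flag the clip as fake when its identity distance is large.

    The threshold is illustrative; in practice it would be calibrated
    on held-out genuine clips of the subject.
    """
    return identity_distance(clip, model) > threshold

# Toy usage: genuine clips live in the subject's expression subspace,
# while a "forged" clip does not, so its reconstruction error is larger.
rng = np.random.default_rng(0)
true_basis = rng.normal(size=(4, 16))          # subject's latent expression basis
refs = [rng.normal(size=(50, 4)) @ true_basis for _ in range(3)]
model = personalize(refs)
real_clip = rng.normal(size=(30, 4)) @ true_basis
fake_clip = rng.normal(size=(30, 16)) * 2      # off-subspace dynamics
```

The design point this illustrates is that nothing in the scoring step needs labeled fake data: the detector only measures distance to a model of the genuine subject, which is why the approach can remain fully self-supervised.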