Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts
August 18, 2025
Authors: Ashi Garg, Zexin Cai, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews
cs.AI
Abstract
We address the challenge of detecting synthesized speech under distribution shifts -- arising from synthesis methods, speakers, languages, or audio conditions unseen in the training data. Few-shot learning methods are a promising way to tackle distribution shifts by rapidly adapting on the basis of a few in-distribution samples. We propose a self-attentive prototypical network to enable more robust few-shot adaptation. To evaluate our approach, we systematically compare the performance of traditional zero-shot detectors and the proposed few-shot detectors, carefully controlling training conditions to introduce distribution shifts at evaluation time. In conditions where distribution shifts hamper zero-shot performance, our proposed few-shot adaptation technique can quickly adapt using as few as 10 in-distribution samples -- achieving up to a 32% relative EER reduction on Japanese-language deepfakes and a 20% relative reduction on the ASVspoof 2021 deepfake dataset.