Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts
August 18, 2025
Authors: Ashi Garg, Zexin Cai, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews
cs.AI
Abstract
We address the challenge of detecting synthesized speech under distribution shifts -- arising from synthesis methods, speakers, languages, or audio conditions unseen in the training data. Few-shot learning methods are a promising way to tackle distribution shifts by rapidly adapting on the basis of a few in-distribution samples. We propose a self-attentive prototypical network to enable more robust few-shot adaptation. To evaluate our approach, we systematically compare the performance of traditional zero-shot detectors and the proposed few-shot detectors, carefully controlling training conditions to introduce distribution shifts at evaluation time. In conditions where distribution shifts hamper zero-shot performance, our proposed few-shot adaptation technique can quickly adapt using as few as 10 in-distribution samples -- achieving up to a 32% relative EER reduction on Japanese-language deepfakes and a 20% relative reduction on the ASVspoof 2021 deepfake dataset.