Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts

Abstract

We address the challenge of detecting synthesized speech under distribution shifts relative to the training data, which arise from unseen synthesis methods, speakers, languages, or audio conditions. Few-shot learning methods are a promising way to tackle such shifts by rapidly adapting on the basis of a few in-distribution samples. We propose a self-attentive prototypical network to enable more robust few-shot adaptation. To evaluate our approach, we systematically compare the performance of traditional zero-shot detectors and the proposed few-shot detectors, carefully controlling training conditions to introduce distribution shifts at evaluation time. In conditions where distribution shifts hamper zero-shot performance, our proposed few-shot adaptation technique can quickly adapt using as few as 10 in-distribution samples, achieving up to 32% relative EER (equal error rate) reduction on Japanese-language deepfakes and a 20% relative reduction on the ASVspoof 2021 Deepfake dataset.
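As a rough illustration of the prototype-based few-shot scoring and the "relative reduction" metric mentioned above, the sketch below may help. It assumes utterance embeddings have already been extracted, and it uses a plain mean to form each class prototype rather than the self-attentive aggregation proposed here. All function names, embeddings, and EER values are hypothetical and chosen only for illustration; they are not results from this work.

```python
def mean_prototype(embeddings):
    """Average equal-length embedding vectors into a single class prototype."""
    dim = len(embeddings[0])
    return [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dim)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def spoof_score(query, real_support, spoof_support):
    """Positive when the query is closer to the spoof prototype than to the real one."""
    real_proto = mean_prototype(real_support)
    spoof_proto = mean_prototype(spoof_support)
    return squared_distance(query, real_proto) - squared_distance(query, spoof_proto)


def relative_reduction(baseline_eer, adapted_eer):
    """Fraction by which the adapted EER improves on the baseline EER."""
    return (baseline_eer - adapted_eer) / baseline_eer


# Hypothetical 2-D embeddings standing in for a small in-distribution support set.
real_support = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0]]
spoof_support = [[1.0, 1.0], [0.8, 1.2], [1.2, 0.8]]

print(spoof_score([0.9, 0.9], real_support, spoof_support))  # query near the spoof cluster
print(spoof_score([0.0, 0.0], real_support, spoof_support))  # query near the real cluster

# Hypothetical EERs: a drop from 10% to 6.8% is a 32% relative reduction,
# even though the absolute change is only 3.2 percentage points.
print(relative_reduction(0.10, 0.068))
```

The point of the last call is that the reported percentages are relative to the zero-shot baseline, so they should not be read as absolute changes in EER.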
