SNAP: Speaker Nulling for Artifact Projection in Speech Deepfake Detection

Abstract

Recent advances in text-to-speech technology make it possible to generate high-fidelity synthetic speech that is nearly indistinguishable from real human voices. Recent studies show that self-supervised learning-based speech encoders are effective for deepfake detection, but these models struggle to generalize to speakers not seen during training. Our quantitative analysis suggests that these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art performance.
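
The subspace-and-projection idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the speaker subspace is estimated, so the sketch assumes it is spanned by the top principal directions of the centered per-speaker mean embeddings. The function names `estimate_speaker_subspace` and `null_speaker`, the `rank` parameter, and the random stand-in features are all hypothetical.

```python
import numpy as np


def estimate_speaker_subspace(features, speaker_ids, rank):
    """Return an orthonormal basis (dim x rank) for an estimated speaker subspace.

    Assumption (not from the paper): the subspace is spanned by the top
    principal directions of the centered per-speaker mean embeddings.
    """
    speakers = np.unique(speaker_ids)
    means = np.stack([features[speaker_ids == s].mean(axis=0) for s in speakers])
    centered = means - means.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal right singular vectors, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T


def null_speaker(features, basis):
    """Project features onto the orthogonal complement of span(basis).

    Equivalent to applying P = I - U U^T to each row, with U = basis.
    """
    return features - (features @ basis) @ basis.T


# Illustrative data only: random vectors standing in for encoder embeddings.
rng = np.random.default_rng(0)
features = rng.normal(size=(40, 8))
speaker_ids = np.repeat(np.arange(4), 10)  # 4 hypothetical speakers, 10 utterances each

basis = estimate_speaker_subspace(features, speaker_ids, rank=2)
residual = null_speaker(features, basis)  # features with the estimated speaker directions removed
```

Because the basis is orthonormal, the residual has no component along the estimated speaker directions, and applying the projection a second time leaves it unchanged. A downstream detector would then be trained on `residual` rather than on the raw features.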
