SNAP: Speaker Nulling for Artifact Projection in Speech Deepfake Detection

Abstract

Recent advances in text-to-speech technology make it possible to generate high-fidelity synthetic speech that is nearly indistinguishable from real human voices. Although recent studies show that speech encoders based on self-supervised learning are effective for deepfake detection, these models struggle to generalize to unseen speakers. Our quantitative analysis suggests that the encoder representations are substantially influenced by speaker information, which leads detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply an orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art performance.
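The projection step can be illustrated with a minimal sketch. The abstract does not say how the speaker subspace is estimated, so the code below assumes one plausible choice: the top principal directions of the centered per-speaker mean embeddings. The function names, the `rank` parameter, and the use of NumPy are illustrative assumptions, not the authors' implementation. Given an orthonormal basis U for the assumed speaker subspace, the residual of a feature vector x is r = (I - U Uᵀ) x.

```python
import numpy as np


def estimate_speaker_subspace(features, speaker_ids, rank):
    """Return an orthonormal basis of shape (dim, rank) for a speaker subspace.

    Assumption (not stated in the abstract): the subspace is spanned by the
    top principal directions of the centered per-speaker mean embeddings.
    `rank` should not exceed the number of speakers or the feature dimension.
    """
    speakers = np.unique(speaker_ids)
    # One mean embedding per speaker, stacked as rows.
    means = np.stack([features[speaker_ids == s].mean(axis=0) for s in speakers])
    centered = means - means.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal right singular vectors, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T


def null_speaker(features, basis):
    """Project row-vector features onto the orthogonal complement of span(basis)."""
    # Row-wise form of r = x - U U^T x.
    return features - (features @ basis) @ basis.T
```

In this sketch, `features` would be an array of utterance-level encoder embeddings with shape `(num_utterances, dim)` and `speaker_ids` an integer array of the same length. The output of `null_speaker` has no component along the estimated speaker directions, and a downstream detector would be trained on these residual features.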
